Haskell Data Analysis cookbook
Haskell Data Analysis cookbook: Explore intuitive data analysis techniques and powerful machine learning methods using over 130 practical recipes

Chapter 1. The Hunt for Data

In this chapter, we will cover the following recipes:

  • Harnessing data from various sources
  • Accumulating text data from a file path
  • Catching I/O code faults
  • Keeping and representing data from a CSV file
  • Examining a JSON file with the aeson package
  • Reading an XML file using the HXT package
  • Capturing table rows from an HTML page
  • Understanding how to perform HTTP GET requests
  • Learning how to perform HTTP POST requests
  • Traversing online directories for data
  • Using MongoDB queries in Haskell
  • Reading from a remote MongoDB server
  • Exploring data from a SQLite database

Introduction

Data is everywhere, logging is cheap, and analysis is inevitable. One of the most fundamental concepts of this chapter is based on gathering useful data. After building a large collection of usable text, which we call the corpus, we must learn to represent this content in code. The primary focus will be first on obtaining data and later on enumerating ways of representing it.

Gathering data is arguably as important as analyzing it to extrapolate results and form valid generalizable claims. It is a scientific pursuit; therefore, great care must and will be taken to ensure unbiased and representative sampling. We recommend following along closely in this chapter because the remainder of the book depends on having a source of data to work with. Without data, there isn't much to analyze, so we should carefully observe the techniques laid out to build our own formidable corpus.

The first recipe enumerates various sources to start gathering data online. The next few recipes deal with using local data of different file formats. We then learn how to download data from the Internet using our Haskell code. Finally, we finish this chapter with a couple of recipes on using databases in Haskell.

Harnessing data from various sources

Information can be described as structured, unstructured, or sometimes a mix of the two—semi-structured.

In a very general sense, structured data is anything that can be parsed by an algorithm. Common examples include JSON, CSV, and XML. If given structured data, we can design a piece of code to dissect the underlying format and easily produce useful results. As mining structured data is a deterministic process, it allows us to automate the parsing. This in effect lets us gather more input to feed our data analysis algorithms.

Unstructured data is everything else. It is data not defined in a specified manner. Written languages such as English are often regarded as unstructured because of the difficulty in parsing a data model out of a natural sentence.

In our search for good data, we will often find a mix of structured and unstructured text. This is called semi-structured text.

This recipe will primarily focus on obtaining structured and semi-structured data from the following sources.

Tip

Unlike most recipes in this book, this recipe does not contain any code. The best way to read this book is by skipping around to the recipes that interest you.

How to do it...

We will browse through the links provided in the following sections to build up a list of sources to harness interesting data in usable formats. However, this list is not at all exhaustive.

Some of these sources have an Application Programming Interface (API) that allows more sophisticated access to interesting data. An API specifies the interactions and defines how data is communicated.

News

The New York Times has some of the most polished API documentation, covering access to anything from real-estate data to article search results. This documentation can be found at http://developer.nytimes.com.

The Guardian also supports a massive datastore with over a million articles at http://www.theguardian.com/data.

USA TODAY provides some interesting resources on books, movies, and music reviews. The technical documentation can be found at http://developer.usatoday.com.

The BBC features some interesting API endpoints including information on BBC programs, and music located at http://www.bbc.co.uk/developer/technology/apis.html.

Private

Facebook, Twitter, Instagram, Foursquare, Tumblr, SoundCloud, Meetup, and many other social networking sites support APIs to access some degree of social information.

For specific APIs such as weather or sports, Mashape is a centralized search engine to narrow down the search to some lesser-known sources. Mashape is located at https://www.mashape.com/

Most data sources can be visualized using the Google Public Data search located at http://www.google.com/publicdata.

For a list of all countries with names in various data formats, refer to the repository located at https://github.com/umpirsky/country-list.

Academic

Some data sources are hosted openly by universities around the world for research purposes.

To analyze health care data, the University of Washington hosts the Institute for Health Metrics and Evaluation (IHME), which collects rigorous and comparable measurements of the world's most important health problems. Navigate to http://www.healthdata.org for more information.

The MNIST database of handwritten digits from NYU, Google Labs, and Microsoft Research is a training set of normalized and centered samples for handwritten digits. Download the data from http://yann.lecun.com/exdb/mnist.

Nonprofits

Human Development Reports publishes annual updates ranging from international data about adult literacy to the number of people owning personal computers. It describes itself as having a variety of public international sources and represents the most current statistics available for those indicators. More information is available at http://hdr.undp.org/en/statistics.

The World Bank is the source for poverty and world development data. It regards itself as a free source that enables open access to data about development in countries around the globe. Find more information at http://data.worldbank.org/.

The World Health Organization provides data and analyses for monitoring the global health situation. See more information at http://www.who.int/research/en.

UNICEF also releases interesting statistics, as the quote from their website suggests:

"The UNICEF database contains statistical tables for child mortality, diseases, water sanitation, and more vitals. UNICEF claims to play a central role in monitoring the situation of children and women—assisting countries in collecting and analyzing data, helping them develop methodologies and indicators, maintaining global databases, disseminating and publishing data. Find the resources at http://www.unicef.org/statistics."

The United Nations hosts interesting publicly available political statistics at http://www.un.org/en/databases.

The United States government

If we feel the urge to discover patterns in the United States (U.S.) government like Nicolas Cage did in the feature film National Treasure (2004), then http://www.data.gov/ is our go-to source. It's the U.S. government's active effort to provide useful data. It is described as a place to increase "public access to high-value, machine-readable datasets generated by the executive branch of the Federal Government".

The United States Census Bureau releases population counts, housing statistics, area measurements, and more. These can be found at http://www.census.gov.

Accumulating text data from a file path

One of the easiest ways to get started with processing input is by reading raw text from a local file. In this recipe, we will be extracting all the text from a specific file path. Furthermore, to do something interesting with the data, we will count the number of words per line.

Tip

Haskell is a purely functional programming language, right? Sure, but obtaining input from outside the code introduces impurity. For elegance and reusability, we must carefully separate pure from impure code.

Getting ready

We will first create an input.txt text file with a couple of lines of text to be read by the program. We keep this file in an easy-to-access directory because it will be referenced later. For example, the text file we're dealing with contains a six-line quote by Plato. Here's what our terminal prints when we issue the following command:

$ cat input.txt

And how will you inquire, Socrates,
into that which you know not? 
What will you put forth as the subject of inquiry? 
And if you find what you want, 
how will you ever know that 
this is what you did not know?

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code will also be hosted on GitHub at https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook.

How to do it...

Create a new file to start coding. We call our file Main.hs.

  1. As with all executable Haskell programs, start by defining and implementing the main function, as follows:
    main :: IO ()
    main = do
    
  2. Use Haskell's readFile :: FilePath -> IO String function to extract data from an input.txt file path. Note that a file path is just a synonym for String. With the string in memory, pass it into a countWords function to count the number of words in each line, as shown in the following steps:
      input <- readFile "input.txt"
      print $ countWords input
    
  3. Lastly, define our pure function, countWords, as follows:
    countWords :: String -> [Int]
    countWords input = map (length.words) (lines input)
    
  4. The program will print out the number of words per line represented as a list of numbers as follows:
    $ runhaskell Main.hs
    
    [6,6,10,7,6,7]
    

How it works...

Haskell provides useful input and output (I/O) capabilities for reading input and writing output in different ways. In our case, we use readFile to specify the path of a file to be read. The do keyword in main indicates that we are sequencing several IO actions. The result of readFile has the type IO String, which means it is an IO action that yields a String when run.

Now we're about to get a bit technical. Pay close attention. Alternatively, smile and nod. In Haskell, the IO type is an instance of a typeclass called Monad. This allows us to use the <- notation to draw the string out of this IO action. We then make use of the string by feeding it into our countWords function, which counts the number of words in each line. Notice how we kept the pure countWords function separate from the impure main function.

Finally, we print the output of countWords. The $ operator is function application with very low precedence, which lets us avoid excessive parentheses in our code. Without it, the last line of main would look like print (countWords input).
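Because countWords is pure, it can be tried on a string literal without touching the file system. The following is a minimal sketch; sample is a hypothetical two-line input and report is an illustrative helper, neither of which is part of the recipe:

```haskell
-- Pure core: count the words on each line of the input.
countWords :: String -> [Int]
countWords input = map (length . words) (lines input)

-- Hypothetical sample input: the first two lines of input.txt.
sample :: String
sample = "And how will you inquire, Socrates,\ninto that which you know not?\n"

-- Illustrative helper: both lines print the same value, because
-- `f $ x` is ordinary function application with very low precedence.
report :: IO ()
report = do
  print $ countWords sample
  print (countWords sample)
```

Loading this into GHCi and evaluating countWords sample exercises the logic with no IO at all; only report needs to run as an action.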

See also

For simplicity's sake, this code is easy to read but very fragile. If an input.txt file does not exist, then running the code will immediately crash the program. For example, the following command will generate the error message:

$ runhaskell Main.hs

Main.hs: input.txt: openFile: does not exist…

To make this code fault tolerant, refer to the Catching I/O code faults recipe.

Catching I/O code faults

Making sure our code doesn't crash in the process of data mining or analysis is a genuine concern. Some computations may take hours, if not days. Haskell gifts us with type safety and strong checks to help ensure a program will not fail, but we must also take care to double-check edge cases where faults may occur.

For instance, a program may crash ungracefully if the local file path is not found. In the previous recipe, there was a strong dependency on the existence of input.txt in our code. If the program is unable to find the file, it will produce the following error:

mycode: input.txt: openFile: does not exist (No such file or directory)

Naturally, we should decouple the file path dependency by letting the user specify a file path, and we should avoid crashing when the file is not found.

Consider the following revision of the source code.

How to do it…

Create a new file, name it Main.hs, and perform the following steps:

  1. First, import a library to catch fatal errors as follows:
    import Control.Exception (catch, SomeException)
  2. Next, import a library to get command-line arguments so that the file path is dynamic. We use the following line of code to do this:
    import System.Environment (getArgs)
  3. Continuing as before, define and implement main as follows:
    main :: IO ()
    main = do
  4. Define a fileName string depending on the user-provided argument, defaulting to input.txt if there is no argument. The argument is obtained by retrieving a list of strings from the library function, getArgs :: IO [String], as shown in the following code:
      args <- getArgs
      let fileName = case args of
            (a:_) -> a
            _     -> "input.txt"
  5. Now apply readFile on this path, but catch any errors using the library's catch :: Exception e => IO a -> (e -> IO a) -> IO a function. The first argument to catch is the computation to run, and the second argument is the handler to invoke if an exception is raised, as shown in the following commands:
      input <- catch (readFile fileName)
        $ \err -> print (err::SomeException) >> return ""
  6. The input string will be empty if there were any errors reading the file. We can now use input for any purpose using the following command:
      print $ countWords input
  7. Don't forget to define the countWords function as follows:
    countWords input = map (length.words) (lines input)

How it works…

This recipe demonstrates two ways to catch errors, listed as follows:

  • Firstly, we use a case expression that pattern matches against any argument passed in. If no arguments are passed, the args list is empty, so the last pattern, "_", matches, resulting in a default filename of input.txt.
  • Secondly, we use the catch function to handle an error if something goes wrong. When having trouble reading a file, we allow the code to continue running by setting input to an empty string.
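The two ideas above can also be packaged as small reusable functions. The following is a sketch of one possible variation, not the recipe's own code: it uses try from Control.Exception instead of catch, and it narrows the handled exception to IOException so that unrelated failures stay visible. The names pickFileName, readFileSafe, and readFileOrDefault are hypothetical:

```haskell
import Control.Exception (IOException, try)

-- Pure choice of path: the first command-line argument, else a default.
pickFileName :: [String] -> FilePath
pickFileName (a:_) = a
pickFileName _     = "input.txt"

-- Attempt the read and report a failure as a value instead of crashing.
readFileSafe :: FilePath -> IO (Either IOException String)
readFileSafe path = try (readFile path)

-- Fall back to a caller-supplied default when the read fails.
readFileOrDefault :: String -> FilePath -> IO String
readFileOrDefault def path = either (const def) id <$> readFileSafe path
```

In main, this could be used as getArgs >>= readFileOrDefault "" . pickFileName, which behaves like the recipe when the file is missing but does not print the error.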

There's more…

Conveniently, Haskell also comes with a doesFileExist :: FilePath -> IO Bool function from the System.Directory module. We can simplify the preceding code by modifying the input <- … line. It can be replaced with the following snippet of code:

exists <- doesFileExist fileName
input <- if exists then readFile fileName else return ""

In this case, the code reads the file as an input only if it exists. Do not forget to add the following import line at the top of the source code:

import System.Directory (doesFileExist)

Keeping and representing data from a CSV file

Comma-separated values (CSV) is a format to represent a table of values in plain text. It's often used to interact with data from spreadsheets. The specifications for CSV are described in RFC 4180, available at http://tools.ietf.org/html/rfc4180.

In this recipe, we will read a local CSV file called input.csv consisting of various names and their corresponding ages. Then, to do something useful with the data, we will find the oldest person.

Getting ready

Prepare a simple CSV file with a list of names and their corresponding ages. This can be done using a text editor or by exporting from a spreadsheet, as shown in the following figure:

The raw input.csv file contains the following text:

$ cat input.csv 

name,age
Alex,22
Anish,22
Becca,23
Jasdev,22
John,21
Jonathon,21
Kelvin,22
Marisa,19
Shiv,22
Vinay,22

The code also depends on the csv library. We may install the library through Cabal using the following command:

$ cabal install csv

How to do it...

  1. Import the csv library using the following line of code:
    import Text.CSV
  2. Define and implement main, where we will read and parse the CSV file, as shown in the following code:
    main :: IO ()
    main = do
      let fileName = "input.csv"
      input <- readFile fileName
  3. Apply parseCSV to the file contents to obtain a list of rows representing the tabulated data; the filename argument is used only in error messages. The output of parseCSV is Either ParseError CSV, so ensure that we consider both the Left and Right cases:
      let csv = parseCSV fileName input
      either handleError doWork csv
    handleError csv = putStrLn "error parsing"
    doWork csv = (print.findOldest.tail) csv
  4. Now we can work with the CSV data. In this example, we find and print the row containing the oldest person, as shown in the following code snippet:
    findOldest :: [Record] -> Record
    findOldest [] = []
    findOldest xs = foldl1
              (\a x -> if age x > age a then x else a) xs
    
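    -- The age is the second field; rows of any other shape rank lowest.
    age :: Record -> Int
    age [_, b] = toInt b
    age _      = -1
    
    toInt :: String -> Int
    toInt = read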
  5. After running main, the code should produce the following output:
    $ runhaskell Main.hs
    
    ["Becca","23"]
    

    Tip

    We can also use the parseCSVFromFile function to directly get the CSV representation from a filename instead of using readFile followed by parseCSV.
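The fold in step 4 can also be written with library functions from base. The following is a sketch of an alternative, not the recipe's code: it assumes rows are plain lists of strings, skips any row whose age does not parse, and returns Nothing when no usable row exists. The names Row, ageOf, and oldest are hypothetical:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)
import Text.Read (readMaybe)

-- A row is a list of fields, as in the csv library's Record type.
type Row = [String]

-- Read the age from a two-field row; any other shape yields Nothing.
ageOf :: Row -> Maybe Int
ageOf [_, a] = readMaybe a
ageOf _      = Nothing

-- Pick a row with the largest age, ignoring rows that fail to parse.
oldest :: [Row] -> Maybe Row
oldest rows = case [ (n, r) | r <- rows, Just n <- [ageOf r] ] of
  []    -> Nothing
  pairs -> Just (snd (maximumBy (comparing fst) pairs))
```

Using readMaybe instead of read means a malformed field is skipped rather than crashing the program at run time.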

How it works...

The CSV data structure in Haskell is represented as a list of records. Record is merely a list of Fields, and Field is a type synonym for String. In other words, it is a collection of rows representing a table, as shown in the following figure:

The parseCSV library function returns an Either type, with the Left side being a ParseError and the Right side being the list of lists. The Either l r data type is very similar to the Maybe a type which has the Just a or Nothing constructor.

We use the either function to handle the Left and Right cases. The Left case handles the error, and the Right case handles the actual work to be done on the data. In this recipe, the Right side is a CSV value, that is, a list of Records. The fields in each Record are accessible through any list operations such as head, last, !!, and so on.
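The either function is not specific to CSV parsing. As a small sketch using only base, the following hypothetical parseAge function returns Left with a message on failure and Right with a number on success, and describe uses either to handle both cases:

```haskell
import Text.Read (readMaybe)

-- Hypothetical parser: Left carries an error message, Right the result.
parseAge :: String -> Either String Int
parseAge s = case readMaybe s of
  Just n  -> Right n
  Nothing -> Left ("not a number: " ++ show s)

-- `either` takes one handler per constructor, just as in the recipe.
describe :: String -> String
describe = either ("error: " ++) (\n -> "age " ++ show n) . parseAge
```

The first argument to either plays the role of handleError, and the second plays the role of doWork.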

Examining a JSON file with the aeson package

JavaScript Object Notation (JSON) is a way to represent key-value pairs in plain text. The format is described extensively in RFC 4627 (http://www.ietf.org/rfc/rfc4627).

In this recipe, we will parse a JSON description about a person. We often encounter JSON in APIs from web applications.

Getting ready

Install the aeson library from Hackage using Cabal.

Prepare an input.json file representing data about a mathematician, such as the one in the following code snippet:

$ cat input.json

{"name":"Gauss", "nationality":"German", "born":1777, "died":1855}

We will be parsing this JSON and representing it as a usable data type in Haskell.

How to do it...

  1. Use the OverloadedStrings language extension so that string literals can stand for the string-like types that aeson expects, such as Text, as shown in the following line of code:
    {-# LANGUAGE OverloadedStrings #-}
  2. Import aeson as well as some helper functions as follows:
    import Data.Aeson
    import Control.Applicative
    import qualified Data.ByteString.Lazy as B
  3. Create the data type corresponding to the JSON structure, as shown in the following code:
    data Mathematician = Mathematician 
                         { name :: String
                         , nationality :: String
                         , born :: Int
                         , died :: Maybe Int
                         } 
  4. Provide an instance for the parseJSON function, as shown in the following code snippet:
    instance FromJSON Mathematician where
      parseJSON (Object v) = Mathematician
                             <$> (v .: "name")
                             <*> (v .: "nationality")
                             <*> (v .: "born")
                             <*> (v .:? "died")
      parseJSON _ = empty  -- anything other than a JSON object fails to parse
  5. Define and implement main as follows:
    main :: IO ()
    main = do
  6. Read the input and decode the JSON, as shown in the following code snippet:
      input <- B.readFile "input.json"
    
      let mm = decode input :: Maybe Mathematician
    
      case mm of
        Nothing -> print "error parsing JSON"
        Just m -> (putStrLn.greet) m
  7. Now we will do something interesting with the data as follows:
    greet m = (show.name) m ++ 
              " was born in the year " ++ 
              (show.born) m
  8. We can run the code to see the following output:
    $ runhaskell Main.hs
    
    "Gauss" was born in the year 1777
    

How it works...

Aeson takes care of the complications in representing JSON. It creates native usable data out of a structured text. In this recipe, we use the .: and .:? functions provided by the Data.Aeson module.

As the Aeson package works with ByteString and Text values instead of String, it is very helpful to tell the compiler that characters between quotation marks should be treated as the proper data type. This is done in the first line of the code, which enables the OverloadedStrings language extension.

Tip

Language extensions such as OverloadedStrings are currently supported only by the Glasgow Haskell Compiler (GHC).

We use the decode function provided by Aeson to transform a string into a data type. It has the type FromJSON a => B.ByteString -> Maybe a. Our Mathematician data type must implement an instance of the FromJSON typeclass to properly use this function. Fortunately, the only required function for implementing FromJSON is parseJSON. The syntax used in this recipe for implementing parseJSON is a little strange, but this is because we're leveraging applicative functors, which are a more advanced Haskell topic.

The .: function takes two arguments, an Object and a Text key, and returns a Parser a value. As per the documentation, it retrieves the value associated with the given key of an object. This function is used when the key and the value must exist in the JSON document. The .:? function also retrieves the associated value from the given key of an object, but the existence of the key and value is not mandatory. So, we use .:? for optional key-value pairs in a JSON document.
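The applicative style and the difference between required and optional keys can be seen without aeson. The following sketch is only an analogy built on base: an object is modeled as an association list of strings, Either String stands in for aeson's Parser, and requiredKey and optionalKey are hypothetical stand-ins for .: and .:? respectively:

```haskell
import Text.Read (readMaybe)

-- Toy stand-in for a JSON object: string keys mapped to string values.
type Obj = [(String, String)]

data Mathematician = Mathematician
  { name :: String
  , born :: Int
  , died :: Maybe Int
  } deriving (Show, Eq)

-- Like `.:` in spirit: a missing key is a parse failure.
requiredKey :: String -> Obj -> Either String String
requiredKey key obj = maybe (Left ("missing key: " ++ key)) Right (lookup key obj)

-- Like `.:?` in spirit: a missing key succeeds with Nothing.
optionalKey :: String -> Obj -> Either String (Maybe String)
optionalKey key obj = Right (lookup key obj)

asInt :: String -> Either String Int
asInt s = maybe (Left ("not an integer: " ++ s)) Right (readMaybe s)

-- Each field is parsed independently and combined with <$> and <*>.
parseMathematician :: Obj -> Either String Mathematician
parseMathematician obj = Mathematician
  <$> requiredKey "name" obj
  <*> (requiredKey "born" obj >>= asInt)
  <*> (optionalKey "died" obj >>= traverse asInt)
```

If any required step yields Left, the whole result is Left; otherwise the constructor receives its three arguments in order, which mirrors the shape of the parseJSON instance.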

There's more…

If the implementation of the FromJSON typeclass is too involved, we can easily let GHC automatically fill it out using the DeriveGeneric language extension. The following is a simpler rewrite of the code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson
import qualified Data.ByteString.Lazy as B
import GHC.Generics

data Mathematician = Mathematician { name :: String
                                   , nationality :: String
                                   , born :: Int
                                   , died :: Maybe Int
                                   } deriving Generic

instance FromJSON Mathematician

main = do
  input <- B.readFile "input.json"
  let mm = decode input :: Maybe Mathematician
  case mm of
    Nothing -> print "error parsing JSON"
    Just m -> (putStrLn.greet) m
    
greet m = (show.name) m ++" was born in the year "++ (show.born) m

Although Aeson is powerful and generalizable, it may be overkill for some simple JSON interactions. Alternatively, if we wish to use a very minimal JSON parser and printer, we can use Yocto, which can be downloaded from http://hackage.haskell.org/package/yocto.

Reading an XML file using the HXT package

Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/).

In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates.

Getting ready

We will first set up an XML file called input.xml with the following values, representing an e-mail thread between Databender and Princess that starts on December 18, 2014:

$ cat input.xml

<thread>
    <email>
        <to>Databender</to>
        <from>Princess</from>
        <date>Thu Dec 18 15:03:23 EST 2014</date>
        <subject>Joke</subject>
        <body>Why did you divide sin by tan?</body>
    </email>
    <email>
        <to>Princess</to>
        <from>Databender</from>
        <date>Fri Dec 19 3:12:00 EST 2014</date>
        <subject>RE: Joke</subject>
        <body>Just cos.</body>
    </email>
</thread>

Using Cabal, install the HXT library which we use for manipulating XML documents:

$ cabal install hxt

How to do it...

  1. We only need one import, which will be for parsing XML, using the following line of code:
    import Text.XML.HXT.Core
  2. Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code:
    main :: IO ()
    main = do
        input <- readFile "input.xml"
  3. Apply the readString function to the input and extract all the date elements. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function. Also, we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet:
        dates <- runX $ readString [withValidate no] input 
            //> hasName "date" 
            //> getText
  4. We can now use the list of extracted dates as follows:
        print dates
  5. By running the code, we print the following output:
    $ runhaskell Main.hs
    
    ["Thu Dec 18 15:03:23 EST 2014","Fri Dec 19 3:12:00 EST 2014"]
    

How it works...

The library function, runX, takes in an Arrow. Think of an Arrow as a more general relative of a Monad. Arrows allow for stateful global XML processing. Specifically, the runX function in this recipe takes in IOSArrow XmlTree String and returns an IO action that produces a list of Strings. We generate this IOSArrow object using the readString function, which performs a series of operations on the XML data.

To search the whole depth of the XML document, //> should be used, whereas /> only looks at the direct children of the current level. We use the //> function to look up the date elements and extract all the associated text.
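The difference between the two operators can be pictured with a toy tree. The following sketch does not use HXT; Node, childrenNamed, and descendantsNamed are hypothetical names, and the search is a simplified model that stops descending a branch once it finds a match:

```haskell
-- A toy element tree: a tag name and its child elements.
data Node = Node String [Node]
  deriving (Show, Eq)

tagOf :: Node -> String
tagOf (Node t _) = t

-- In the spirit of `/>`: look only at the direct children.
childrenNamed :: String -> Node -> [Node]
childrenNamed t (Node _ kids) = filter ((== t) . tagOf) kids

-- In the spirit of `//>`: search every level below the node,
-- stopping at the first match found along each branch.
descendantsNamed :: String -> Node -> [Node]
descendantsNamed t (Node _ kids) = concatMap go kids
  where
    go n@(Node t' ks)
      | t' == t   = [n]
      | otherwise = concatMap go ks

-- Hypothetical document shaped like input.xml.
thread :: Node
thread = Node "thread"
  [ Node "email" [Node "to" [], Node "date" []]
  , Node "email" [Node "to" [], Node "date" []]
  ]
```

Here childrenNamed "date" thread finds nothing, because the date elements sit one level below the email elements, while descendantsNamed "date" thread finds both.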

As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following:

  • isText: This is used to test for text nodes
  • isAttr: This is used to test for an attribute tree
  • hasAttr: This is used to test whether an element node has an attribute node with a specific name
  • getElemName: This is used to select the name of an element node

All the Arrow functions can be found in the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html.

Capturing table rows from an HTML page

Mining Hypertext Markup Language (HTML) is often a feat of identifying and parsing only its structured segments. Not all text in an HTML file may be useful, so we find ourselves only focusing on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure to extract data whereas a paragraph in an article may be too unstructured and complicated to process.

In this recipe, we will find a table on a web page and gather all rows to be used in the program.

Getting ready

We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure:

The HTML behind this table is as follows:

$ cat input.html

<!DOCTYPE html>
<html>
    <body>
        <h1>Course Listing</h1>
        <table>
            <tr>
                <th>Course</th>
                <th>Time</th>
                <th>Capacity</th>
            </tr>
            <tr>
                <td>CS 1501</td>
                <td>17:00</td>
                <td>60</td>
            </tr>
            <tr>
                <td>MATH 7600</td>
                <td>14:00</td>
                <td>25</td>
            </tr>
            <tr>
                <td>PHIL 1000</td>
                <td>9:30</td>
                <td>120</td>
            </tr>
        </table>
    </body>
</html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:

$ cabal install hxt
$ cabal install split

How to do it...

  1. We will need the hxt package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet:
    import Text.XML.HXT.Core
    import Data.List.Split (chunksOf)
  2. Define and implement main to read the input.html file.
    main :: IO ()
    main = do
      input <- readFile "input.html"
  3. Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td elements and obtain their text, as shown in the following code:
      texts <- runX $ readString 
               [withParseHTML yes, withWarnings no] input 
        //> hasName "td"
        //> getText
  4. The data is now usable as a list of strings. It can be converted into a list of lists, similar to how the data was represented in the earlier CSV recipe, as shown in the following code:
      let rows = chunksOf 3 texts
      print $ findBiggest rows
  5. By folding through the data, identify the course with the largest capacity using the following code snippet:
    findBiggest :: [[String]] -> [String]
    findBiggest [] = []
    findBiggest items = foldl1 
                        (\a x -> if capacity x > capacity a 
                                 then x else a) items
    
    capacity [a,b,c] = toInt c
    capacity _ = -1
    
    toInt :: String -> Int
    toInt = read
  6. Running the code will display the class with the largest capacity as follows:
    $ runhaskell Main.hs
    
    ["PHIL 1000","9:30","120"]
    

How it works...

This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].
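The regrouping of the flat list of cell texts into rows relies on chunksOf. To show what that step does, the following is a hand-written sketch with the hypothetical name chunks; it is not the split package's implementation, and its handling of a non-positive size is a choice made for this sketch:

```haskell
-- Split a list into consecutive pieces of length n; the last piece
-- may be shorter. A non-positive n yields no pieces in this sketch.
chunks :: Int -> [a] -> [[a]]
chunks n xs
  | n <= 0    = []
  | null xs   = []
  | otherwise = let (h, t) = splitAt n xs in h : chunks n t
```

With three td cells per table row, chunks 3 turns the flat list of six strings for two courses into two rows of three fields each.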

Understanding how to perform HTTP GET requests

One of the richest places to find good data is online. A GET request is a common way of communicating with an HTTP web server. In this recipe, we will grab all the links from a Wikipedia article and print them to the terminal. To easily grab all the links, we will use a helpful library called HandsomeSoup, which lets us easily manipulate and traverse a webpage through CSS selectors.

Getting ready

We will be collecting all links from a Wikipedia web page. Make sure to have an Internet connection before running this recipe.

Install the HandsomeSoup CSS selector package, and also install the HXT library if it is not already installed. To do this, use the following commands:

$ cabal install HandsomeSoup
$ cabal install hxt

How to do it...

  1. This recipe requires hxt for parsing HTML and requires HandsomeSoup for the easy-to-use CSS selectors, as shown in the following code snippet:
    import Text.XML.HXT.Core
    import Text.HandsomeSoup
  2. Define and implement main as follows:
    main :: IO ()
    main = do
  3. Pass in the URL as a string to HandsomeSoup's fromUrl function:
        let doc = fromUrl "http://en.wikipedia.org/wiki/Narwhal"
  4. Select all links within the bodyContent field of the Wikipedia page as follows:
        links <- runX $ doc >>> css "#bodyContent a" ! "href"
        print links

How it works…

The HandsomeSoup package makes CSS selectors easy to use. In this recipe, we run the #bodyContent a selector on a Wikipedia article web page. This finds all link tags that are descendants of an element with the bodyContent ID, and the ! "href" step extracts the href attribute of each one.
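
The links value printed in step 4 is an ordinary [String], so it can be post-processed with plain list functions. As a sketch, assuming we only care about links to other articles, the filter below keeps paths that start with /wiki/ and contain no namespace colon, and drops duplicates. The function articleLinks and the sample list are hypothetical; the sample stands in for the real result of the selector.

```haskell
import Data.List (isPrefixOf, nub)

-- Keep internal article links: they start with /wiki/ and contain no
-- namespace colon (as in /wiki/File:... or /wiki/Help:...).
articleLinks :: [String] -> [String]
articleLinks = nub . filter isArticle
  where
    isArticle href = "/wiki/" `isPrefixOf` href && ':' `notElem` href

-- Invented sample of href values, standing in for the recipe's links list.
sampleLinks :: [String]
sampleLinks =
  [ "/wiki/Whale", "#cite_note-1", "/wiki/File:Narwhal.jpg"
  , "http://example.org/", "/wiki/Whale", "/wiki/Arctic" ]

main :: IO ()
main = mapM_ putStrLn (articleLinks sampleLinks)
```

In the recipe itself, the same function could be applied to links before printing.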

See also…

Another common way to obtain data online is through POST requests. To find out more, refer to the Learning how to perform HTTP POST requests recipe.

Learning how to perform HTTP POST requests

A POST request is another very common HTTP server request used by many APIs. We will be mining the University of Virginia directory search. When sending a POST request for a search query, the Lightweight Directory Access Protocol (LDAP) server replies with a web page of search results.

Getting ready

For this recipe, access to the Internet is necessary.

Install the HandsomeSoup CSS selector package, and also install the HXT library if it is not already installed:

$ cabal install HandsomeSoup
$ cabal install hxt

How to do it...

  1. Import the following libraries:
    import Network.HTTP
    import Network.URI (parseURI)
    import Text.XML.HXT.Core
    import Text.HandsomeSoup
    import Data.Maybe (fromJust)
  2. Define the POST request specified by the directory search website. Depending on the server, the following POST request details would be different. Refer to the following code snippet:
    myRequestURL = "http://www.virginia.edu/cgi-local/ldapweb"
    
    myRequest :: String -> Request_String
    myRequest query = Request { 
        rqURI = fromJust $ parseURI myRequestURL
      , rqMethod = POST
      , rqHeaders = [ mkHeader HdrContentType "text/html"
                    , mkHeader HdrContentLength $ show $ length body ]
      , rqBody = body
      }
      where body = "whitepages=" ++ query
  3. Define and implement main to run the POST request on a query as follows:
    main :: IO ()
    main = do
      response <- simpleHTTP $ myRequest "poon"
  4. Gather the HTML and parse it:
      html <- getResponseBody response
      let doc = readString [withParseHTML yes, withWarnings no] html
  5. Find the table cells and print them out using the following:
      rows <- runX $ doc >>> css "td" //> getText
      print rows

Running the code will display all search results relating to "poon", such as "Poonam" or "Witherspoon".

How it works...

A POST request needs a URI, headers, and a body. Filling out a Request record supplies all three, and simpleHTTP then sends the request to the server.
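
The recipe builds the body by appending the query to "whitepages=" unchanged, which is fine for "poon" but would alter the body if the query contained a space, & or =. The sketch below shows one way to form-encode the pair by hand using only base. The names encodeChar and formBody are hypothetical, and the encoder deliberately rejects non-ASCII input rather than guessing an encoding. Inside the recipe, the urlEncodeVars function exported by Network.HTTP is likely the more natural choice; the hand-written version only keeps this example free of dependencies.

```haskell
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- Percent-encode one ASCII character for a form body.
encodeChar :: Char -> String
encodeChar c
  | isAscii c && isAlphaNum c = [c]
  | c `elem` "-_.~" = [c]
  | c == ' ' = "+"
  | otherwise = '%' : pad (map toUpper (showHex (ord c) ""))
  where
    -- Two hex digits, left-padded with a zero when needed.
    pad h = replicate (2 - length h) '0' ++ h

-- Build "key=value" for ASCII-only input; Nothing for anything else.
formBody :: String -> String -> Maybe String
formBody key value
  | all isAscii (key ++ value) =
      Just (concatMap encodeChar key ++ "=" ++ concatMap encodeChar value)
  | otherwise = Nothing

main :: IO ()
main = do
  print (formBody "whitepages" "poon")
  print (formBody "whitepages" "van der")
  print (formBody "whitepages" "a&b=c")
```

The Content-Length header in myRequest would then be computed from the encoded body rather than from the raw query.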

See also

Refer to the Understanding how to perform HTTP GET requests recipe for details on how to perform a GET request instead.

Traversing online directories for data

A directory search typically provides names and contact information per query. By brute forcing many of these search queries, we can obtain all data stored in the directory listing database. This recipe runs thousands of search queries to obtain as much data as possible from a directory search. This recipe is provided only as a learning tool to see the power and simplicity of data gathering in Haskell.

Getting ready

Make sure to have a strong Internet connection.

Install the hxt and HandsomeSoup packages using Cabal:

$ cabal install hxt
$ cabal install HandsomeSoup

How to do it...

  1. Set up the following dependencies:
    import Network.HTTP
    import Network.URI
    import Text.XML.HXT.Core
    import Text.HandsomeSoup
  2. Define a SearchResult type, which is either an error or a successful list of results, as presented in the following code:
    type SearchResult = Either SearchResultErr [String]
    data SearchResultErr = NoResultsErr 
                         | TooManyResultsErr 
                         | UnknownErr     
                         deriving (Show, Eq)
  3. Define the POST request specified by the directory search website. Depending on the server, the POST request will be different. Instead of rewriting code, we use the myRequest function defined in the previous recipe.
  4. Write a helper function to obtain the document from an HTTP POST request, as shown in the following code:
    getDoc query = do  
        rsp <- simpleHTTP $ myRequest query
        html <- getResponseBody rsp
        return $ readString [withParseHTML yes, withWarnings no] html
  5. Scan the HTML document and return either an error or the resulting data. The code in this function depends on the error messages produced by the web page. In our case, the error messages are matched as follows:
    scanDoc doc = do
        errMsg <- runX $ doc >>> css "h3" //> getText
    
        case errMsg of 
            [] -> do 
                text <- runX $ doc >>> css "td" //> getText 
                return $ Right text
            "Error: Sizelimit exceeded":_ -> 
                return $ Left TooManyResultsErr
            "Too many matching entries were found":_ -> 
                return $ Left TooManyResultsErr
            "No matching entries were found":_ -> 
                return $ Left NoResultsErr
            _ -> return $ Left UnknownErr
  6. Define and implement main. We will use a helper function, main', as shown in the following code snippet, to recursively brute force the directory listing:
    main :: IO ()
    main = main' "a"
  7. Run a search of the query and then recursively again on the next query:
    main' query = do
        print query
        doc <- getDoc query
        searchResult <- scanDoc doc
        print searchResult
        case searchResult of
            Left TooManyResultsErr -> 
                main' (nextDeepQuery query)
            _ -> if (nextQuery query) >= endQuery 
                  then print "done!" else main' (nextQuery query)
  8. Write helper functions to define the next logical query as follows:
    nextDeepQuery query = query ++ "a"
    
    nextQuery "z" = endQuery
    nextQuery query = if last query == 'z'
                      then nextQuery $ init query
                      else init query ++ [succ $ last query]
    endQuery = [succ 'z']

How it works...

The code starts by searching for "a" in the directory lookup. This will most likely result in an error because there are too many results. So, in the next iteration, the code will refine its search by querying for "aa", then "aaa", until the search no longer returns TooManyResultsErr.

Then, it will enumerate to the next logical search query "aab", and if that produces no result, it will search for "aac", and so on. This brute force prefix search will obtain all items in the database. We can gather the mass of data, such as names and department types, to perform interesting clustering or analysis later on. The following figure shows how the program starts:

[Figure: how the program starts]
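
The traversal order can be checked without touching the network. The sketch below reuses nextDeepQuery, nextQuery, and endQuery from step 8 and replaces the server with a made-up rule, tooMany, which pretends that any prefix shorter than two letters returns too many results. Both tooMany and visited are hypothetical names introduced only for this illustration.

```haskell
endQuery :: String
endQuery = [succ 'z']

nextDeepQuery :: String -> String
nextDeepQuery query = query ++ "a"

nextQuery :: String -> String
nextQuery "z" = endQuery
nextQuery query
  | last query == 'z' = nextQuery (init query)
  | otherwise = init query ++ [succ (last query)]

-- Pretend server rule (an assumption): prefixes shorter than two
-- letters match too many entries.
tooMany :: String -> Bool
tooMany query = length query < 2

-- The queries main' would issue, in order, starting from a given prefix.
visited :: String -> [String]
visited query
  | query >= endQuery = []
  | tooMany query = query : visited (nextDeepQuery query)
  | otherwise = query : visited (nextQuery query)

main :: IO ()
main = mapM_ putStrLn (take 5 (visited "a"))
```

Under this pretend rule the traversal issues 702 queries: the 26 single letters plus the 676 two-letter prefixes. A real directory would deepen unevenly, going further only where a prefix is still too broad.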

Using MongoDB queries in Haskell

MongoDB is a nonrelational, schemaless database. In this recipe, we will read all the data from a MongoDB collection into Haskell.

Getting ready

We need to install MongoDB on our local machine and have a database instance running in the background while we run the code in this recipe.

MongoDB installation instructions are located at http://www.mongodb.org. On Debian-based operating systems, we can use apt-get to install MongoDB, using the following command line:

$ sudo apt-get install mongodb

Run the database daemon by specifying the database file path as follows:

$ mkdir ~/db
$ mongod --dbpath ~/db

Fill up a "people" collection with dummy data as follows:

$ mongo
> db.people.insert( {first: "Joe", last: "Shmoe"} )

Install the MongoDB package from Cabal using the following command:

$ cabal install mongoDB

How to do it...

  1. Use the OverloadedStrings and ExtendedDefaultRules language extensions to make the MongoDB library easier to use:
    {-# LANGUAGE OverloadedStrings, ExtendedDefaultRules #-}
    import Database.MongoDB
  2. Define and implement main to set up a connection to the locally hosted database. Run MongoDB queries defined in the run function as follows:
    main :: IO ()
    main = do
        let db = "test"
        pipe <- runIOE $ connect (host "127.0.0.1")
        e <- access pipe master db run
        close pipe
        print e
  3. In run, we can combine multiple operations. For this recipe, run will only perform one task, that is, gather data from the "people" collection:
    run = getData
    
    getData = rest =<< find (select [] "people") {sort=[]}

How it works...

The driver establishes a pipe between the running program and the database, and MongoDB operations run over this pipe. The find function takes a query, which we construct by invoking the select :: Selector -> Collection -> aQueryOrSelection function.

Other functions can be found in the documentation at http://hackage.haskell.org/package/mongoDB/docs/Database-MongoDB-Query.html.

See also

If the MongoDB database is on a remote server, refer to the Reading from a remote MongoDB server recipe to set up a connection with remote databases.

Reading from a remote MongoDB server

In many cases, it may be more practical to set up a MongoDB instance on a remote machine. This recipe will cover how to obtain data from a MongoDB instance hosted remotely.

Getting ready

We should create a remote database. MongoLab (https://mongolab.com) and MongoHQ (http://www.mongohq.com) offer MongoDB as a service and have free options to set up a small development database.

Tip

These services will require us to accept their terms and conditions. For some of us, it may be best to host the database in our own remote server.

Install the MongoDB package from Cabal as follows:

$ cabal install mongoDB

Also, install the following helper libraries:

$ cabal install split
$ cabal install uri

How to do it...

  1. Use the OverloadedStrings and ExtendedDefaultRules language extensions required by the library. Import helper functions as follows:
    {-# LANGUAGE OverloadedStrings, ExtendedDefaultRules #-}
    import Database.MongoDB
    import Text.URI
    import Data.Maybe
    import qualified Data.Text as T
    import Data.List.Split
  2. Specify the remote URI for the database connection as follows:
    mongoURI = "mongodb://user:pass@ds12345.mongolab.com:53788/mydb"
  3. The username, password, hostname, port number, and database name must be extracted from the URI, as presented in the following code snippet:
    uri = fromJust $ parseURI mongoURI
    
    getUser = head $ splitOn ":" $ fromJust $ uriUserInfo uri
    
    getPass = last $ splitOn ":" $ fromJust $ uriUserInfo uri
    
    getHost = fromJust $ uriRegName uri
    
    getPort = case uriPort uri of 
        Just port -> show port 
        Nothing -> (last.words.show) defaultPort
    
    getDb = T.pack $ tail $ uriPath uri
  4. Create a database connection by reading the host port of the remote URI as follows:
    main :: IO ()
    main = do
        let hostport = getHost ++ ":" ++ getPort
        pipe <- runIOE $ connect (readHostPort hostport)
        e <- access pipe master getDb run
        close pipe
        print e
  5. Optionally authenticate to the database and obtain data from the "people" collection as follows:
    run = do
      auth (T.pack getUser) (T.pack getPass)
      getData
    
    getData = rest =<< find (select [] "people") {sort=[]}
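
Step 3 leans on fromJust, head, and last, all of which crash on a URI that is missing a part. As an illustrative alternative (not the recipe's method, which uses Text.URI), the sketch below splits the same URI shape by hand with base only and returns Nothing when a required part is absent. MongoTarget, splitAtChar, and parseMongoURI are hypothetical names; the sketch assumes credentials are always present and that the password contains no @ character, and it falls back to MongoDB's default port 27017 when none is given.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

data MongoTarget = MongoTarget
  { mtUser :: String
  , mtPass :: String
  , mtHost :: String
  , mtPort :: String
  , mtDb   :: String
  } deriving (Show, Eq)

-- Split at the first occurrence of a separator, dropping the separator.
splitAtChar :: Char -> String -> Maybe (String, String)
splitAtChar c s = case break (== c) s of
  (before, _ : after) -> Just (before, after)
  _ -> Nothing

-- Parse mongodb://user:pass@host[:port]/db into its parts.
parseMongoURI :: String -> Maybe MongoTarget
parseMongoURI s = do
  rest <- stripPrefix "mongodb://" s
  (creds, location) <- splitAtChar '@' rest
  (user, pass) <- splitAtChar ':' creds
  (hostPort, db) <- splitAtChar '/' location
  let (host, port) = fromMaybe (hostPort, "27017") (splitAtChar ':' hostPort)
  return (MongoTarget user pass host port db)

main :: IO ()
main = print (parseMongoURI "mongodb://user:pass@ds12345.mongolab.com:53788/mydb")
```

The host and port fields can then be joined with ":" and handed to readHostPort exactly as in step 4.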

See also

If the database is on a local machine, refer to the Using MongoDB queries in Haskell recipe.

Exploring data from a SQLite database

SQLite is a relational database that enforces a strict schema. It is simply a file on a machine that we can interact with through Structured Query Language (SQL). There is an easy-to-use Haskell library to send these SQL commands to our database.

In this recipe, we will use such a library to extract all data from a SQLite database.

Getting ready

We need to install the SQLite database if it isn't already set up. It can be obtained from http://www.sqlite.org. On Debian systems, we can install it with apt-get using the following command:

$ sudo apt-get install sqlite3

Now create a simple database to test our code, using the following commands:

$ sqlite3 test.db "CREATE TABLE test \
(id INTEGER PRIMARY KEY, str text); \
INSERT INTO test (str) VALUES ('test string');"

We must also install the SQLite Haskell package from Cabal as follows:

$ cabal install sqlite-simple

This recipe will dissect the example code presented on the library's documentation page available at http://hackage.haskell.org/package/sqlite-simple/docs/Database-SQLite-Simple.html.

How to do it…

  1. Use the OverloadedStrings language extension and import the relevant libraries, as shown in the following code:
    {-# LANGUAGE OverloadedStrings #-}
    
    import Control.Applicative
    import Database.SQLite.Simple
    import Database.SQLite.Simple.FromRow
  2. Define a data type that represents one row of the SQLite table. Provide it with an instance of the FromRow typeclass so that we may easily parse it from the table, as shown in the following code snippet:
    data TestField = TestField Int String deriving (Show)
    
    instance FromRow TestField where
      fromRow = TestField <$> field <*> field
  3. Finally, open the database and read every row as follows:
    main :: IO ()
    main = do
      conn <- open "test.db"
      r <- query_ conn "SELECT * from test" :: IO [TestField]
      mapM_ print r
      close conn

Description

Step-by-step recipes filled with practical code samples and engaging examples demonstrate Haskell in practice, and then the concepts behind the code. This book shows functional developers and analysts how to leverage their existing knowledge of Haskell specifically for high-quality data analysis. A good understanding of data sets and functional programming is assumed.

What you will learn

  • Obtain and analyze raw data from various sources including text files, CSV files, databases, and websites
  • Implement practical tree and graph algorithms on various datasets
  • Apply statistical methods such as moving average and linear regression to understand patterns
  • Fiddle with parallel and concurrent code to speed up and simplify time-consuming algorithms
  • Find clusters in data using some of the most popular machine learning algorithms
  • Manage results by visualizing or exporting data

Product Details

Publication date: Jun 25, 2014
Length: 334 pages
Edition: 1st
Language: English
ISBN-13: 9781783286348


Table of Contents

13 Chapters
1. The Hunt for Data
2. Integrity and Inspection
3. The Science of Words
4. Data Hashing
5. The Dance with Trees
6. Graph Fundamentals
7. Statistics and Analysis
8. Clustering and Classification
9. Parallel and Concurrent Design
10. Real-time Data
11. Visualizing Data
12. Exporting and Presenting
Index

Customer reviews

Rating: 3.7 out of 5 (6 ratings)
5 star 50%
4 star 0%
3 star 33.3%
2 star 0%
1 star 16.7%

Nelson Solano Nov 09, 2017
Rating: 5 out of 5
Was intimidated by all the content within this book, but turns out it's very approachable! Lots of examples and different ways of explaining concepts. I'm already beginning to feel like I have a stronger grasp with Haskell, especially in the context to data science and statistics. I recommend this book to anyone who wants an intro to data analysis techniques for real-world use.
Amazon verified review
Student May 12, 2015
Rating: 5 out of 5
This book enumerates through dozens of important algorithms used in typical data analysis tasks. It’s one of the most practical and hands-on books on this subject for the Haskell programming language. The examples tie together nicely. I can easily copy and paste the code to test each algorithm. The author also provides the code for each recipe on GitHub. I would recommend this to anyone who has touched Haskell and is willing to explore more interesting applications.
Amazon verified review
David Jameson Jul 05, 2014
Rating: 5 out of 5
Great idea, I have been looking for a cookbook like this for some time and I have been slowly working through the examples. The Haskell world needs books like this really badly as most documentation that you find focuses more on defining the functions rather than helping you use them. There are some typos here and there such that the compiler produces errors that are hard to understand if you're not already pretty good with Haskell. That had spoiled it a bit for me at first. However, the great news is that up to date source code is available on github and so as long as you get code from there rather than just copying from the book directly, you should be fine.
Amazon verified review
garrison jensen Apr 03, 2015
Rating: 3 out of 5
I thought this book would explain algorithms. It doesn't. It simply points to numerous libraries that already implement them. I like it, I will use it as a reference for libraries. But if you are expecting to find advice on implementing algorithms yourself, this is not the book for you.
Amazon verified review
Jake McCrary Sep 01, 2014
Rating: 3 out of 5
Packt Publishing recently asked me to write a review of the book Haskell Data Analysis Cookbook by Nishant Shukla. The book is broken into small sections that show you how to do a particular task related to data analysis. These tasks vary from reading a csv file or parsing json to listening to a stream of tweets.

I’m not a Haskell programmer. My Haskell experience is limited to reading some books (Learn You a Haskell for Great Good and most of Real World Haskell) and solving some toy problems. All of reading and programming happened years ago though so I’m out of practice.

This book is not for a programmer that is unfamiliar with Haskell. If you’ve never studied it before you’ll find yourself turning towards documentation. If you enter this book with a solid understanding of functional programming you can get by with a smaller understanding of Haskell but you will not get much from the book.

I’ve only read a few cookbook style books and this one followed the usual format. It will be more useful as a quick reference than as something you would read through. It doesn’t dive deep into any topic but does point you toward libraries for various tasks and shows a short example of using them.

A common critic I have of most code examples applies to this book. Most examples do not do qualified imports of namespaces or selective imports of functions from namespaces. This is especially useful when your examples might be read by people who are not be familiar with the languages standard libraries. Reading code and immediately knowing where a function comes from is incredibly useful to understanding.

The code for this book is available on GitHub. It is useful to look at the full example for a section. The examples in the book are broken into parts with English explanations and I found that made it hard to fully understand how the code fit together. Looking at the examples in the GitHub repo helped.

Recommendation

I’d recommend this book for Haskell programmers who find the table of contents interesting. If you read the table of contents and think it would be useful to have a shallow introduction to the topics listed then you’ll find this book useful. It doesn’t give a detailed dive into anything but at least gives you a starting point. If you either learning Haskell or using Haskell then this book doesn’t have much to offer you.
Amazon verified review