Traversing online directories for data
A directory search typically returns names and contact information per query. By brute-forcing many such queries, we can recover all of the data stored in the directory listing database. This recipe runs thousands of search queries to obtain as much data as possible from a directory search. This recipe is provided only as a learning tool to see the power and simplicity of data gathering in Haskell.
Getting ready
Make sure to have a strong Internet connection.
Install the hxt and HandsomeSoup packages using Cabal:

$ cabal install hxt
$ cabal install HandsomeSoup
How to do it...
- Set up the following dependencies:
import Network.HTTP
import Network.URI
import Text.XML.HXT.Core
import Text.HandsomeSoup
- Define a SearchResult type, which represents either an error or a successful list of results, as presented in the following code:

type SearchResult = Either SearchResultErr [String]

data SearchResultErr = NoResultsErr
                     | TooManyResultsErr
                     | UnknownErr
                     deriving (Show, Eq)
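Values of this type can be constructed and inspected with the usual Either machinery. The following is a small self-contained sketch; the helper names found, failed, and describe are our own illustrations, not part of the recipe:

```haskell
-- Sketch of the SearchResult type from the recipe, with example values.
type SearchResult = Either SearchResultErr [String]

data SearchResultErr = NoResultsErr
                     | TooManyResultsErr
                     | UnknownErr
                     deriving (Show, Eq)

-- A successful lookup carries the matching rows; a failure carries the error.
found :: SearchResult
found = Right ["Alice", "Bob"]

failed :: SearchResult
failed = Left TooManyResultsErr

-- 'either' collapses both cases into a single summary string.
describe :: SearchResult -> String
describe = either (\e  -> "error: " ++ show e)
                  (\xs -> show (length xs) ++ " results")
```

For example, describe found evaluates to "2 results", while describe failed evaluates to "error: TooManyResultsErr".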
- Define the POST request specified by the directory search website. The POST request will differ depending on the server. Instead of rewriting code, we reuse the myRequest function defined in the previous recipe.
- Write a helper function to obtain the document from an HTTP POST request, as shown in the following code:
getDoc query = do
  rsp <- simpleHTTP $ myRequest query
  html <- getResponseBody rsp
  return $ readString [withParseHTML yes, withWarnings no] html
- Scan the HTML document and either report an error or return the resulting data. The code in this function depends on the error messages produced by the web page. In our case, the error messages are handled as follows:
scanDoc doc = do
  errMsg <- runX $ doc >>> css "h3" //> getText
  case errMsg of
    [] -> do
      text <- runX $ doc >>> css "td" //> getText
      return $ Right text
    "Error: Sizelimit exceeded":_ ->
      return $ Left TooManyResultsErr
    "Too many matching entries were found":_ ->
      return $ Left TooManyResultsErr
    "No matching entries were found":_ ->
      return $ Left NoResultsErr
    _ -> return $ Left UnknownErr
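The branching inside scanDoc is independent of hxt: given the text of the page's h3 tags and td cells, it is a pure decision. As a sketch, that decision can be factored into a standalone function (classify is our own name for this refactoring, not part of the recipe):

```haskell
-- Hypothetical pure refactoring of scanDoc's branching: given the texts of
-- the page's <h3> error headings and <td> cells, decide what the search
-- returned. No heading means success; known headings map to known errors.
type SearchResult = Either SearchResultErr [String]

data SearchResultErr = NoResultsErr
                     | TooManyResultsErr
                     | UnknownErr
                     deriving (Show, Eq)

classify :: [String] -> [String] -> SearchResult
classify [] cells                                    = Right cells
classify ("Error: Sizelimit exceeded":_) _           = Left TooManyResultsErr
classify ("Too many matching entries were found":_) _ = Left TooManyResultsErr
classify ("No matching entries were found":_) _      = Left NoResultsErr
classify _ _                                         = Left UnknownErr
```

Pulling the logic out this way makes it testable without any network or HTML parsing; scanDoc would then only gather the two text lists and call classify.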
- Define and implement main. We will use a helper function, main', as shown in the following code snippet, to recursively brute-force the directory listing:

main :: IO ()
main = main' "a"
- Run a search for the query, and then recurse on the next query:
main' query = do
  print query
  doc <- getDoc query
  searchResult <- scanDoc doc
  print searchResult
  case searchResult of
    Left TooManyResultsErr -> main' (nextDeepQuery query)
    _ -> if nextQuery query >= endQuery
         then print "done!"
         else main' (nextQuery query)
- Write helper functions to define the next logical query as follows:
nextDeepQuery query = query ++ "a"

nextQuery "z" = endQuery
nextQuery query = if last query == 'z'
                  then nextQuery $ init query
                  else init query ++ [succ $ last query]

endQuery = [succ 'z']
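To see the order these helpers generate, here is a self-contained sketch with the same definitions (type signatures added for clarity):

```haskell
-- Deepen the search by one character: "aa" becomes "aaa".
nextDeepQuery :: String -> String
nextDeepQuery query = query ++ "a"

-- Step sideways to the next query at the same or a shallower depth:
-- bump the last character, rolling a trailing 'z' back into the prefix.
nextQuery :: String -> String
nextQuery "z" = endQuery
nextQuery query = if last query == 'z'
                  then nextQuery $ init query
                  else init query ++ [succ $ last query]

-- The sentinel past the final query: succ 'z' is '{'.
endQuery :: String
endQuery = [succ 'z']
```

For example, nextQuery "aaa" is "aab", nextDeepQuery "aa" is "aaa", and nextQuery "az" rolls over the trailing 'z' to produce "b".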
How it works...
The code starts by searching for "a" in the directory lookup. This will most likely result in an error because there are too many results. So, in the next iteration, the code refines its search by querying for "aa", then "aaa", and so on, until it no longer receives TooManyResultsErr :: SearchResultErr.
Then, it will enumerate to the next logical search query, "aab"; if that produces no result, it will search for "aac", and so on. This brute-force prefix search will obtain all items in the database. We can gather the mass of data, such as names and department types, to perform interesting clustering or analysis later on. The following figure shows how the program starts:
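The traversal can be dry-run without a server by replacing the directory lookup with a predicate. In the following sketch, tooMany is a stand-in oracle we invented for illustration (here, any query shorter than two characters "overflows"), and visit mirrors main': deepen on overflow, otherwise step sideways until past endQuery:

```haskell
-- Helpers reproduced from the recipe.
nextDeepQuery :: String -> String
nextDeepQuery query = query ++ "a"

nextQuery :: String -> String
nextQuery "z" = "{"
nextQuery query = if last query == 'z'
                  then nextQuery $ init query
                  else init query ++ [succ $ last query]

-- Hypothetical oracle standing in for the server's TooManyResultsErr:
-- pretend every one-character prefix has too many matches.
tooMany :: String -> Bool
tooMany q = length q < 2

-- Pure mirror of main': collect the queries in the order they would run.
visit :: String -> [String]
visit q
  | tooMany q          = q : visit (nextDeepQuery q)  -- refine the prefix
  | nextQuery q >= "{" = [q]                          -- past "zz": done
  | otherwise          = q : visit (nextQuery q)      -- next sibling query
```

Under this oracle, visit "a" begins ["a", "aa", "ab", "ac", ...], eventually rolls "az" over into "b", deepens again, and terminates at "zz" after 702 queries (26 one-character prefixes plus 26 x 26 two-character queries).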