Haskell Data Analysis Cookbook: Explore intuitive data analysis techniques and powerful machine learning methods using over 130 practical recipes


Chapter 2. Integrity and Inspection

This chapter will cover the following recipes:

  • Trimming excess whitespace
  • Ignoring punctuation and specific characters
  • Coping with unexpected or missing input
  • Validating records by matching regular expressions
  • Lexing and parsing an e-mail address
  • Deduplication of nonconflicting data items
  • Deduplication of conflicting data items
  • Implementing a frequency table using Data.List
  • Implementing a frequency table using Data.MultiSet
  • Computing the Manhattan distance
  • Computing the Euclidean distance
  • Comparing scaled data using the Pearson correlation coefficient
  • Comparing sparse data using cosine similarity

Introduction


The conclusions drawn from data analysis are only as robust as the quality of the data itself. After obtaining raw text, the next natural step is to validate and clean it carefully. Even the slightest bias may risk the integrity of the results. Therefore, we must take great precautionary measures, which involve thorough inspection, to ensure sanity checks are performed on our data before we begin to understand it. This section should be the starting point for cleaning data in Haskell.

Real-world data often contains impurities that need to be addressed before the data can be processed. For example, extraneous whitespace or punctuation can clutter data, making it difficult to parse. Duplication and data conflicts are other unintended consequences of reading real-world data. Sometimes it's just reassuring to confirm that the data makes sense by conducting sanity checks. Some examples of sanity checks include matching regular expressions as well as detecting outliers by establishing...

Trimming excess whitespace


The text obtained from sources may unintentionally include beginning or trailing whitespace characters. When parsing such an input, it is often wise to trim the text. For example, when Haskell source code contains trailing whitespace, the GHC compiler ignores it through a process called lexing. The lexer produces a sequence of tokens, effectively ignoring meaningless characters such as excess whitespace.

In this recipe, we will use built-in libraries to make our own trim function.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. Import the isSpace :: Char -> Bool function from the built-in Data.Char module:

    import Data.Char (isSpace)
  2. Write a trim function that removes the beginning and trailing whitespace:

    trim :: String -> String
    trim = f . f
      where f = reverse . dropWhile isSpace
  3. Test it out within main:

    main :: IO ()
    main = putStrLn $ trim " wahoowa! "
  4. Running the code will result in the following trimmed string:

    ...
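
As a minimal sketch of why the definition works, the following file reuses the recipe's trim and applies it to a few extra inputs. The sample strings and the surrounding brackets are assumptions added for illustration; they are not part of the recipe. The helper f drops leading whitespace and reverses the string, so applying it twice removes whitespace from both ends and restores the original order.

```haskell
import Data.Char (isSpace)

-- Same definition as in the recipe: f drops leading whitespace and reverses,
-- so the second application removes what was originally trailing whitespace
-- and restores the original character order.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

main :: IO ()
main = do
  -- Brackets (illustrative only) make the string boundaries visible.
  putStrLn $ "[" ++ trim "\t padded on both sides \n" ++ "]"
  putStrLn $ "[" ++ trim "inner  spaces   stay" ++ "]"
  putStrLn $ "[" ++ trim "   " ++ "]"
```

Whitespace inside the string is left untouched, and a string made only of whitespace becomes empty.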

Ignoring punctuation and specific characters


Usually in natural language processing, some uninformative words or characters, called stop words, can be filtered out for easier handling. When computing word frequencies or extracting sentiment data from a corpus, punctuation or special characters might need to be ignored. This recipe demonstrates how to remove these specific characters from the body of a text.

How to do it...

There are no imports necessary. Create a new file, which we will call Main.hs, and perform the following steps:

  1. Implement main and define a string called quote. The backslashes (\) at the end of one line and the start of the next form a string gap, which lets the string literal continue across lines:

    main :: IO ()
    main = do
      let quote = "Deep Blue plays very good chess-so what?\ 
        \Does that tell you something about how we play chess?\
        \No. Does it tell you about how Kasparov envisions,\ 
        \understands a chessboard? (Douglas Hofstadter)"
      putStrLn $ (removePunctuation.replaceSpecialSymbols) quote
  2. Replace all punctuation marks with an empty string, and...
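
The two helpers used in main are cut off in this preview. A minimal sketch of what they might look like follows; the character sets punctuations and specialSymbols are assumptions chosen for illustration, and the sample sentence is hypothetical. Punctuation is filtered out, while special symbols are replaced by a space so that the words they join stay separate.

```haskell
-- Hypothetical character sets; adjust them for the corpus at hand.
punctuations :: [Char]
punctuations = "!\"#$%().,?"

specialSymbols :: [Char]
specialSymbols = "/-"

-- Drop every punctuation character.
removePunctuation :: String -> String
removePunctuation = filter (`notElem` punctuations)

-- Replace each special symbol with a space so joined words stay separate.
replaceSpecialSymbols :: String -> String
replaceSpecialSymbols = map (\c -> if c `elem` specialSymbols then ' ' else c)

main :: IO ()
main = putStrLn $ (removePunctuation . replaceSpecialSymbols) "chess-so what? (yes/no)"
```

The order of composition matters here: the symbols are replaced first, so a hyphen between two words turns into a space rather than disappearing.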

Coping with unexpected or missing input


Data sources often contain incomplete and unexpected data. One common approach to parsing such data in Haskell is using the Maybe data type.

Imagine designing a function to find the nth element in a list of characters. A naïve implementation may have the type Int -> [Char] -> Char. However, if the function is trying to access an index out of bounds, we should try to indicate that an error has occurred.

A common way to deal with these errors is by encapsulating the output Char into a Maybe context. Having the type Int -> [Char] -> Maybe Char allows for some better error handling. The constructors for Maybe are Just a or Nothing, which will become apparent by running GHCi and testing out the following commands:

$ ghci

Prelude> :type Just 'c'
Just 'c' :: Maybe Char

Prelude> :type Nothing
Nothing :: Maybe a
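
The nth-element function described above could be sketched as follows. The name nth and the zero-based indexing are assumptions made for this example; the text only specifies the type Int -> [Char] -> Maybe Char.

```haskell
-- Hypothetical helper: look up the character at a zero-based index,
-- returning Nothing instead of failing when the index is out of bounds.
nth :: Int -> [Char] -> Maybe Char
nth _ [] = Nothing
nth n (x:xs)
  | n < 0     = Nothing
  | n == 0    = Just x
  | otherwise = nth (n - 1) xs

main :: IO ()
main = do
  print (nth 1 "abc")
  print (nth 5 "abc")
```

Callers are forced by the type to handle the Nothing case, which is the error-handling benefit the paragraph refers to.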

We will set each field as a Maybe data type so that whenever a field cannot be parsed, it will simply be represented as Nothing....

Validating records by matching regular expressions


A regular expression is a language for matching patterns in a string. Our Haskell code can process a regular expression to examine a text and tell us whether or not it matches the rules described by the expression. Regular expression matching can be used to validate or identify a pattern in the text.

In this recipe, we will read a corpus of English text to find possible candidates of full names in a sea of words. Full names usually consist of two words that start with a capital letter. We use this heuristic to extract all the names from an article.
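
To make the heuristic concrete before any regular expression library is involved, here is a rough sketch that uses only plain list functions from base. It is not the recipe's regex-based approach; the function names and the sample sentence are assumptions for illustration. It roughly mirrors a pattern of one capitalized word, a space, and another capitalized word.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- A word counts as capitalized if it is one uppercase letter followed by
-- at least one lowercase letter and nothing else.
isCapitalized :: String -> Bool
isCapitalized (c:cs) = isUpper c && not (null cs) && all isLower cs
isCapitalized []     = False

-- Pair each word with its successor. The first word must be capitalized as
-- written; the second may carry trailing punctuation, which is cut off.
nameCandidates :: String -> [String]
nameCandidates text =
  [ a ++ " " ++ b'
  | (a, b) <- zip ws (drop 1 ws)
  , isCapitalized a
  , let b' = takeWhile isAlpha b
  , isCapitalized b'
  ]
  where ws = words text

main :: IO ()
main = mapM_ putStrLn $
  nameCandidates "include Mark Norell, chairman; Philip Currie, a professor"
```

Like any heuristic of this kind, it will also report capitalized word pairs that are not personal names, such as institution names.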

Getting ready

Create an input.txt file with some text. In this example, we use a snippet from a New York Times article on dinosaurs (http://www.nytimes.com/2013/12/17/science/earth/outsider-challenges-papers-on-growth-of-dinosaurs.html)

Other co-authors of Dr. Erickson's include Mark Norell, chairman of paleontology at the American Museum of Natural History; Philip Currie, a professor of dinosaur...

Lexing and parsing an e-mail address


An elegant way to clean data is by defining a lexer to split up a string into tokens. In this recipe, we will parse an e-mail address using the attoparsec library. This will naturally allow us to ignore the surrounding whitespace.

Getting ready

Install the attoparsec parser combinator library:

$ cabal install attoparsec

How to do it…

Create a new file, which we will call Main.hs, and perform the following steps:

  1. Use the GHC OverloadedStrings language extension to more legibly use the Text data type throughout the code. Also, import the other relevant libraries:

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Attoparsec.Text
    import Data.Char (isSpace, isAlphaNum)
  2. Declare a data type for an e-mail address:

    data Email = Email
      { user :: String
      , host :: String
      } deriving Show
  3. Define how to parse an e-mail address. This function can be as simple or as complicated as required:

    email :: Parser Email
    email = do
      skipSpace
      user <- many' $ satisfy isAlphaNum...

Deduplication of nonconflicting data items


Duplication is a common problem when collecting large amounts of data. In this recipe, we will combine similar records in a way that ensures no information is lost.

Getting ready

Create an input.csv file with repeated data:

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. We will be using the CSV, Map, and Maybe packages:

    import Text.CSV (parseCSV, Record)
    import Data.Map (fromListWith)
    import Control.Applicative ((<|>))
  2. Define the Item data type corresponding to the CSV input:

    data Item = Item   { name :: String
                       , color :: Maybe String
                       , cost :: Maybe Float
                       } deriving Show
  3. Get each record from CSV and put them in a map by calling our doWork function:

    main :: IO ()
    main = do
      let fileName = "input.csv"
      input <- readFile fileName
      let csv = parseCSV fileName input
      either handleError doWork csv
  4. If we're unable to parse CSV, print an error message...
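
The remaining steps are cut off in this preview. As a rough sketch of the merging idea, the following file combines records with Data.Map.fromListWith and (<|>), working on an in-memory list instead of parsed CSV. The combine and dedup functions and the sample items are assumptions made for illustration, not the recipe's own code.

```haskell
import Control.Applicative ((<|>))
import qualified Data.Map as M

data Item = Item
  { name  :: String
  , color :: Maybe String
  , cost  :: Maybe Float
  } deriving (Show, Eq)

-- Hypothetical merge rule: for each field, keep whichever value is present.
-- Because the data is assumed to be nonconflicting, the order does not matter.
combine :: Item -> Item -> Item
combine a b = Item
  { name  = name a
  , color = color a <|> color b
  , cost  = cost a <|> cost b
  }

-- Key every item by its name and merge items that share a key.
dedup :: [Item] -> M.Map String Item
dedup items = M.fromListWith combine [ (name i, i) | i <- items ]

-- Hypothetical sample data standing in for the rows of input.csv.
sample :: [Item]
sample =
  [ Item "glasses" (Just "black") Nothing
  , Item "glasses" Nothing (Just 60)
  , Item "shoes" (Just "brown") Nothing
  ]

main :: IO ()
main = mapM_ print (M.elems (dedup sample))
```

The two partial "glasses" records end up as one record that carries both the color and the cost, so no information is lost.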

Deduplication of conflicting data items


Unfortunately, information about an item may be inconsistent throughout the corpus. Collision strategies are often domain-dependent, but one common way to manage this conflict is by simply storing all variations of the data. In this recipe, we will read a CSV file that contains information about musical artists and store all of the information about their songs and genres in a set.

Getting ready

Create a CSV input file with the following musical artists. The first column is for the name of the artist or band. The second column is the song name, and the third is the genre. Notice how some musicians have multiple songs or genres.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. We will be using the CSV, Map, and Set packages:

    import Text.CSV (parseCSV, Record)
    import Data.Map (fromListWith)
    import qualified Data.Set as S
  2. Define the Artist data type corresponding to the CSV input. For fields that may contain conflicting...
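
The rest of this recipe is cut off in the preview. A minimal sketch of the "store all variations" strategy follows, using sets for the fields that may conflict. The field names, the helper functions, and the sample rows are assumptions for illustration, and the rows are given in memory rather than parsed from CSV.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- Hypothetical record: one artist with every song and genre seen so far.
data Artist = Artist
  { artistName :: String
  , songs      :: S.Set String
  , genres     :: S.Set String
  } deriving (Show, Eq)

-- Conflicting values are not resolved; both sides are kept via set union.
combine :: Artist -> Artist -> Artist
combine a b = Artist
  { artistName = artistName a
  , songs      = songs a `S.union` songs b
  , genres     = genres a `S.union` genres b
  }

-- Turn one (artist, song, genre) row into a record with singleton sets.
fromRow :: (String, String, String) -> Artist
fromRow (n, s, g) = Artist n (S.singleton s) (S.singleton g)

-- Group rows by artist name, merging rows that share a name.
collect :: [(String, String, String)] -> M.Map String Artist
collect rows = M.fromListWith combine [ (n, fromRow r) | r@(n, _, _) <- rows ]

-- Hypothetical sample rows standing in for the CSV input.
sampleRows :: [(String, String, String)]
sampleRows =
  [ ("Artist A", "Song 1", "Rock")
  , ("Artist A", "Song 2", "Pop")
  , ("Artist B", "Song 3", "Jazz")
  ]

main :: IO ()
main = mapM_ print (M.elems (collect sampleRows))
```

Using a set rather than a list means that a value repeated across rows is stored only once.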

Implementing a frequency table using Data.List


A frequency map of values is often useful to detect outliers. We can use it to identify frequencies that seem out of the ordinary. In this recipe, we will be counting the number of different colors in a list.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. We will use the group and sort functions from Data.List:

    import Data.List (group, sort)
  2. Define a simple data type for colors:

    data Color = Red | Green | Blue deriving (Show, Ord, Eq)
  3. Create a list of these colors:

    main :: IO ()
    main = do
      let items = [Red, Green, Green, Blue, Red, Green, Green]
  4. Implement the frequency map and print it out:

      let freq = 
         map (\x -> (head x, length x)) . group . sort $ items
      print freq

How it works...

Grouping identical items after sorting the list is the central idea.

See the following step-by-step evaluation in ghci:

Prelude> sort items

[Red,Red,Green,Green,Green,Green,Blue]
Prelude> group it

[[Red,Red...
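
The steps of this recipe can be assembled into a single file, sketched below. The reusable frequency function is an assumption introduced for this example; the recipe itself computes the table inline with head and length.

```haskell
import Data.List (group, sort)

data Color = Red | Green | Blue deriving (Show, Ord, Eq)

-- Sort so equal values become adjacent, group the runs, then count each run.
-- Matching on (x:_) avoids the partial function head; group never produces
-- an empty run, so no element is dropped by the pattern.
frequency :: Ord a => [a] -> [(a, Int)]
frequency xs = [ (x, length run) | run@(x:_) <- group (sort xs) ]

main :: IO ()
main = print (frequency [Red, Green, Green, Blue, Red, Green, Green])
```

The result follows the Ord instance of the element type, so the colors appear in declaration order: Red, Green, Blue.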

Description

Step-by-step recipes filled with practical code samples and engaging examples demonstrate Haskell in practice, and then the concepts behind the code. This book shows functional developers and analysts how to leverage their existing knowledge of Haskell specifically for high-quality data analysis. A good understanding of data sets and functional programming is assumed.

What you will learn

  • Obtain and analyze raw data from various sources including text files, CSV files, databases, and websites
  • Implement practical tree and graph algorithms on various datasets
  • Apply statistical methods such as moving average and linear regression to understand patterns
  • Fiddle with parallel and concurrent code to speed up and simplify time-consuming algorithms
  • Find clusters in data using some of the most popular machine learning algorithms
  • Manage results by visualizing or exporting data

Product Details

Publication date: Jun 25, 2014
Length: 334 pages
Edition: 1st
Language: English
ISBN-13: 9781783286348


Table of Contents

13 Chapters
1. The Hunt for Data
2. Integrity and Inspection
3. The Science of Words
4. Data Hashing
5. The Dance with Trees
6. Graph Fundamentals
7. Statistics and Analysis
8. Clustering and Classification
9. Parallel and Concurrent Design
10. Real-time Data
11. Visualizing Data
12. Exporting and Presenting
Index

Customer reviews

Rating distribution
3.7 out of 5 (6 Ratings)
5 star 50%
4 star 0%
3 star 33.3%
2 star 0%
1 star 16.7%

Top Reviews

Nelson Solano Nov 09, 2017
5 out of 5 stars
Was intimidated by all the content within this book, but turns out it's very approachable! Lots of examples and different ways of explaining concepts. I'm already beginning to feel like I have a stronger grasp with Haskell, especially in the context to data science and statistics. I recommend this book to anyone who wants an intro to data analysis techniques for real-world use.
Amazon Verified review
Student May 12, 2015
5 out of 5 stars
This book enumerates through dozens of important algorithms used in typical data analysis tasks. It’s one of the most practical and hands-on books on this subject for the Haskell programming language. The examples tie together nicely. I can easily copy and paste the code to test each algorithm. The author also provides the code for each recipe on GitHub. I would recommend this to anyone who has touched Haskell and is willing to explore more interesting applications.
Amazon Verified review
David Jameson Jul 05, 2014
5 out of 5 stars
Great idea, I have been looking for a cookbook like this for some time and I have been slowly working through the examples. The Haskell world needs books like this really badly as most documentation that you find focuses more on defining the functions rather than helping you use them. There are some typos here and there such that the compiler produces errors that are hard to understand if you're not already pretty good with Haskell. That had spoiled it a bit for me at first. However, the great news is that up to date source code is available on github and so as long as you get code from there rather than just copying from the book directly, you should be fine.
Amazon Verified review
garrison jensen Apr 03, 2015
3 out of 5 stars
I thought this book would explain algorithms. It doesn't. It simply points to numerous libraries that already implement them. I like it, I will use it as a reference for libraries. But if you are expecting to find advice on implementing algorithms yourself, this is not the book for you.
Amazon Verified review
Jake McCrary Sep 01, 2014
3 out of 5 stars
Packt Publishing recently asked me to write a review of the book Haskell Data Analysis Cookbook by Nishant Shukla. The book is broken into small sections that show you how to do a particular task related to data analysis. These tasks vary from reading a csv file or parsing json to listening to a stream of tweets. I’m not a Haskell programmer. My Haskell experience is limited to reading some books (Learn You a Haskell for Great Good and most of Real World Haskell) and solving some toy problems. All of reading and programming happened years ago though so I’m out of practice. This book is not for a programmer that is unfamiliar with Haskell. If you’ve never studied it before you’ll find yourself turning towards documentation. If you enter this book with a solid understanding of functional programming you can get by with a smaller understanding of Haskell but you will not get much from the book. I’ve only read a few cookbook style books and this one followed the usual format. It will be more useful as a quick reference than as something you would read through. It doesn’t dive deep into any topic but does point you toward libraries for various tasks and shows a short example of using them. A common critic I have of most code examples applies to this book. Most examples do not do qualified imports of namespaces or selective imports of functions from namespaces. This is especially useful when your examples might be read by people who are not be familiar with the languages standard libraries. Reading code and immediately knowing where a function comes from is incredibly useful to understanding. The code for this book is available on GitHub. It is useful to look at the full example for a section. The examples in the book are broken into parts with English explanations and I found that made it hard to fully understand how the code fit together. Looking at the examples in the GitHub repo helped. Recommendation: I’d recommend this book for Haskell programmers who find the table of contents interesting. If you read the table of contents and think it would be useful to have a shallow introduction to the topics listed then you’ll find this book useful. It doesn’t give a detailed dive into anything but at least gives you a starting point. If you either learning Haskell or using Haskell then this book doesn’t have much to offer you.
Amazon Verified review