Clojure for Data Science

You're reading from   Clojure for Data Science Statistics, big data, and machine learning for Clojure programmers

Product type Paperback
Published in Sep 2015
Publisher
ISBN-13 9781784397180
Length 608 pages
Edition 1st Edition
Author: Henry Garner


The importance of visualizations

Simple visualizations like those earlier are succinct ways of conveying a large quantity of information. They complement the summary statistics we calculated earlier in the chapter, and it's important that we use them. Statistics such as the mean and standard deviation necessarily conceal a lot of information as they reduce a sequence down to just a single number.

The statistician Francis Anscombe devised a collection of four scatter plots, known as Anscombe's Quartet, that have nearly identical statistical properties (including the mean, variance, and standard deviation). In spite of this, it's visually apparent that the distributions of the xs and ys are all very different:

[Figure: Anscombe's Quartet, four scatter plots]
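
To make the point concrete, the following is a minimal sketch in plain Clojure that computes the mean and sample variance of the y values in each of Anscombe's four datasets. The numbers are transcribed from Anscombe's published quartet; the helper names (mean, variance, summarize, anscombe-ys) are illustrative and are not part of the chapter's code.

```clojure
;; Illustrative helpers, not the chapter's own functions.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn variance
  "Sample variance (divides by n - 1)."
  [xs]
  (let [m (mean xs)
        n (count xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec n))))

;; The y values of the four datasets in Anscombe's Quartet.
(def anscombe-ys
  {:i   [8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68]
   :ii  [9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74]
   :iii [7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73]
   :iv  [6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.50 5.56 7.91 6.89]})

(defn summarize [ys]
  {:mean (mean ys) :variance (variance ys)})

;; Map from dataset key to its summary statistics.
(def summaries
  (into {} (for [[k ys] anscombe-ys] [k (summarize ys)])))
```

Evaluating summaries gives a mean of roughly 7.5 and a variance of roughly 4.1 for every dataset, even though the four sequences look nothing alike when plotted.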

Datasets don't have to be contrived to reveal valuable insights when graphed. Take, for example, this histogram of the marks earned by candidates in Poland's national Matura exam in 2013:

[Figure: histogram of marks earned in Poland's 2013 Matura exam]

We might expect the abilities of students to be normally distributed and indeed, with the exception of a sharp spike around 30 percent, they are. What we can clearly see is the very human effect of examiners nudging students' grades over the pass mark.

In fact, the distributions for sequences drawn from large samples can be so reliable that any deviation from them can be evidence of illegal activity. Benford's law, also called the first-digit law, is a curious feature of random numbers generated over a large range. The digit one occurs as the leading digit about 30 percent of the time, while larger digits occur less and less frequently. For example, nine occurs as the leading digit less than 5 percent of the time.

Note

Benford's law is named after the physicist Frank Benford, who stated it in 1938 and showed its consistency across a wide variety of data sources. It had been observed over 50 years earlier by Simon Newcomb, who noticed that the pages of his books of logarithm tables were more battered for numbers beginning with the digit one.

Benford showed that the law applied to data as diverse as electricity bills, street addresses, stock prices, population numbers, death rates, and lengths of rivers. The law is so consistent for data sets covering large ranges of values that deviation from it has been accepted as evidence in trials for financial fraud.
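
The frequencies quoted above follow from the usual statement of Benford's law, P(d) = log10(1 + 1/d) for a leading digit d. The sketch below, in plain Clojure, computes those expected proportions and compares them with the leading digits of the first 1,000 powers of two, a sequence that covers a very large range of values. The function names are illustrative and are not taken from the book.

```clojure
(defn benford-probability
  "Expected proportion of values whose leading digit is d (1 to 9)."
  [d]
  (Math/log10 (+ 1.0 (/ 1.0 d))))

(defn leading-digit
  "First decimal digit of a positive integer n."
  [n]
  (Character/digit (first (str n)) 10))

(defn leading-digit-frequencies
  "Observed proportion of each leading digit (1 to 9) in xs."
  [xs]
  (let [counts (frequencies (map leading-digit xs))
        total  (double (count xs))]
    (into (sorted-map)
          (for [d (range 1 10)]
            [d (/ (get counts d 0) total)]))))

;; 2^0 to 2^999, using arbitrary-precision integers to avoid overflow.
(def powers-of-two
  (take 1000 (iterate #(* 2N %) 1N)))

(def observed
  (leading-digit-frequencies powers-of-two))

;; Expected proportions for comparison with observed.
(def expected
  (into (sorted-map)
        (for [d (range 1 10)] [d (benford-probability d)])))
```

Comparing observed with expected digit by digit shows the two tracking each other closely. The same comparison applied to, say, a ledger of transaction amounts is the kind of check the fraud cases mentioned above rely on.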

Visualizing electorate data

Let's return to the election data and compare the electorate sequence we created earlier against the theoretical normal distribution CDF. We can use Incanter's s/cdf-normal function to generate a normal CDF from a sequence of values. The default mean is 0 and standard deviation is 1, so we'll need to provide the measured mean and standard deviation from the electorate data. These values for our electorate data are 70,150 and 7,679, respectively.
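
As a quick check of how the :mean and :sd options behave, the sketch below evaluates s/cdf-normal at a few points using the electorate mean and standard deviation quoted above. It assumes Incanter is on the classpath and that s is an alias for incanter.stats, as in the chapter's examples; the names electorate-mean, electorate-sd, and electorate-cdf are introduced here for illustration only.

```clojure
(require '[incanter.stats :as s])

;; Values quoted in the text for the electorate data.
(def electorate-mean 70150.0)
(def electorate-sd   7679.0)

(defn electorate-cdf
  "Normal CDF with the electorate's mean and standard deviation, evaluated at x."
  [x]
  (s/cdf-normal x :mean electorate-mean :sd electorate-sd))

;; At the mean, half of the probability mass lies below x.
(def at-mean (electorate-cdf electorate-mean))

;; One standard deviation either side of the mean.
(def one-sd-above (electorate-cdf (+ electorate-mean electorate-sd)))
(def one-sd-below (electorate-cdf (- electorate-mean electorate-sd)))
```

For any normal distribution, the CDF is 0.5 at the mean and about 0.84 and 0.16 at one standard deviation above and below it, so these three values are a simple sanity check that the parameters were passed correctly.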

We generated an empirical CDF earlier in the chapter. The following example simply generates each of the two CDFs and plots them on a single c/xy-plot:

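;; Assumes the aliases i, s, and c refer to incanter.core, incanter.stats,
;; and incanter.charts, and that load-data is defined earlier in the chapter.
(defn ex-1-24 []
  (let [electorate (->> (load-data :uk-scrubbed)
                        (i/$ "Electorate"))
        ;; Empirical CDF: a function built from the observed values
        ecdf   (s/cdf-empirical electorate)
        ;; Normal CDF using the measured mean and standard deviation
        fitted (s/cdf-normal electorate
                             :mean (s/mean electorate)
                             :sd   (s/sd electorate))]
    (-> (c/xy-plot electorate fitted
                   :x-label "Electorate"
                   :y-label "Probability"
                   :series-label "Fitted"
                   :legend true)
        (c/add-lines electorate (map ecdf electorate)
                     :series-label "Empirical")
        (i/view))))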

The preceding example generates the following plot:

[Figure: empirical and fitted normal CDFs of the electorate data]

You can see from the proximity of the two lines to each other how closely this data resembles normality, although a slight skew is evident. The skew is in the opposite direction to the dishonest baker CDF we plotted previously, so our electorate data is slightly skewed to the left.
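
The direction of a skew can also be checked numerically. The sketch below, in plain Clojure, computes a moment-based skewness: the third central moment divided by the cube of the standard deviation. A negative result indicates a left skew and a positive result a right skew. The function and the small sample vectors are made up for illustration; they are not the electorate data.

```clojure
(defn skewness
  "Moment-based skewness of xs: third central moment over sd cubed."
  [xs]
  (let [n  (double (count xs))
        m  (/ (reduce + xs) n)
        ds (map #(- % m) xs)
        m2 (/ (reduce + (map #(* % %) ds)) n)     ; second central moment
        m3 (/ (reduce + (map #(* % % %) ds)) n)]  ; third central moment
    (/ m3 (Math/pow m2 1.5))))

;; Hypothetical samples: a long tail of small values, its mirror image,
;; and a symmetric sequence.
(def left-skewed  [1 7 8 8 9 9 10])
(def right-skewed [1 2 2 3 3 4 10])
(def symmetric    [1 2 3 4 5])

(def skews
  {:left      (skewness left-skewed)
   :right     (skewness right-skewed)
   :symmetric (skewness symmetric)})
```

The left-skewed sample yields a negative value, its mirror image yields the same magnitude with a positive sign, and the symmetric sample yields zero.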

As we're comparing our distribution against the theoretical normal distribution, let's use a Q-Q plot, which will do this by default:

;; c/qq-plot compares the sample against the normal distribution by default.
(defn ex-1-25 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (c/qq-plot)
       (i/view)))

The following Q-Q plot does an even better job of highlighting the left skew evident in the data:

[Figure: Q-Q plot of the electorate data against the normal distribution]

As we expected, the curve bows in the opposite direction to the dishonest baker Q-Q plot earlier in the chapter. This indicates that there is a greater number of constituencies that are smaller than we would expect if the data were more closely normally distributed.
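
To see what a Q-Q plot is comparing, the following sketch pairs each sorted sample value with the standard normal quantile at a matching probability. It assumes Incanter is on the classpath and uses s/quantile-normal as the inverse of the normal CDF. The plotting positions i/(n + 1) are one common convention and may not be exactly what c/qq-plot uses internally; the name qq-pairs and the sample vector are illustrative.

```clojure
(require '[incanter.stats :as s])

(defn qq-pairs
  "Returns [theoretical-quantile sample-value] pairs for a normal Q-Q plot."
  [xs]
  (let [n  (count xs)
        ;; Probabilities 1/(n+1), 2/(n+1), ..., n/(n+1)
        ps (map #(/ (double %) (inc n)) (range 1 (inc n)))]
    (map vector
         (map #(s/quantile-normal %) ps)
         (sort xs))))

;; Hypothetical sample, not the electorate data.
(def example-pairs (qq-pairs [3.0 1.0 2.0]))
```

If the sample were normally distributed, the pairs would fall close to a straight line. Points that bend away from the line at one end, as in the electorate plot, show where the sample has more or fewer extreme values than the normal distribution predicts.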
