Clojure for Data Science

You're reading from Clojure for Data Science: Statistics, big data, and machine learning for Clojure programmers

Product type: Paperback
Published in: Sep 2015
ISBN-13: 9781784397180
Length: 608 pages
Edition: 1st Edition
Author: Henry Garner

Skewness

Skewness is the name for the asymmetry of a distribution about its mode. Negative skew, or left skew, indicates that the area under the graph is larger on the left side of the mode, so the longer tail extends to the left. Positive skew, or right skew, indicates that the area under the graph is larger on the right side of the mode, so the longer tail extends to the right.

[Figure: Skewness]

Incanter has a built-in function for measuring skewness in the stats namespace:

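(defn ex-1-20 []
  ;; s is the alias for Incanter's stats namespace; mean, median, and
  ;; dishonest-baker are assumed to be in scope from earlier in the chapter.
  (let [weights (take 10000 (dishonest-baker 950 30))]
    {:mean (mean weights)
     :median (median weights)
     :skewness (s/skewness weights)}))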

The preceding example shows that the skewness of the dishonest baker's output is about 0.4, quantifying the skew evident in the histogram.
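To make the quantity concrete, here is a minimal sketch of one common definition of skewness: the third central moment divided by the standard deviation cubed. The moment-skewness function is a hypothetical helper written with clojure.core only. It is not Incanter's implementation, and s/skewness may apply a different sample correction, so the two need not agree exactly.

```clojure
;; Moment-based skewness, sketched with clojure.core only.
;; NOTE: moment-skewness is a hypothetical helper, not Incanter's code.
(defn moment-skewness [xs]
  (let [xs (map double xs)
        n  (count xs)
        mu (/ (reduce + xs) n)
        ;; k-th central moment: the mean of (x - mu)^k
        central-moment (fn [k]
                         (/ (reduce + (map #(Math/pow (double (- % mu))
                                                      (double k))
                                           xs))
                            n))]
    ;; third central moment scaled by the variance raised to 3/2
    (/ (central-moment 3)
       (Math/pow (double (central-moment 2)) 1.5))))

(moment-skewness [1 2 3 4 5])   ; symmetric data => 0.0
(moment-skewness [1 1 1 2 10])  ; long right tail, result is positive
```

Mirroring the data about zero flips the sign of the result, which matches the left and right skew described above.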

Quantile-quantile plots

We encountered quantiles earlier in the chapter as a means of describing the distribution of data. Recall that the quantile function accepts a number between zero and one and returns the value of the sequence at that point; for example, 0.5 corresponds to the median value.
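As a reminder of what such a function does, the sketch below sorts the data and picks the element nearest to the requested position. The name quantile-at is hypothetical, and the rounding rule is an assumption: the quantile function used earlier in the chapter, and Incanter's own, may round or interpolate differently.

```clojure
;; Nearest-index quantile lookup with no interpolation.
;; NOTE: quantile-at is a hypothetical helper for illustration only.
(defn quantile-at [q xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        ;; map q in [0, 1] onto an index in [0, n - 1]
        idx    (Math/round (double (* q (dec n))))]
    (nth sorted idx)))

(quantile-at 0.5 [5 1 4 2 3]) ; => 3, the median
(quantile-at 0 [5 1 4 2 3])   ; => 1, the minimum
(quantile-at 1 [5 1 4 2 3])   ; => 5, the maximum
```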

Plotting the quantiles of our data against the quantiles of the normal distribution allows us to see how our measured data compares against the theoretical distribution. Plots such as this are called Q-Q plots, and they provide a quick and intuitive way of assessing normality. For data corresponding closely to the normal distribution, the Q-Q plot is a straight line. Deviations from a straight line indicate the manner in which the data deviates from the idealized normal distribution.
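The construction can be sketched by hand: sort the sample, assign each value a probability, and pair it with the normal quantile at that probability. Everything below is a hypothetical illustration using clojure.core only. The erf approximation (Abramowitz and Stegun 7.1.26), the bisection inverse, and the (i + 0.5) / n plotting positions are assumptions of this sketch, not necessarily what Incanter does internally.

```clojure
;; Building Q-Q points by hand. All names are hypothetical helpers.

(defn erf
  "Abramowitz-Stegun approximation 7.1.26 to the error function."
  [x]
  (let [sign (if (neg? x) -1.0 1.0)
        x    (Math/abs (double x))
        t    (/ 1.0 (+ 1.0 (* 0.3275911 x)))
        poly (* t (+ 0.254829592
                     (* t (+ -0.284496736
                             (* t (+ 1.421413741
                                     (* t (+ -1.453152027
                                             (* t 1.061405429)))))))))]
    (* sign (- 1.0 (* poly (Math/exp (- (* x x))))))))

(defn normal-cdf
  "Cumulative distribution function of the standard normal."
  [z]
  (* 0.5 (+ 1.0 (erf (/ z (Math/sqrt 2.0))))))

(defn normal-quantile
  "Inverts normal-cdf by bisection on the interval [-10, 10]."
  [p]
  (loop [lo -10.0 hi 10.0 i 0]
    (let [mid (/ (+ lo hi) 2.0)]
      (cond
        (= i 60)               mid
        (< (normal-cdf mid) p) (recur mid hi (inc i))
        :else                  (recur lo mid (inc i))))))

(defn qq-points
  "Pairs each sorted sample value with a theoretical normal quantile."
  [xs]
  (let [sorted (sort xs)
        n      (count sorted)]
    (map-indexed (fn [i x]
                   ;; (i + 0.5) / n keeps the probability strictly inside (0, 1)
                   [(normal-quantile (/ (+ i 0.5) n)) x])
                 sorted)))

;; Each pair is [theoretical-quantile sample-value]
(qq-points [3 1 2])
```

Plotting the second element of each pair against the first gives the scatter chart described above. For normally distributed data the points fall close to a straight line.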

Let's draw Q-Q plots for both our honest and dishonest bakers side by side. Incanter's c/qq-plot function accepts a sequence of data points and generates a scatter chart of the sample quantiles plotted against the quantiles from the theoretical normal distribution:

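(defn ex-1-21 []
  ;; Honest baker: take 10,000 samples and view their Q-Q plot
  (->> (honest-baker 1000 30)
       (take 10000)
       (c/qq-plot)
       (i/view))
  ;; Dishonest baker: the same pipeline, viewed as a second chart
  (->> (dishonest-baker 950 30)
       (take 10000)
       (c/qq-plot)
       (i/view)))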

The preceding code will produce the following plots:

[Figure: Q-Q plot for the honest baker]

The plot above is the Q-Q plot for the honest baker. The dishonest baker's plot is next:

[Figure: Q-Q plot for the dishonest baker]

The fact that the line is curved indicates that the data is positively skewed; a curve in the other direction would indicate negative skew. In fact, Q-Q plots make it easier to discern a wide variety of deviations from the standard normal distribution, as shown in the following diagram:

[Figure: Q-Q plot shapes for a variety of deviations from the normal distribution]

Q-Q plots compare the distributions of the honest and dishonest bakers against the theoretical normal distribution. In the next section, we'll look at several alternative ways of visually comparing two (or more) measured sequences of values with each other.
