Clojure for Data Science

Statistics, big data, and machine learning for Clojure programmers

Product type: Paperback
Published in: Sep 2015
ISBN-13: 9781784397180
Length: 608 pages
Edition: 1st Edition
Author: Henry Garner

Poincaré's baker

There's a story that, while almost certainly apocryphal, lets us look more closely at how the central limit theorem helps us reason about the way distributions are formed. It concerns the celebrated nineteenth-century French polymath Henri Poincaré who, so the story goes, weighed his bread every day for a year.

Baking was a regulated profession, and Poincaré discovered that, while the weights of the bread followed a normal distribution, the peak was at 950g rather than the advertised 1kg. He reported his baker to the authorities and so the baker was fined.

The next year, Poincaré continued to weigh his bread from the same baker. He found the mean value was now 1kg, but that the distribution was no longer symmetrical around the mean. The distribution was skewed to the right, consistent with the baker giving Poincaré only the heaviest of his loaves. Poincaré reported his baker to the authorities once more and his baker was fined a second time.

Whether the story is true or not needn't concern us here; it's provided simply to illustrate a key point—the distribution of a sequence of numbers can tell us something important about the process that generated it.

Generating distributions

To develop our intuition about the normal distribution and variance, let's model an honest and a dishonest baker using Incanter's distribution functions. We can model the honest baker as a normal distribution with a mean of 1,000, corresponding to a fair loaf of 1kg. We'll assume that variability in the baking process results in a standard deviation of 30g.

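(require '[incanter.distributions :as d])

;; Lazy, infinite sequence of loaf weights drawn from a normal distribution.
(defn honest-baker [mean sd]
  (let [distribution (d/normal-distribution mean sd)]
    (repeatedly #(d/draw distribution))))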

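(require '[incanter.charts :as c]
         '[incanter.core :as i])

;; Plot 10,000 honest loaf weights as a 25-bin histogram.
(defn ex-1-18 []
  (-> (take 10000 (honest-baker 1000 30))
      (c/histogram :x-label "Honest baker"
                   :nbins 25)
      (i/view)))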

The preceding code will provide an output similar to the following histogram:

[Figure: histogram of 10,000 loaf weights from the honest baker]
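
A numerical check can complement the histogram. The following is a minimal sketch rather than the book's own code: it assumes java.util.Random as a stand-in for Incanter's d/draw so that it runs without any dependencies, and the names gaussian-weights, sample-mean, sample-sd, and honest-sample are hypothetical helpers introduced only for this illustration. With 10,000 draws, the sample mean and standard deviation should land close to the 1,000 and 30 used to define the honest baker.

```clojure
;; Assumption: java.util.Random stands in for Incanter's d/draw so that the
;; sketch has no dependencies. The seed only makes runs repeatable.
(defn gaussian-weights
  "Lazy, infinite sequence of weights drawn from Normal(mean, sd)."
  [^java.util.Random rng mean sd]
  (repeatedly #(+ mean (* sd (.nextGaussian rng)))))

(defn sample-mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-sd
  "Sample standard deviation (n - 1 in the denominator)."
  [xs]
  (let [m  (sample-mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

;; 10,000 loaves from the honest baker: mean 1,000g, standard deviation 30g.
(def honest-sample
  (doall (take 10000 (gaussian-weights (java.util.Random. 42) 1000 30))))

(println "mean:" (sample-mean honest-sample))
(println "sd:  " (sample-sd honest-sample))
```

The exact figures depend on the seed, but both should sit within a gram or two of the parameters, which is what the symmetric, bell-shaped histogram suggests.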

Now, let's model a baker who sells only the heaviest of his loaves. We partition the sequence into groups of thirteen (a "baker's dozen") and pick the maximum value:

(defn dishonest-baker [mean sd]
  (let [distribution (d/normal-distribution mean sd)]
    (->> (repeatedly #(d/draw distribution))
         (partition 13)
         (map (partial apply max)))))

(defn ex-1-19 []
  (-> (take 10000 (dishonest-baker 950 30))
      (c/histogram :x-label "Dishonest baker"
                   :nbins 25)
      (i/view)))

The preceding code will produce a histogram similar to the following:

[Figure: histogram of 10,000 loaf weights from the dishonest baker]

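It should be apparent that this histogram does not look quite like the others we have seen. Although the dishonest baker's loaves are baked with a mean of 950g, keeping only the heaviest of each thirteen brings the mean of the loaves sold back to roughly 1kg. The spread of values around that mean is no longer symmetrical, however: the right-hand tail is longer than the left. We say that this histogram indicates a skewed normal distribution.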
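
The asymmetry can also be put into numbers. The sketch below is an illustration under stated assumptions, not the book's implementation: java.util.Random replaces Incanter's d/draw so that the code is dependency-free, skewness is computed as the third central moment divided by the cube of the standard deviation, and the names gaussian-weights, heaviest-of-13, central-moment, and skewness are hypothetical helpers. A symmetric sample should give a skewness near zero, while a sample with a longer right-hand tail should give a clearly positive value.

```clojure
;; Assumption: java.util.Random replaces Incanter's d/draw (no dependencies).
(defn gaussian-weights [^java.util.Random rng mean sd]
  (repeatedly #(+ mean (* sd (.nextGaussian rng)))))

;; Same selection rule as the dishonest baker: the maximum of each group of 13.
(defn heaviest-of-13 [weights]
  (->> weights (partition 13) (map (partial apply max))))

(defn sample-mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn central-moment
  "Mean of (x - mean)^k over the sample."
  [k xs]
  (let [m (sample-mean xs)]
    (sample-mean (map #(Math/pow (- % m) k) xs))))

(defn skewness
  "Third central moment divided by the cube of the standard deviation."
  [xs]
  (/ (central-moment 3 xs)
     (Math/pow (central-moment 2 xs) 1.5)))

(def rng (java.util.Random. 42))

(def honest
  (doall (take 10000 (gaussian-weights rng 1000 30))))

(def dishonest
  (doall (take 10000 (heaviest-of-13 (gaussian-weights rng 950 30)))))

(println "honest    mean/skewness:" (sample-mean honest) (skewness honest))
(println "dishonest mean/skewness:" (sample-mean dishonest) (skewness dishonest))
```

Both means should come out near 1,000g, so the mean alone would not distinguish the two bakers. The skewness is what gives the dishonest baker away.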
