Clojure for Data Science

Statistics, big data, and machine learning for Clojure programmers

Product type: Paperback
Published in: Sep 2015
ISBN-13: 9781784397180
Length: 608 pages
Edition: 1st Edition
Author: Henry Garner

Descriptive statistics

Descriptive statistics are numbers that are used to summarize and describe data. In the next chapter, we'll turn our attention to a more sophisticated analysis, the so-called inferential statistics, but for now we'll limit ourselves to simply describing what we can observe about the data contained in the file.

To demonstrate what we mean, let's look at the Electorate column of the data. This column lists the total number of registered voters in each constituency:

(defn ex-1-6 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (count)))

;; 650

We've filtered the nil field from the dataset, so the Electorate column should now contain 650 numbers, one for each UK constituency. The preceding code counts them and returns 650.

Descriptive statistics, also called summary statistics, are ways of measuring attributes of sequences of numbers. They help characterize the sequence and can act as a guide for further analysis. Let's start by calculating the two most basic statistics that we can from a sequence of numbers—its mean and its variance.

The mean

The most common way of measuring the average of a data set is with the mean. It's actually one of several ways of measuring the central tendency of the data. The mean, or more precisely, the arithmetic mean, is a straightforward calculation—simply add up the values and divide by the count—but in spite of this it has a somewhat intimidating mathematical notation:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $\bar{x}$ is pronounced x-bar, the mathematical symbol often used to denote the mean.

To programmers coming to data science from fields outside mathematics or the sciences, this notation can be quite confusing and alienating. Others may be entirely comfortable with this notation, and they can safely skip the next section.

Interpreting mathematical notation

Although mathematical notation may appear obscure and upsetting, there are really only a handful of symbols that will occur frequently in the formulae in this book.

Σ is pronounced sigma and means sum. When you see it in mathematical notation it means that a sequence is being added up. The symbols above and below the sigma indicate the range over which we'll be summing. They're rather like a C-style for loop and in the earlier formula indicate we'll be summing from i=1 up to i=n. By convention n is the length of the sequence, and sequences in mathematical notation are one-indexed, not zero-indexed, so summing from 1 to n means that we're summing over the entire length of the sequence.

The expression immediately following the sigma is the sequence to be summed. In our preceding formula for the mean, $x_i$ immediately follows the sigma. Since i will represent each index from 1 up to n, $x_i$ represents each element in the sequence of xs.

Finally, $\frac{1}{n}$ appears just before the sigma, indicating that the entire expression should be multiplied by 1 divided by n (also called the reciprocal of n). This can be simplified to just dividing by n.
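To make the comparison with a for loop concrete, here is a minimal sketch that walks an explicit index from 1 up to n, exactly as the sigma notation reads. The name sum-by-index is hypothetical and introduced only for illustration; it is not part of the book's code or of Incanter.

```clojure
;; Illustrative sketch: sum-by-index is a hypothetical helper name.
(defn sum-by-index
  "Adds up x_1 .. x_n by walking an explicit index, like a for loop."
  [xs]
  (let [xs (vec xs)      ; vectors give fast lookup by position
        n  (count xs)]   ; n is the length of the sequence
    (loop [i     1       ; the mathematical index starts at 1
           total 0]
      (if (> i n)
        total
        ;; x_i lives at position (dec i) because Clojure is zero-indexed
        (recur (inc i) (+ total (nth xs (dec i))))))))

(sum-by-index [2 4 6])
;; => 12
```

Dividing this total by n (here 3) gives the mean, 4. Idiomatic Clojure rarely indexes into a sequence like this; the loop is only meant to mirror the notation.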

| Name | Mathematical symbol | Clojure equivalent |
|---|---|---|
|  | $n$ | (count xs) |
| Sigma notation | $\sum_{i=1}^{n} x_i$ | (reduce + xs) |
| Pi notation | $\prod_{i=1}^{n} x_i$ | (reduce * xs) |
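The three Clojure equivalents can be tried on a sequence small enough to check by hand. The vector below is made-up sample data, not taken from the election dataset.

```clojure
;; Made-up sample data, small enough to verify by hand.
(def xs [1 2 3 4])

(count xs)      ;; => 4   (n)
(reduce + xs)   ;; => 10  (1 + 2 + 3 + 4)
(reduce * xs)   ;; => 24  (1 * 2 * 3 * 4)

;; With an empty sequence and no initial value, reduce calls the
;; function with no arguments, which yields its identity element.
(reduce + [])   ;; => 0
(reduce * [])   ;; => 1
```

The empty-sequence results match the usual mathematical conventions: an empty sum is 0 and an empty product is 1.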

Putting this all together, we get "add up the elements in the sequence from the first to the last and divide by the count". In Clojure, this can be written as:

(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))

Here, xs stands for "the sequence of xs". We can use our new mean function to calculate the mean of the UK electorate:

(defn ex-1-7 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (mean)))

;; 70149.94
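As a quick check of the hand-written mean function, it can be applied to a few short sequences. The definition is repeated so that the snippet runs on its own, and the input vectors are made-up sample data.

```clojure
;; Same definition as in the text, repeated so the snippet is self-contained.
(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))

(mean [2 4 6])        ;; => 4
(mean [1 2 4])        ;; => 7/3, an exact ratio because every input is an integer
(mean [1.0 2.0 4.0])  ;; => 2.3333333333333335
```

One thing to be aware of is that Clojure's / returns an exact ratio when integer division leaves a remainder, so wrapping the result in double may be useful when a decimal is wanted. Calling this mean on an empty sequence divides zero by zero and throws an ArithmeticException.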

In fact, the incanter.stats namespace already includes a function, mean, that calculates the mean of a sequence very efficiently. In this chapter (and throughout the book), the incanter.stats namespace will be required as s wherever it's used, so the function is called as s/mean.

The median

The median is another common descriptive statistic for measuring the central tendency of a sequence. If you ordered all the data from lowest to highest, the median is the middle value. If there is an even number of data points in the sequence, the median is usually defined as the mean of the middle two values.

The median is often represented in formulae by $\tilde{x}$, pronounced x-tilde. It's one of the deficiencies of mathematical notation that there's no particularly standard way of expressing the formula for the median value, but nonetheless it's fairly straightforward in Clojure:

(defn median [xs]
  (let [n   (count xs)
        mid (int (/ n 2))]
    (if (odd? n)
      (nth (sort xs) mid)
      (->> (sort xs)
           (drop (dec mid))
           (take 2)
           (mean)))))
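Both branches of median can be traced on small inputs. The sketch below repeats the mean and median definitions from the text so it runs on its own; the input vectors are made-up sample data.

```clojure
;; Definitions repeated from the text so the snippet is self-contained.
(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))

(defn median [xs]
  (let [n   (count xs)
        mid (int (/ n 2))]
    (if (odd? n)
      (nth (sort xs) mid)
      (->> (sort xs)
           (drop (dec mid))
           (take 2)
           (mean)))))

;; Odd count: sorted is (1 2 3), mid is 1, so the middle value is returned.
(median [3 1 2])    ;; => 2

;; Even count: sorted is (1 2 3 4), mid is 2, so the middle pair (2 3)
;; is averaged.
(median [4 1 3 2])  ;; => 5/2
```

The even case returns the ratio 5/2 rather than 2.5 because the inputs are integers and the mean is computed with exact division.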

The median of the UK electorate is:

(defn ex-1-8 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (median)))

;; 70813.5

Incanter also provides a function for calculating the median, s/median.
