Descriptive statistics
Descriptive statistics are numbers that are used to summarize and describe data. In the next chapter, we'll turn our attention to a more sophisticated analysis, the so-called inferential statistics, but for now we'll limit ourselves to simply describing what we can observe about the data contained in the file.
To demonstrate what we mean, let's look at the Electorate column of the data. This column lists the total number of registered voters in each constituency:
(defn ex-1-6 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (count)))

;; 650
Because we've filtered the nil field from the dataset, the Electorate column holds 650 numbers, one for each of the UK constituencies, and the preceding code should return that count.
Descriptive statistics, also called summary statistics, are ways of measuring attributes of sequences of numbers. They help characterize the sequence and can act as a guide for further analysis. Let's start by calculating the two most basic statistics that we can from a sequence of numbers—its mean and its variance.
The mean
The most common way of measuring the average of a data set is with the mean. It's actually one of several ways of measuring the central tendency of the data. The mean, or more precisely, the arithmetic mean, is a straightforward calculation—simply add up the values and divide by the count—but in spite of this it has a somewhat intimidating mathematical notation:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $\bar{x}$ is pronounced x-bar, the mathematical symbol often used to denote the mean.
To programmers coming to data science from fields outside mathematics or the sciences, this notation can be quite confusing and alienating. Others may be entirely comfortable with this notation, and they can safely skip the next section.
Interpreting mathematical notation
Although mathematical notation may appear obscure and upsetting, there are really only a handful of symbols that will occur frequently in the formulae in this book.
Σ is pronounced sigma and means sum. When you see it in mathematical notation it means that a sequence is being added up. The symbols above and below the sigma indicate the range over which we'll be summing. They're rather like a C-style for loop and in the earlier formula indicate we'll be summing from i=1 up to i=n. By convention n is the length of the sequence, and sequences in mathematical notation are one-indexed, not zero-indexed, so summing from 1 to n means that we're summing over the entire length of the sequence.
The expression immediately following the sigma is the sequence to be summed. In our preceding formula for the mean, $x_i$ immediately follows the sigma. Since i will represent each index from 1 up to n, $x_i$ represents each element in the sequence of xs.
Finally, $\frac{1}{n}$ appears just before the sigma, indicating that the entire expression should be multiplied by 1 divided by n (also called the reciprocal of n). This can be simplified to just dividing by n.
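The reading of sigma notation described above can be sketched in Clojure. This is only an illustration: the helper names sum-by-index and sum-by-reduce and the sample values are hypothetical and do not come from the dataset. The first function walks the indices explicitly, much like the for loop analogy, and the second expresses the same sum with reduce.

```clojure
;; Hypothetical helpers for illustration only; the names are assumptions.

(defn sum-by-index
  "Sums xs by stepping through each index, mirroring sigma notation.
  Clojure vectors are zero-indexed, so the mathematical range
  i = 1..n becomes i = 0..n-1 here."
  [xs]
  (let [xs (vec xs)
        n  (count xs)]
    (loop [i 0
           total 0]
      (if (< i n)
        (recur (inc i) (+ total (nth xs i)))
        total))))

(defn sum-by-reduce
  "The same sum, written with reduce."
  [xs]
  (reduce + xs))

(def sample [1 2 3 4])

(sum-by-index sample)  ;; 10
(sum-by-reduce sample) ;; 10

;; Multiplying the sum by the reciprocal of n gives the same
;; result as dividing the sum by n.
(= (* (/ 1 (count sample)) (sum-by-reduce sample))
   (/ (sum-by-reduce sample) (count sample))) ;; true
```

Both functions return the same total; the reduce form is usually preferred because it avoids index bookkeeping altogether.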
Name | Mathematical symbol | Clojure equivalent |
---|---|---|
n | $n$ | (count xs) |
Sigma notation | $\sum_{i=1}^{n} x_i$ | (reduce + xs) |
Pi notation | $\prod_{i=1}^{n} x_i$ | (reduce * xs) |
Putting this all together, we get "add up the elements in the sequence from the first to the last and divide by the count". In Clojure, this can be written as:
(defn mean [xs]
  (/ (reduce + xs) (count xs)))
Here, xs stands for "the sequence of xs". We can use our new mean function to calculate the mean of the UK electorate:
(defn ex-1-7 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (mean)))

;; 70149.94
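As a quick check, the mean function can be tried on a small hand-made sequence. The values below are illustrative and are not taken from the dataset. One detail worth knowing is that Clojure's / returns an exact ratio when integers do not divide evenly, so the result may need converting with double.

```clojure
(defn mean
  "Arithmetic mean: the sum of xs divided by the count of xs."
  [xs]
  (/ (reduce + xs) (count xs)))

;; Integer inputs that do not divide evenly yield an exact ratio.
(mean [1 2 3 4])          ;; 5/2

;; Convert to a floating-point number if a decimal is preferred.
(double (mean [1 2 3 4])) ;; 2.5

;; Floating-point inputs yield a floating-point mean directly.
(mean [1.0 2.0 3.0 4.0])  ;; 2.5
```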
In fact, Incanter already includes a function, mean, in the incanter.stats namespace that calculates the mean of a sequence very efficiently. In this chapter (and throughout the book), the incanter.stats namespace will be required as s wherever it's used.
The median
The median is another common descriptive statistic for measuring the central tendency of a sequence. If you ordered all the data from lowest to highest, the median is the middle value. If there is an even number of data points in the sequence, the median is usually defined as the mean of the middle two values.
The median is often represented in formulae by $\tilde{x}$, pronounced x-tilde. It's one of the deficiencies of mathematical notation that there's no particularly standard way of expressing the formula for the median value, but nonetheless it's fairly straightforward in Clojure:
(defn median [xs]
  (let [n   (count xs)
        mid (int (/ n 2))]
    (if (odd? n)
      (nth (sort xs) mid)
      (->> (sort xs)
           (drop (dec mid))
           (take 2)
           (mean)))))
The median of the UK electorate is:
(defn ex-1-8 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (median)))

;; 70813.5
Incanter also provides a function for calculating the median value, s/median.
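To see both branches of the median definition at work, here is a minimal sketch that applies the hand-written mean and median functions to two small, made-up sequences: one with an odd number of elements and one with an even number. The sample values are assumptions chosen purely for illustration.

```clojure
(defn mean
  "Arithmetic mean: the sum of xs divided by the count of xs."
  [xs]
  (/ (reduce + xs) (count xs)))

(defn median
  "Middle value of the sorted xs, or the mean of the two middle
  values when xs has an even number of elements."
  [xs]
  (let [n   (count xs)
        mid (int (/ n 2))]
    (if (odd? n)
      (nth (sort xs) mid)
      (->> (sort xs)
           (drop (dec mid))
           (take 2)
           (mean)))))

;; Odd count: sorted order is (1 2 3), so the middle value is 2.
(median [3 1 2])   ;; 2

;; Even count: sorted order is (1 2 3 4), so the median is the
;; mean of the middle pair, 2 and 3.
(median [4 1 3 2]) ;; 5/2
```

As with mean, integer inputs can produce an exact ratio in the even case, which can be converted with double if a decimal is preferred.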