Clojure for Data Science

Chapter 2. Inference

 

"I can see nothing," said I, handing it back to my friend.

"On the contrary, Watson, you can see everything. You fail, however, to reason from what you see. You are too timid in drawing your inferences."

 
 --Sir Arthur Conan Doyle, The Adventure of the Blue Carbuncle

In the previous chapter, we introduced a variety of numerical and visual approaches to understand the normal distribution. We discussed descriptive statistics, such as the mean and standard deviation, and how they can be used to summarize large amounts of data succinctly.

A dataset is usually a sample of some larger population. Sometimes, this population is too large to be measured in its entirety. Sometimes, it is intrinsically unmeasurable, either because it is infinite in size or it otherwise cannot be accessed directly. In either case, we are forced to generalize from the data that we have.

In this chapter, we consider statistical inference: how we can go beyond...

Introducing AcmeContent


To help illustrate the concepts in this chapter, let's imagine that we've recently been appointed to the role of data scientist at AcmeContent. The company runs a website that lets visitors share video clips that they've enjoyed online.

One of the metrics AcmeContent tracks through its web analytics is dwell time. This is a measure of how long a visitor stays on the site. Clearly, visitors who spend a long time on the site are enjoying themselves and AcmeContent wants its visitors to stay as long as possible. If the mean dwell time increases, our CEO will be very happy.

Note

Dwell time is the length of time between a visitor's first arrival at the site and their last request to the site.

A bounce is a visitor who makes only one request—their dwell time is zero.

As the company's new data scientist, it falls to us to analyze the dwell time reported by the website's analytics and measure the success of AcmeContent's site.

Download the sample code


The code for this chapter is available at https://github.com/clojuredatascience/ch2-inference or from Packt Publishing's website.

The example data has been generated specifically for this chapter. It's small enough that it has been included with the book's sample code inside the data directory. Consult the book's wiki at http://wiki.clojuredatascience.com for links to further reading about dwell time analysis.

Load and inspect the data


In the previous chapter, we used Incanter to load Excel spreadsheets with the incanter.excel/load-xls function. In this chapter, we will load a dataset from a tab-separated text file. For this, we'll make use of incanter.io/read-dataset, which expects to receive either a URL object or a file path represented as a string.
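
The listings that follow rely on a handful of namespace aliases. A plausible ns declaration consistent with how the aliases are used in this chapter (the namespace name itself is hypothetical) would be:

(ns cljds.ch2.examples
  (:require [clojure.java.io :as io]
            [incanter.core :as i]
            [incanter.charts :as c]
            [incanter.io :as iio]
            [incanter.stats :as s]))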

The file has been helpfully reformatted by AcmeContent's web team to contain just two columns—the date of the request and the dwell time in seconds. There are column headings in the first row, so we pass :header true to read-dataset:

(defn load-data [file]
  (-> (io/resource file)
      (iio/read-dataset :header true :delim \tab)))

(defn ex-2-1 []
  (-> (load-data "dwell-times.tsv")
      (i/view)))

If you run this code (either in the REPL or on the command line with lein run -e 2.1), you should see an output similar to the following:

Let's see what the dwell times look like as a histogram.

Visualizing the dwell times


We can plot a histogram of dwell times by simply extracting the :dwell-time column with i/$:

(defn ex-2-2 []
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 50)
      (i/view)))

The earlier code generates the following histogram:

This is clearly not normally distributed data, nor even a very skewed normal distribution. There is no tail to the left of the peak (a visitor clearly can't be on our site for less than zero seconds). While the data tails off steeply to the right at first, it extends much further along the x axis than we would expect from normally distributed data.

When confronted with distributions like this, where values are mostly small but occasionally extreme, it can be useful to plot the y axis as a log scale. Log scales are used to represent events that cover a very large range. Chart axes are ordinarily linear and they partition a range into equally sized steps like...
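
As a sketch of what such a plot might look like in Incanter (an assumption rather than the book's exact listing, using incanter.charts' log-axis and set-axis):

(defn ex-log-histogram []
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 20)
      ;; swap the default linear y axis for a logarithmic one
      (c/set-axis :y (c/log-axis :label "Log Frequency"))
      (i/view)))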

The exponential distribution


The exponential distribution occurs frequently when considering situations where there are many small positive quantities and far fewer large quantities. Given what we have learned about the Richter scale, it won't be a surprise to learn that the magnitude of earthquakes follows an exponential distribution.

The distribution also frequently occurs in waiting times—the time until the next earthquake of any magnitude roughly follows an exponential distribution as well. The distribution is often used to model failure rates, which is essentially the waiting time until a machine breaks down. Our exponential distribution models a process similar to failure—the waiting time until a visitor gets bored and leaves our site.

The exponential distribution has a number of interesting properties. One relates to the mean and standard deviation:

(defn ex-2-4 []
  (let [dwell-times (->> (load-data "dwell-times.tsv")
                         (i/$ :dwell-time))]
    (println...
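
The listing above is truncated in this excerpt. A sketch of the comparison it describes (an assumption, not necessarily the book's exact code): for exponentially distributed data, the mean and the standard deviation are approximately equal.

(defn ex-2-4-sketch []
  (let [dwell-times (->> (load-data "dwell-times.tsv")
                         (i/$ :dwell-time))]
    ;; for an exponential distribution, the mean and standard deviation
    ;; are both 1/rate, so these two numbers should be close
    (println "Mean:" (s/mean dwell-times))
    (println "SD:  " (s/sd dwell-times))))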

The central limit theorem


We encountered the central limit theorem in the previous chapter when we took samples from a uniform distribution and averaged them. In fact, the central limit theorem works for any distribution of values, provided the distribution has a finite standard deviation.

Note

The central limit theorem states that the distribution of sample means will be normally distributed irrespective of the distribution from which they were calculated.

It doesn't matter that the underlying distribution is exponential—the central limit theorem shows that the mean of random samples taken from any distribution will closely approximate a normal distribution. Let's plot a normal curve over our histogram to see how closely it matches.

To plot a normal curve over our histogram, we have to plot our histogram as a density histogram. This plots the proportion of all the points that have been put in each bucket rather than the frequency. We can then overlay the normal probability density with the...
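
A minimal sketch of this idea (an assumption: the book groups dwell times by calendar date, whereas here we simply batch them into fixed-size groups to keep the example self-contained):

(defn ex-clt-sketch []
  (let [dwell-times  (->> (load-data "dwell-times.tsv")
                          (i/$ :dwell-time))
        sample-means (->> (partition 30 dwell-times)
                          (map s/mean))
        mu           (s/mean sample-means)
        sigma        (s/sd sample-means)]
    (-> (c/histogram sample-means
                     :x-label "Mean dwell time (s)"
                     :nbins 20
                     :density true)
        ;; overlay the normal probability density with the same mean and SD
        (c/add-function #(s/pdf-normal % :mean mu :sd sigma)
                        (apply min sample-means)
                        (apply max sample-means))
        (i/view))))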

Standard error


While the standard deviation measures the amount of variation there is within a sample, the standard error measures the amount of variation there is between the means of samples taken from the same population.

Note

The standard error is the standard deviation of the distribution of the sample means.

We have calculated the standard error of dwell time empirically by looking at the previous 6 months of data. But there is an equation that allows us to calculate it from only a single sample:

SE = σx / √n

Here, σx is the standard deviation and n is the sample size. This is unlike the descriptive statistics that we studied in the previous chapter. While they described a single sample, the standard error attempts to describe a property of samples in general—the amount of variation in the sample means that can be expected for samples of a given size:

(defn standard-deviation [xs]
  (Math/sqrt (variance xs)))

(defn standard-error [xs]
  (/ (standard-deviation xs)
     (Math/sqrt (count xs))))

Samples and populations


The words "sample" and "population" mean something very particular to statisticians. A population is the entire collection of entities that a researcher wishes to understand or draw conclusions about. For example, in the second half of the 19th century, Gregor Johann Mendel, the originator of genetics, recorded observations about pea plants. Although he was studying specific plants in a laboratory, his objective was to understand the underlying mechanisms behind heredity in all possible pea plants.

Note

Statisticians refer to the group of entities from which a sample is drawn as the population, whether or not the objects being studied are people.

Since populations may be large—or in the case of Mendel's pea plants, infinite—we must study representative samples and draw inferences about the population from them. To distinguish the measurable attributes of our samples from the inaccessible attributes of the population, we use the word statistics to refer to the sample...

Confidence intervals


Since the standard error of our sample measures how closely we expect our sample mean to match the population mean, we could also consider the inverse—the standard error measures how closely we expect the population mean to match our measured sample mean. In other words, based on our standard error, we can infer that the population mean lies within some expected range of the sample mean with a certain degree of confidence.

Taken together, the "degree of confidence" and the "expected range" define a confidence interval. While stating confidence intervals, it is fairly standard to state the 95 percent interval—we are 95 percent sure that the population parameter lies within the interval. Of course, there remains a 5 percent possibility that it does not.

Whatever the standard error, 95 percent of the time the population mean will lie within 1.96 standard errors of the sample mean. 1.96 is therefore the critical z-value for a 95 percent confidence interval.

Note

The name...
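
Putting the last two sections together, a minimal sketch of a confidence interval function (an assumption, not necessarily the book's exact definition) multiplies the standard error by the critical z-value and adds and subtracts the result from the sample mean:

(defn confidence-interval [critical-z xs]
  (let [mu (s/mean xs)
        x  (* critical-z (standard-error xs))]
    [(- mu x) (+ mu x)]))

;; e.g. a 95 percent interval for the dwell times:
;; (confidence-interval 1.96 dwell-times)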

Visualizing different populations


Let's remove the filter for weekdays and plot the daily mean dwell time for both weekdays and weekends:

(defn ex-2-12 []
  (let [means (->> (load-data "dwell-times.tsv")
                   (with-parsed-date)
                   (mean-dwell-times-by-date)
                   (i/$ :dwell-time))]
    (-> (c/histogram means
                     :x-label "Daily mean dwell time unfiltered (s)"
                     :nbins 20)
        (i/view))))

The code generates the following histogram:

The distribution is no longer a normal distribution. In fact, the distribution is bimodal—there are two peaks. The second smaller peak, which corresponds to the newly added weekend data, is lower both because there are not as many weekend days as weekdays and because the distribution has a larger standard error.

Note

In general, distributions with more than one peak are referred to as multimodal. They can be an indicator that two or more normal distributions have been combined...

Hypothesis testing


Hypothesis testing is a formal process for statisticians and data scientists. The standard approach to hypothesis testing is to define an area of research, decide which variables are necessary to measure what is being studied, and then to set out two competing hypotheses. In order to avoid only looking at the data that confirms our biases, researchers will state their hypothesis clearly ahead of time. Statistics can then be used to confirm or refute this hypothesis, based on the data.

In order to help retain our visitors, designers go to work on a variation of our home page that uses all the latest techniques to keep the attention of our audience. We'd like to be sure that our effort isn't in vain, so we will look for an increase in dwell time on the new site.

Therefore, our research question is "Does the new site cause the visitor's dwell time to increase?" We decide that this should be tested with reference to the mean dwell time. Now, we need to set out our two hypotheses...

Testing a new site design


The web team at AcmeContent have been hard at work, developing a new site to encourage visitors to stick around for an extended period of time. They've used all the latest techniques and, as a result, we're pretty confident that the site will show a marked improvement in dwell time.

Rather than launching it to all users at once, AcmeContent would like to test the site on a small sample of visitors first. We've educated them about sample bias, and as a result, the web team diverts a random 5 percent of the site traffic to the new site for one day. The result is provided to us as a single text file containing all the day's traffic. Each row shows the dwell time for a visitor, together with a value of either "0" if they used the original site design or "1" if they saw the new (and hopefully improved) site.

Performing a z-test

When testing with confidence intervals previously, we had a single population mean to compare against.

With z-testing, we have the option of comparing...
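
A minimal sketch of a two-sample z-test (an assumption, not necessarily the book's exact code): the z-statistic is the difference between the sample means divided by the pooled standard error, and the one-sided p-value is the probability of observing at least that large a z-statistic under the null hypothesis.

(defn pooled-standard-error [a b]
  (Math/sqrt (+ (Math/pow (standard-error a) 2)
                (Math/pow (standard-error b) 2))))

(defn z-stat [a b]
  (/ (- (s/mean b) (s/mean a))
     (pooled-standard-error a b)))

(defn z-test [a b]
  ;; one-sided p-value from the standard normal CDF
  (- 1 (s/cdf-normal (z-stat a b))))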

The t-statistic


While using the t-distribution, we look up the t-statistic. Like the z-statistic, this value quantifies how unlikely a particular observed deviation is. For a dual sample t-test, the t-statistic is calculated in the following way:

t = (x̄a − x̄b) / Sab

Here, Sab is the pooled standard error. We could calculate the pooled standard error in the same way as we did earlier:

Sab = √(σa²/na + σb²/nb)

However, the equation assumes knowledge of the population parameters σa and σb, which can only be approximated from large samples. The t-test is designed for small samples and does not require us to make assumptions about population variance.

As a result, for the t-test, we write the pooled standard error as the square root of the sum of the squared standard errors:

Sab = √(SEa² + SEb²)

In practice, the earlier two equations for the pooled standard error yield identical results, given the same input sequences. The difference in notation just serves to illustrate that with the t-test, we depend only on sample statistics as input. The pooled standard error can be...
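
A sketch of the t-statistic described above, reusing the pooled-standard-error sketch from the z-test (an assumption; the book may define these helpers differently). The t-test listing in the next section expects a t-stat function of this shape:

(defn t-stat [a b]
  (/ (- (s/mean a) (s/mean b))
     (pooled-standard-error a b)))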

Performing the t-test


The difference in the way the t-test works stems from the probability distribution from which our p-value is calculated. Having calculated our t-statistic, we need to look up the value in the t-distribution parameterized by the degrees of freedom of our data:

(defn t-test [a b]
  (let [df (+ (count a) (count b) -2)]
    (- 1 (s/cdf-t (i/abs (t-stat a b)) :df df))))

The degrees of freedom are two less than the sizes of the samples combined, which is 298 for our samples.

Recall that we are performing a hypothesis test. So, let's state our null and alternate hypotheses:

  • H0: This sample is drawn from a population with a supplied mean

  • H1: This sample is drawn from a population with a greater mean

Let's run the example:

(defn ex-2-16 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        a (get data 0)
        b (get data 1)]
    (t-test a b)))

;; 0.0503

This...
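
One note on the preceding listing: map-vals is not part of clojure.core. A minimal definition consistent with how it is used here (an assumption; the book may define it differently or take it from a utility library) is:

(defn map-vals [f m]
  ;; apply f to every value in the map m, preserving the keys
  (into {} (for [[k v] m] [k (f v)])))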

One-sample t-test


Independent-samples t-tests are the most common sort of statistical analysis, providing a very flexible and generic way of comparing whether two samples represent the same or different populations. However, in cases where the population mean is already known, there is an even simpler test represented by s/simple-t-test.

We pass a sample and a population mean to test against with the :mu keyword. So, if we simply want to see whether our new site is significantly different from the previous population mean dwell time of 90s, we can run a test like this:

(defn ex-2-18 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        b (get data 1)]
    (clojure.pprint/pprint (s/t-test b :mu 90))))

;; {:p-value 0.13789520958229406,
;;  :df 15,
;;  :n2 nil,
;;  :x-mean 122.0,
;;  :y-mean nil,
;;  :x-var 6669.866666666667,
;;  :conf-int [78.48152745280898 165...

Resampling


To develop an intuition as to how the t-test can confirm and calculate these statistics from so little data, we can apply an approach called resampling. Resampling is based on the premise that each sample is just one of an infinite number of possible samples from a population. We can gain an insight into the nature of what these other samples could have been, and therefore have a better understanding of the underlying population, by taking many new samples from our existing sample.

There are actually several resampling techniques, and we'll discuss one of the simplest—bootstrapping. In bootstrapping, we generate a new sample by repeatedly taking a random value from the original sample with replacement until we generate a sample that is of the same size as the original. Because these values are replaced between each random selection, the same source value can appear multiple times in the new sample. It is as if we were drawing a random card from a deck of playing cards repeatedly...
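
A minimal sketch of bootstrapping the mean (an assumption, not the book's exact code): draw values with replacement until the resample is as large as the original, compute its mean, and repeat many times to approximate the sampling distribution of the mean.

(defn bootstrap-mean [xs]
  (let [xs (vec xs)]
    ;; resample with replacement, then take the mean of the resample
    (s/mean (repeatedly (count xs) #(rand-nth xs)))))

(defn bootstrap-means [xs iterations]
  (repeatedly iterations #(bootstrap-mean xs)))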

Testing multiple designs


It's been disappointing to discover that there is no statistical significance behind the increased dwell time of users on the new site design. Better that we discovered this on a small sample of users before we rolled it out to the world though.

Not to be discouraged, AcmeContent's web team works overtime and devises a suite of alternative site designs. Taking the best elements from the other designs, they devise 19 variations to be tested. Together with our original site, which will act as a control, there are 20 different sites to direct visitors to.

Calculating sample means

The web team deploys the 19 new site designs alongside the original site. As mentioned earlier, each receives a random 5 percent of the visitors. We let the test run for 24 hours.

The next day, we receive a file that shows the dwell times for visitors to each of the site designs. Each has been labeled with a number, with site 0 corresponding to the original unaltered design, and numbers 1 to 19...

Multiple comparisons


The fact that with repeated trials we increase the probability of discovering a significant effect is called the multiple comparisons problem. In general, the solution to the problem is to demand more significant effects when comparing many samples. There is no straightforward solution to this issue though; even with an α of 0.01, we will make a Type I error on average 1 percent of the time.

To develop our intuition about how multiple comparisons and statistical significance relate to each other, let's build an interactive web page to simulate the effect of taking multiple samples. It's one of the advantages of using a powerful and general-purpose programming language like Clojure for data analysis that we can run our data processing code in a diverse array of environments.

The code we've written and run so far for this chapter has been compiled for the Java Virtual Machine. But since 2013, there has been an alternative target environment for our compiled code:...

The browser simulation


An HTML page has been supplied in the resources directory of the sample project. Open the page in any modern browser and you should see something similar to the following image:

The left of the page shows a dual histogram with the distribution of two samples, both taken from an exponential distribution. The means of the populations from which the samples are generated are controlled by the sliders at the top right corner of the web page in the box marked as Parameters. Underneath the histogram is a plot showing the two probability densities for the population means based on the samples. These are calculated using the t-distribution, parameterized by the degrees of freedom of the sample. Below these sliders, in a box marked as Settings, are another pair of sliders that set the sample size and confidence intervals for the test. Adjusting the confidence intervals will crop the tails of the t-distributions; at the 95 percent confidence interval, only the central 95 percent...

jStat


As ClojureScript compiles to JavaScript, we can't make use of the libraries that have Java dependencies. Incanter is heavily reliant on several underlying Java libraries, so we have to find an alternative to Incanter for our browser-based statistical analysis.

Note

While building ClojureScript applications, we can't make use of the libraries that depend on Java libraries, as they won't be available in the JavaScript engine which executes our code.

jStat (https://github.com/jstat/jstat) is a JavaScript statistical library. It provides functions to generate sequences according to specific distributions, including the exponential and t-distributions.

To use it, we have to make sure it's available on our webpage. We can do this either by linking to a remote content distribution network (CDN) or by hosting the file ourselves. The advantage of linking to a CDN is that visitors who have previously downloaded jStat for another website can make use of their cached version. However, since our...
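
Once jStat is available as a global object on the page, we can call it from ClojureScript through JavaScript interop. As a sketch (an assumption: the function name is ours, and we rely on jStat's exponential.sample taking a rate parameter), drawing samples from an exponential distribution might look like this:

(defn exponential-sample [rate n]
  ;; draw n values from an exponential distribution with the given rate
  (repeatedly n #(js/jStat.exponential.sample rate)))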

B1


Now that we can generate samples of data in ClojureScript, we'd like to be able to plot them on a histogram. We need a pure Clojure alternative to Incanter that will draw histograms in a web-accessible format; the B1 library (https://github.com/henrygarner/b1) provides just this functionality. The name is derived from the fact that it is adapted and simplified from the ClojureScript library C2, which in turn is a simplification of the popular JavaScript data visualization framework D3.

We'll be using B1's simple utility functions in b1.charts to build histograms out of our data in ClojureScript. B1 does not mandate a particular display format; we could use it to draw on a canvas element or even to build diagrams directly out of the HTML elements. However, B1 does contain functions to convert charts to SVG in b1.svg and these can be displayed in all modern web browsers.

Scalable Vector Graphics

SVG stands for Scalable Vector Graphics and defines a set of tags that represent drawing instructions...

Plotting probability densities


In addition to using jStat to generate samples from the exponential distribution, we'll also use it to calculate the probability density for the t-distribution. We can construct a simple function to wrap the jStat.studentt.pdf(t, df) function, providing the correct t-statistic and degrees of freedom to parameterize the distribution:

(defn pdf-t [t & {:keys [df]}]
  (js/jStat.studentt.pdf t df))

An advantage of using ClojureScript is that we have already written the code to calculate the t-statistic from a sample. The code, which worked in Clojure, can be compiled to ClojureScript with no changes whatsoever:

(defn t-statistic [test {:keys [mean n sd]}]
  (/ (- mean test)
     (/ sd (Math/sqrt n))))

To render the probability density, we can use B1's c/function-area-plot. This will generate an area plot from the line described by a function. The provided function simply needs to accept an x and return the corresponding y.

A slight complication is that the value...

State and Reagent


State in ClojureScript is managed in the same way as Clojure applications—through the use of atoms, refs, or agents. Atoms provide uncoordinated, synchronous access to a single identity and are an excellent choice for storing the application state. Using an atom ensures that the application always sees a single, consistent view of the data.

Reagent is a ClojureScript library that provides a mechanism to update the content of a web page in response to changing the value of an atom. Markup and state are bound together, so that markup is regenerated whenever the application state is updated.

Reagent also provides syntax to render HTML in an idiomatic way using Clojure data structures. This means that both the content and the interactivity of the page can be handled in one language.

Updating state

With data held in a Reagent atom, updating the state is achieved by calling the swap! function with two arguments—the atom we wish to update and a function to transform the state of the...
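
A minimal sketch of this pattern (an assumption; the namespace and state keys are hypothetical, not the book's exact code):

(ns inference.app
  (:require [reagent.core :as r]))

;; a single Reagent atom holds the whole application state
(defonce state
  (r/atom {:sample-size 50
           :confidence  0.95}))

;; swap! takes the atom and a function that transforms the current value
(defn set-sample-size! [n]
  (swap! state assoc :sample-size n))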

Simulating multiple tests


Each time the New Sample button is pressed, a pair of new samples is generated from exponential distributions whose population means are taken from the sliders. The samples are plotted on a histogram and, underneath, a probability density function is drawn showing the standard error for each sample. As the confidence intervals are changed, observe how the acceptable deviation of the standard error changes as well.

Each time the button is pressed, we could think of it as a significance test with an alpha set to the complement of the confidence interval. In other words, if the probability distributions for the sample means overlap at the 95 percent confidence interval, we cannot reject the null hypothesis at the 5 percent significance level.

Observe how, even when the population means are identical, occasional large deviations in the means will occur. Where samples differ by more than our standard error, we can accept the alternate hypothesis. With a confidence level of...

The Bonferroni correction


We therefore require an alternative approach while conducting multiple tests that will account for an increased probability of discovering a significant effect through repeated trials. The Bonferroni correction is a very simple adjustment that ensures we are unlikely to make Type I errors. It does this by adjusting the alpha for our tests.

The adjustment is a simple one—the Bonferroni correction simply divides our desired alpha by the number of tests we are performing. For example, if we had k site designs to test and an experimental alpha of 0.05, the Bonferroni correction is expressed as:

α = 0.05 / k

This is a safe way to mitigate the increased probability of making a Type I error in multiple testing. The following example is identical to ex-2-22, except the alpha value has been divided by the number of groups:

(defn ex-2-23 []
  (let [data (->> (load-data "multiple-sites.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial...

Analysis of variance


Analysis of variance, often shortened to ANOVA, is a series of statistical methods used to measure the statistical significance of the difference between groups. It was developed by Ronald Fisher, an extremely gifted statistician, who also popularized significance testing through his work on biological testing.

Our tests, using the z-statistic and t-statistic, have focused on sample means as the primary mechanism to draw a distinction between the two samples. In each case, we looked for a difference in the means divided by the level of difference we could reasonably expect, as quantified by the standard error.

The mean isn't the only statistic that might indicate a difference between samples. In fact, it is also possible to use the sample variance as an indicator of statistical difference.

To illustrate how this might work, consider the preceding diagram. Each of the three groups on the left could represent samples of dwell times for a specific page with its own mean and...

Description

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations and understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work so that you can continue to be productive as the field evolves into the future.

What you will learn

  • Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
  • Implement the core machine learning techniques of regression, classification, clustering and recommendation
  • Understand the value of simple statistics and distributions in exploratory data analysis
  • Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark
  • Apply suitable analytic approaches for text, graph, and time series data
  • Interpret the terminology that you will encounter in technical papers
  • Import libraries from other JVM languages such as Java and Scala
  • Communicate your findings clearly and convincingly to nontechnical colleagues
Product Details

Publication date: Sep 03, 2015
Length: 608 pages
Edition: 1st
Language: English
ISBN-13: 9781784397180

Table of Contents

11 Chapters
1. Statistics
2. Inference
3. Correlation
4. Classification
5. Big Data
6. Clustering
7. Recommender Systems
8. Network Analysis
9. Time Series
10. Visualization
Index

