Clojure for Data Science: Statistics, big data, and machine learning for Clojure programmers

Chapter 2. Inference

 

"I can see nothing," said I, handing it back to my friend.

"On the contrary, Watson, you can see everything. You fail, however, to reason from what you see. You are too timid in drawing your inferences."

 
 --Sir Arthur Conan Doyle, The Adventure of the Blue Carbuncle

In the previous chapter, we introduced a variety of numerical and visual approaches to understand the normal distribution. We discussed descriptive statistics, such as the mean and standard deviation, and how they can be used to summarize large amounts of data succinctly.

A dataset is usually a sample of some larger population. Sometimes, this population is too large to be measured in its entirety. Sometimes, it is intrinsically unmeasurable, either because it is infinite in size or it otherwise cannot be accessed directly. In either case, we are forced to generalize from the data that we have.

In this chapter, we consider statistical inference: how we can go beyond...


Introducing AcmeContent


To help illustrate the concepts in this chapter, let's imagine that we've recently been appointed to the role of data scientist at AcmeContent. The company runs a website that lets visitors share video clips that they've enjoyed online.

One of the metrics AcmeContent tracks through its web analytics is dwell time. This is a measure of how long a visitor stays on the site. Clearly, visitors who spend a long time on the site are enjoying themselves and AcmeContent wants its visitors to stay as long as possible. If the mean dwell time increases, our CEO will be very happy.

Note

Dwell time is the length of time between the time a visitor first arrives at a website and the time they make their last request to your site.

A bounce is a visitor who makes only one request—their dwell time is zero.

As the company's new data scientist, it falls to us to analyze the dwell time reported by the website's analytics and measure the success of AcmeContent's site.

Download the sample code


The code for this chapter is available at https://github.com/clojuredatascience/ch2-inference or from Packt Publishing's website.

The example data has been generated specifically for this chapter. It's small enough that it has been included with the book's sample code inside the data directory. Consult the book's wiki at http://wiki.clojuredatascience.com for links to further reading about dwell time analysis.

Load and inspect the data


In the previous chapter, we used Incanter to load Excel spreadsheets with the incanter.excel/load-xls function. In this chapter, we will load a dataset from a tab-separated text file. For this, we'll make use of incanter.io/read-dataset, which expects to receive either a URL object or a file path represented as a string.
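The aliases used in this chapter's listings (io, iio, i, c, and s) are assumed to come from a namespace declaration along these lines. This is a sketch only; the exact requires depend on the project's setup, and the namespace name here is made up:

```clojure
(ns cljds.ch2.examples
  "Assumed namespace declaration for this chapter's examples.
   The alias names match those used in the listings."
  (:require [clojure.java.io :as io]      ; io/resource
            [incanter.core :as i]         ; i/view, i/$
            [incanter.io :as iio]         ; iio/read-dataset
            [incanter.charts :as c]       ; c/histogram
            [incanter.stats :as s]))      ; s/t-test, s/cdf-t
```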

The file has been helpfully reformatted by AcmeContent's web team to contain just two columns—the date of the request and the dwell time in seconds. There are column headings in the first row, so we pass :header true to read-dataset:

(defn load-data [file]
  (-> (io/resource file)
      (iio/read-dataset :header true :delim \tab)))

(defn ex-2-1 []
  (-> (load-data "dwell-times.tsv")
      (i/view)))

If you run this code (either in the REPL or on the command line with lein run -e 2.1), you should see an output similar to the following:

Let's see what the dwell times look like as a histogram.

Visualizing the dwell times


We can plot a histogram of dwell times by simply extracting the :dwell-time column with i/$:

(defn ex-2-2 []
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 50)
      (i/view)))

The preceding code generates the following histogram:

This is clearly not normally distributed data, nor even a very skewed normal distribution. There is no tail to the left of the peak (a visitor clearly can't be on our site for less than zero seconds). While the data tails off steeply to the right at first, it extends much further along the x axis than we would expect from normally distributed data.

When confronted with distributions like this, where values are mostly small but occasionally extreme, it can be useful to plot the y axis as a log scale. Log scales are used to represent events that cover a very large range. Chart axes are ordinarily linear and they partition a range into equally sized steps like...

The exponential distribution


The exponential distribution occurs frequently when considering situations where there are many small positive quantities and far fewer large quantities. Given what we have learned about the Richter scale, it won't be a surprise to learn that the magnitude of earthquakes follows an exponential distribution.

The distribution also frequently occurs in waiting times—the time until the next earthquake of any magnitude roughly follows an exponential distribution as well. The distribution is often used to model failure rates, which is essentially the waiting time until a machine breaks down. Our exponential distribution models a process similar to failure—the waiting time until a visitor gets bored and leaves our site.

The exponential distribution has a number of interesting properties. One relates to the mean and standard deviation:

(defn ex-2-4 []
  (let [dwell-times (->> (load-data "dwell-times.tsv")
                         (i/$ :dwell-time))]
    (println...
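The property in question is that an exponential distribution's mean and standard deviation are equal (both 1/rate). The following is a self-contained sketch in plain Clojure, without Incanter; the seed and rate are arbitrary choices, and the samples are drawn with inverse-transform sampling:

```clojure
;; For an exponential distribution, mean = standard deviation = 1/rate.
;; Inverse-transform sampling: if u ~ Uniform(0,1), then
;; -ln(u)/rate follows an exponential distribution with that rate.
(def rng (java.util.Random. 42))          ; fixed seed for repeatability

(defn draw-exponential [rate]
  (- (/ (Math/log (.nextDouble rng)) rate)))

(def samples (doall (repeatedly 100000 #(draw-exponential 0.01))))

(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn sd [xs]
  (let [m (mean xs)]
    (Math/sqrt (mean (map #(Math/pow (- % m) 2) xs)))))

(println "mean:" (mean samples) "sd:" (sd samples))
;; Both values come out close to 1/0.01 = 100
```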

The central limit theorem


We encountered the central limit theorem in the previous chapter when we took samples from a uniform distribution and averaged them. In fact, the central limit theorem works for any distribution of values, provided the distribution has a finite standard deviation.

Note

The central limit theorem states that the distribution of sample means will be normally distributed irrespective of the distribution from which they were calculated.

It doesn't matter that the underlying distribution is exponential—the central limit theorem shows that the mean of random samples taken from any distribution will closely approximate a normal distribution. Let's plot a normal curve over our histogram to see how closely it matches.

To plot a normal curve over our histogram, we have to plot our histogram as a density histogram. This plots the proportion of all the points that have been put in each bucket rather than the frequency. We can then overlay the normal probability density with the...
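The central limit theorem can also be checked numerically. The following plain-Clojure sketch (no Incanter; seed, rate, and sample size are arbitrary choices) draws many samples from an exponential distribution and shows that their means cluster around the population mean, with a spread close to the standard error sigma divided by the square root of n:

```clojure
;; Means of samples drawn from an exponential distribution cluster
;; around the population mean, with spread shrinking as 1/sqrt(n).
(def rng (java.util.Random. 1234))        ; fixed seed for repeatability

(defn draw-exponential [rate]
  (- (/ (Math/log (.nextDouble rng)) rate)))

(defn mean [xs] (/ (reduce + xs) (count xs)))

(def sample-size 30)

;; 1,000 sample means, each from a sample of 30 exponential draws
(def sample-means
  (doall (repeatedly 1000
                     #(mean (repeatedly sample-size
                                        (fn [] (draw-exponential 0.01)))))))

(def grand-mean (mean sample-means))

(def sd-of-means
  (Math/sqrt (mean (map #(Math/pow (- % grand-mean) 2) sample-means))))

(println grand-mean sd-of-means)
;; grand-mean is close to 100; sd-of-means is close to 100 / sqrt(30), about 18.3
```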

Standard error


While the standard deviation measures the amount of variation there is within a sample, the standard error measures the amount of variation there is between the means of samples taken from the same population.

Note

The standard error is the standard deviation of the distribution of the sample means.

We have calculated the standard error of dwell time empirically by looking at the previous 6 months of data. But there is an equation that allows us to calculate it from only a single sample:

SE = σx / √n

Here, σx is the standard deviation and n is the sample size. This is unlike the descriptive statistics that we studied in the previous chapter. While they described a single sample, the standard error attempts to describe a property of samples in general—the amount of variation in the sample means that can be expected for samples of a given size:

(defn standard-deviation [xs]
  (Math/sqrt (variance xs)))

(defn standard-error [xs]
  (/ (standard-deviation xs)
     (Math/sqrt (count xs))))
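The calculation can be tried out end to end with the helpers written in full. The dwell times below are made up for illustration, and the variance uses the n-1 sample denominator:

```clojure
;; Self-contained standard error calculation, with the mean and
;; (sample) variance helpers written out in full.
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn variance [xs]                       ; sample variance, n-1 denominator
  (let [m  (mean xs)
        sq #(Math/pow (- % m) 2)]
    (/ (reduce + (map sq xs))
       (dec (count xs)))))

(defn standard-deviation [xs]
  (Math/sqrt (variance xs)))

(defn standard-error [xs]
  (/ (standard-deviation xs)
     (Math/sqrt (count xs))))

;; Illustrative dwell times in seconds (made-up values)
(standard-error [90.0 95.0 100.0 105.0 110.0])
;; => ~3.54
```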

Samples and populations


The words "sample" and "population" mean something very particular to statisticians. A population is the entire collection of entities that a researcher wishes to understand or draw conclusions about. For example, in the second half of the 19th century, Gregor Johann Mendel, the originator of genetics, recorded observations about pea plants. Although he was studying specific plants in a laboratory, his objective was to understand the underlying mechanisms behind heredity in all possible pea plants.

Note

Statisticians refer to the group of entities from which a sample is drawn as the population, whether or not the objects being studied are people.

Since populations may be large—or in the case of Mendel's pea plants, infinite—we must study representative samples and draw inferences about the population from them. To distinguish the measurable attributes of our samples from the inaccessible attributes of the population, we use the word statistics to refer to the sample...

Confidence intervals


Since the standard error of our sample measures how closely we expect our sample mean to match the population mean, we could also consider the inverse—the standard error measures how closely we expect the population mean to match our measured sample mean. In other words, based on our standard error, we can infer that the population mean lies within some expected range of the sample mean with a certain degree of confidence.

Taken together, the "degree of confidence" and the "expected range" define a confidence interval. While stating confidence intervals, it is fairly standard to state the 95 percent interval—we are 95 percent sure that the population parameter lies within the interval. Of course, there remains a 5 percent possibility that it does not.

Whatever the standard error, 95 percent of the time the population mean will lie within 1.96 standard errors of the sample mean. 1.96 is therefore the critical z-value for a 95 percent confidence interval.

Note

The name...
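A 95 percent confidence interval can be computed from a single sample as the mean plus or minus 1.96 standard errors. A plain-Clojure sketch (the sample values are made up for illustration):

```clojure
;; 95% confidence interval for the population mean from one sample:
;; mean +/- critical-z standard errors.
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn standard-error [xs]
  (let [m  (mean xs)
        sd (Math/sqrt (/ (reduce + (map #(Math/pow (- % m) 2) xs))
                         (dec (count xs))))]
    (/ sd (Math/sqrt (count xs)))))

(defn confidence-interval [critical-z xs]
  (let [m      (mean xs)
        margin (* critical-z (standard-error xs))]
    [(- m margin) (+ m margin)]))

;; Illustrative dwell times (seconds)
(confidence-interval 1.96 [90.0 95.0 100.0 105.0 110.0])
;; => interval of roughly [93.07 106.93]
```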

Visualizing different populations


Let's remove the filter for weekdays and plot the daily mean dwell time for both week days and weekends:

(defn ex-2-12 []
  (let [means (->> (load-data "dwell-times.tsv")
                   (with-parsed-date)
                   (mean-dwell-times-by-date)
                   (i/$ :dwell-time))]
    (-> (c/histogram means
                     :x-label "Daily mean dwell time unfiltered (s)"
                     :nbins 20)
        (i/view))))

The code generates the following histogram:

The distribution is no longer a normal distribution. In fact, the distribution is bimodal—there are two peaks. The second smaller peak, which corresponds to the newly added weekend data, is lower both because there are not as many weekend days as weekdays and because the distribution has a larger standard error.

Note

In general, distributions with more than one peak are referred to as multimodal. They can be an indicator that two or more normal distributions have been combined...

Hypothesis testing


Hypothesis testing is a formal process for statisticians and data scientists. The standard approach to hypothesis testing is to define an area of research, decide which variables are necessary to measure what is being studied, and then to set out two competing hypotheses. In order to avoid only looking at the data that confirms our biases, researchers will state their hypothesis clearly ahead of time. Statistics can then be used to confirm or refute this hypothesis, based on the data.

In order to help retain our visitors, designers go to work on a variation of our home page that uses all the latest techniques to keep the attention of our audience. We'd like to be sure that our effort isn't in vain, so we will look for an increase in dwell time on the new site.

Therefore, our research question is "Does the new site cause the visitor's dwell time to increase?" We decide that this should be tested with reference to the mean dwell time. Now, we need to set out our two hypotheses...

Testing a new site design


The web team at AcmeContent have been hard at work, developing a new site to encourage visitors to stick around for an extended period of time. They've used all the latest techniques and, as a result, we're pretty confident that the site will show a marked improvement in dwell time.

Rather than launching it to all users at once, AcmeContent would like to test the site on a small sample of visitors first. We've educated them about sample bias and, as a result, the web team diverts a random 5 percent of the site traffic to the new site for one day. The result is provided to us as a single text file containing all the day's traffic. Each row shows the dwell time for a visitor, together with a value of either "0" if they used the original site design, or "1" if they saw the new (and hopefully improved) site.

Performing a z-test

While testing with the confidence intervals previously, we had a single population mean to compare to.

With z-testing, we have the option of comparing...

The t-statistic


While using the t-distribution, we look up the t-statistic. Like the z-statistic, this value quantifies how unlikely a particular observed deviation is. For a two-sample t-test, the t-statistic is calculated in the following way:

t = (x̄a - x̄b) / Sab

Here, Sab is the pooled standard error. We could calculate the pooled standard error in the same way as we did earlier:

Sab = √(σa²/na + σb²/nb)

However, the equation assumes knowledge of the population parameters σa and σb, which can only be approximated from large samples. The t-test is designed for small samples and does not require us to make assumptions about population variance.

As a result, for the t-test, we write the pooled standard error as the square root of the sum of the squared standard errors:

Sab = √(SEa² + SEb²)

In practice, the two preceding equations for the pooled standard error yield identical results, given the same input sequences. The difference in notation just serves to illustrate that with the t-test, we depend only on sample statistics as input. The pooled standard error can be...
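That the two formulations agree can be checked directly. A plain-Clojure sketch (the two samples are made up for illustration):

```clojure
;; Two ways of writing the pooled standard error; given the same
;; sample statistics they produce identical results.
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn variance [xs]                       ; sample variance, n-1 denominator
  (let [m (mean xs)]
    (/ (reduce + (map #(Math/pow (- % m) 2) xs))
       (dec (count xs)))))

(defn standard-error [xs]
  (Math/sqrt (/ (variance xs) (count xs))))

;; sqrt(sa^2/na + sb^2/nb)
(defn pooled-standard-error-1 [a b]
  (Math/sqrt (+ (/ (variance a) (count a))
                (/ (variance b) (count b)))))

;; sqrt(SEa^2 + SEb^2)
(defn pooled-standard-error-2 [a b]
  (Math/sqrt (+ (Math/pow (standard-error a) 2)
                (Math/pow (standard-error b) 2))))

(defn t-stat [a b]
  (/ (- (mean a) (mean b))
     (pooled-standard-error-1 a b)))

(def a [90.0 95.0 100.0 105.0 110.0])    ; made-up dwell times
(def b [100.0 105.0 110.0 115.0 120.0])

[(pooled-standard-error-1 a b)
 (pooled-standard-error-2 a b)
 (t-stat a b)]
;; the two pooled standard errors are equal (5.0); t is -2.0
```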

Performing the t-test


The difference in the way the t-test works stems from the probability distribution from which our p-value is calculated. Having calculated our t-statistic, we need to look up the value in the t-distribution parameterized by the degrees of freedom of our data:

(defn t-test [a b]
  (let [df (+ (count a) (count b) -2)]
    (- 1 (s/cdf-t (i/abs (t-stat a b)) :df df))))

The degrees of freedom are two less than the sizes of the samples combined, which is 298 for our samples.

Recall that we are performing a hypothesis test. So, let's state our null and alternate hypotheses:

  • H0: This sample is drawn from a population with a supplied mean

  • H1: This sample is drawn from a population with a greater mean

Let's run the example:

(defn ex-2-16 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        a (get data 0)
        b (get data 1)]
    (t-test a b)))

;; 0.0503

This...

One-sample t-test


Independent-sample t-tests are the most common sort of statistical analysis and provide a very flexible and generic way of comparing whether two samples represent the same or different populations. However, in cases where the population mean is already known, there is an even simpler test, represented by s/t-test.

We pass a sample and a population mean to test against with the :mu keyword. So, if we simply want to see whether our new site is significantly different from the previous population mean dwell time of 90s, we can run a test like this:

(defn ex-2-18 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        b (get data 1)]
    (clojure.pprint/pprint (s/t-test b :mu 90))))

;; {:p-value 0.13789520958229406,
;;  :df 15,
;;  :n2 nil,
;;  :x-mean 122.0,
;;  :y-mean nil,
;;  :x-var 6669.866666666667,
;;  :conf-int [78.48152745280898 165...

Resampling


To develop an intuition as to how the t-test can confirm and calculate these statistics from so little data, we can apply an approach called resampling. Resampling is based on the premise that each sample is just one of an infinite number of possible samples from a population. We can gain an insight into the nature of what these other samples could have been, and therefore have a better understanding of the underlying population, by taking many new samples from our existing sample.

There are actually several resampling techniques, and we'll discuss one of the simplest—bootstrapping. In bootstrapping, we generate a new sample by repeatedly taking a random value from the original sample with replacement until we generate a sample that is of the same size as the original. Because these values are replaced between each random selection, the same source value can appear multiple times in the new sample. It is as if we were drawing a random card from a deck of playing cards repeatedly...
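Bootstrapping can be sketched in a few lines of plain Clojure: rand-nth draws with replacement, so each bootstrapped sample is the same size as the original and may repeat values. The sample data below is made up for illustration:

```clojure
;; Minimal bootstrap: build new samples by drawing from the original
;; sample with replacement, then look at the spread of their means.
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn bootstrap-sample [xs]
  ;; rand-nth draws with replacement, so values can repeat
  (repeatedly (count xs) #(rand-nth xs)))

(defn bootstrap-means [xs iterations]
  (repeatedly iterations #(mean (bootstrap-sample xs))))

;; Illustrative dwell times (seconds)
(def sample [85.0 90.0 95.0 100.0 105.0 110.0 115.0])

(def means (doall (bootstrap-means sample 10000)))

;; The bootstrapped means cluster around the sample mean
(println (mean means))
```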

Testing multiple designs


It's been disappointing to discover that there is no statistical significance behind the increased dwell time of users on the new site design. Better that we discovered this on a small sample of users before we rolled it out to the world though.

Not to be discouraged, AcmeContent's web team works overtime and devises a suite of alternative site designs. Taking the best elements from the other designs, they devise 19 variations to be tested. Together with our original site, which will act as a control, there are 20 different sites to direct visitors to.

Calculating sample means

The web team deploys the 19 new site designs alongside the original site. As mentioned earlier, each receives a random 5 percent of the visitors. We let the test run for 24 hours.

The next day, we receive a file that shows the dwell times for visitors to each of the site designs. Each has been labeled with a number, with site 0 corresponding to the original unaltered design, and numbers 1 to 19...

Multiple comparisons


The fact that, with repeated trials, we increase the probability of discovering a significant effect is called the multiple comparisons problem. In general, the solution to the problem is to demand more significant effects when comparing many samples. There is no straightforward solution to this issue though; even with an α of 0.01, we will make a Type I error on average 1 percent of the time.

To develop our intuition about how multiple comparisons and statistical significance relate to each other, let's build an interactive web page to simulate the effect of taking multiple samples. It's one of the advantages of using a powerful and general-purpose programming language like Clojure for data analysis that we can run our data processing code in a diverse array of environments.

The code we've written and run so far for this chapter has been compiled for the Java Virtual Machine. But since 2013, there has been an alternative target environment for our compiled code:...

The browser simulation


An HTML page has been supplied in the resources directory of the sample project. Open the page in any modern browser and you should see something similar to the following image:

The left of the page shows a dual histogram with the distribution of two samples, both taken from an exponential distribution. The means of the populations from which the samples are generated are controlled by the sliders at the top right corner of the web page in the box marked as Parameters. Underneath the histogram is a plot showing the two probability densities for the population means based on the samples. These are calculated using the t-distribution, parameterized by the degrees of freedom of the sample. Below these sliders, in a box marked as Settings, are another pair of sliders that set the sample size and confidence intervals for the test. Adjusting the confidence intervals will crop the tails of the t-distributions; at the 95 percent confidence interval, only the central 95 percent...

jStat


As ClojureScript compiles to JavaScript, we can't make use of the libraries that have Java dependencies. Incanter is heavily reliant on several underlying Java libraries, so we have to find an alternative to Incanter for our browser-based statistical analysis.

Note

While building ClojureScript applications, we can't make use of the libraries that depend on Java libraries, as they won't be available in the JavaScript engine which executes our code.

jStat (https://github.com/jstat/jstat) is a JavaScript statistical library. It provides functions to generate sequences according to specific distributions, including the exponential and t-distributions.

To use it, we have to make sure it's available on our web page. We can do this either by linking to a remote content delivery network (CDN) or by hosting the file ourselves. The advantage of linking to a CDN is that visitors who have previously downloaded jStat for another website can make use of their cached version. However, since our...

B1


Now that we can generate samples of data in ClojureScript, we'd like to be able to plot them on a histogram. We need a pure Clojure alternative to Incanter that will draw histograms in a web-accessible format; the B1 library (https://github.com/henrygarner/b1) provides just this functionality. The name is derived from the fact that it is adapted and simplified from the ClojureScript library C2, which in turn is a simplification of the popular JavaScript data visualization framework D3.

We'll be using B1's simple utility functions in b1.charts to build histograms out of our data in ClojureScript. B1 does not mandate a particular display format; we could use it to draw on a canvas element or even to build diagrams directly out of the HTML elements. However, B1 does contain functions to convert charts to SVG in b1.svg and these can be displayed in all modern web browsers.

Scalable Vector Graphics

SVG stands for Scalable Vector Graphics and defines a set of tags that represent drawing instructions...

Plotting probability densities


In addition to using jStat to generate samples from the exponential distribution, we'll also use it to calculate the probability density for the t-distribution. We can construct a simple function to wrap the jStat.studentt.pdf(t, df) function, providing the correct t-statistic and degrees of freedom to parameterize the distribution:

(defn pdf-t [t & {:keys [df]}]
  (js/jStat.studentt.pdf t df))

An advantage of using ClojureScript is that we have already written the code to calculate the t-statistic from a sample. The code, which worked in Clojure, can be compiled to ClojureScript with no changes whatsoever:

(defn t-statistic [test {:keys [mean n sd]}]
  (/ (- mean test)
     (/ sd (Math/sqrt n))))

To render the probability density, we can use B1's c/function-area-plot. This will generate an area plot from the line described by a function. The provided function simply needs to accept an x and return the corresponding y.

A slight complication is that the value...

State and Reagent


State in ClojureScript is managed in the same way as in Clojure applications—through the use of atoms, refs, or agents. Atoms provide uncoordinated, synchronous access to a single identity and are an excellent choice for storing application state. Using an atom ensures that the application always sees a single, consistent view of its data.

Reagent is a ClojureScript library that provides a mechanism to update the content of a web page in response to changes in the value of an atom. Markup and state are bound together, so that the markup is regenerated whenever the application state is updated.

Reagent also provides syntax to render HTML in an idiomatic way using Clojure data structures. This means that both the content and the interactivity of the page can be handled in one language.

Updating state

With data held in a Reagent atom, updating the state is achieved by calling the swap! function with two arguments—the atom we wish to update and a function to transform the state of the...
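The same mechanism can be tried in plain Clojure, since atoms and swap! behave identically on the JVM. The state shape here is made up for illustration; in a Reagent application the atom would be a reagent.core/atom, so that components re-render when it changes:

```clojure
;; swap! takes the atom and a function of the current state; any extra
;; arguments are passed along to that function.
(def app-state (atom {:sample-size 30 :samples []}))   ; made-up state shape

(swap! app-state assoc :sample-size 50)
(swap! app-state update :samples conj 42)

@app-state
;; => {:sample-size 50, :samples [42]}
```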

Simulating multiple tests


Each time the New Sample button is pressed, a pair of new samples is generated from exponential distributions whose population means are taken from the sliders. The samples are plotted on a histogram and, underneath, a probability density function is drawn showing the standard error for each sample. As the confidence intervals are changed, observe how the acceptable deviation of the standard error changes as well.

Each time the button is pressed, we could think of it as a significance test with an alpha set to the complement of the confidence interval. In other words, if the probability distributions for the sample means overlap at the 95 percent confidence interval, we cannot reject the null hypothesis at the 5 percent significance level.

Observe how, even when the population means are identical, occasional large deviations in the means will occur. Where samples differ by more than our standard error, we can accept the alternate hypothesis. With a confidence level of...

The Bonferroni correction


We therefore require an alternative approach while conducting multiple tests that will account for an increased probability of discovering a significant effect through repeated trials. The Bonferroni correction is a very simple adjustment that ensures we are unlikely to make Type I errors. It does this by adjusting the alpha for our tests.

The adjustment is a simple one—the Bonferroni correction simply divides our desired alpha by the number of tests we are performing. For example, if we had k site designs to test and an experimental alpha of 0.05, the Bonferroni correction is expressed as:

α = 0.05 / k

This is a safe way to mitigate the increased probability of making a Type I error in multiple testing. The following example is identical to ex-2-22, except that the alpha value has been divided by the number of groups:

(defn ex-2-23 []
  (let [data (->> (load-data "multiple-sites.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial...
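The correction itself is a one-liner; a sketch using the numbers from this example (20 site designs, a desired alpha of 0.05):

```clojure
;; Bonferroni-adjusted significance level: divide alpha by the
;; number of tests being performed.
(defn bonferroni-alpha [alpha k]
  (/ alpha k))

(bonferroni-alpha 0.05 20)
;; => 0.0025
```

Each of the 20 pairwise tests would then be judged against this much stricter threshold.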

Analysis of variance


Analysis of variance, often shortened to ANOVA, is a series of statistical methods used to measure the statistical significance of the difference between groups. It was developed by Ronald Fisher, an extremely gifted statistician, who also popularized significance testing through his work on biological testing.

Our tests, using the z-statistic and t-statistic, have focused on sample means as the primary mechanism to draw a distinction between the two samples. In each case, we looked for a difference in the means divided by the level of difference we could reasonably expect, as quantified by the standard error.

The mean isn't the only statistic that might indicate a difference between samples. In fact, it is also possible to use the sample variance as an indicator of statistical difference.

To illustrate how this might work, consider the preceding diagram. Each of the three groups on the left could represent samples of dwell times for a specific page with its own mean and...


Description

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulas, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations, and understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves into the future.

What you will learn

  • Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
  • Implement the core machine learning techniques of regression, classification, clustering and recommendation
  • Understand the value of simple statistics and distributions in exploratory data analysis
  • Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark
  • Apply suitable analytic approaches for text, graph, and time series data
  • Interpret the terminology that you will encounter in technical papers
  • Import libraries from other JVM languages such as Java and Scala
  • Communicate your findings clearly and convincingly to nontechnical colleagues

Product Details

Publication date : Sep 03, 2015
Length: 608 pages
Edition : 1st
Language : English
ISBN-13 : 9781784397500





Table of Contents

1. Statistics
2. Inference
3. Correlation
4. Classification
5. Big Data
6. Clustering
7. Recommender Systems
8. Network Analysis
9. Time Series
10. Visualization
Index

