Clojure for Data Science

Chapter 2. Inference

 

"I can see nothing," said I, handing it back to my friend.

"On the contrary, Watson, you can see everything. You fail, however, to reason from what you see. You are too timid in drawing your inferences."

 
 --Sir Arthur Conan Doyle, The Adventure of the Blue Carbuncle

In the previous chapter, we introduced a variety of numerical and visual approaches to understand the normal distribution. We discussed descriptive statistics, such as the mean and standard deviation, and how they can be used to summarize large amounts of data succinctly.

A dataset is usually a sample of some larger population. Sometimes, this population is too large to be measured in its entirety. Sometimes, it is intrinsically unmeasurable, either because it is infinite in size or it otherwise cannot be accessed directly. In either case, we are forced to generalize from the data that we have.

In this chapter, we consider statistical inference: how we can go beyond...

Introducing AcmeContent


To help illustrate the concepts in this chapter, let's imagine that we've recently been appointed to the role of data scientist at AcmeContent. The company runs a website that lets visitors share video clips that they've enjoyed online.

One of the metrics AcmeContent tracks through its web analytics is dwell time. This is a measure of how long a visitor stays on the site. Clearly, visitors who spend a long time on the site are enjoying themselves and AcmeContent wants its visitors to stay as long as possible. If the mean dwell time increases, our CEO will be very happy.

Note

Dwell time is the length of time between a visitor's first arrival at the site and their last request to the site.

A bounce is a visitor who makes only one request—their dwell time is zero.

As the company's new data scientist, it falls to us to analyze the dwell time reported by the website's analytics and measure the success of AcmeContent's site.

Download the sample code


The code for this chapter is available at https://github.com/clojuredatascience/ch2-inference or from Packt Publishing's website.

The example data has been generated specifically for this chapter. It's small enough that it has been included with the book's sample code inside the data directory. Consult the book's wiki at http://wiki.clojuredatascience.com for links to further reading about dwell time analysis.

Load and inspect the data


In the previous chapter, we used Incanter to load Excel spreadsheets with the incanter.excel/load-xls function. In this chapter, we will load a dataset from a tab-separated text file. For this, we'll make use of incanter.io/read-dataset, which expects to receive either a URL object or a file path represented as a string.
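
The listings that follow rely on a handful of namespace aliases. A plausible ns declaration consistent with how the aliases are used in this chapter (the namespace name itself is hypothetical) would be:

(ns cljds.ch2.examples
  (:require [clojure.java.io :as io]
            [incanter.core :as i]
            [incanter.charts :as c]
            [incanter.io :as iio]
            [incanter.stats :as s]))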

The file has been helpfully reformatted by AcmeContent's web team to contain just two columns—the date of the request and the dwell time in seconds. There are column headings in the first row, so we pass :header true to read-dataset:

(defn load-data [file]
  (-> (io/resource file)
      (iio/read-dataset :header true :delim \tab)))

(defn ex-2-1 []
  (-> (load-data "dwell-times.tsv")
      (i/view)))

If you run this code (either in the REPL or on the command line with lein run -e 2.1), you should see an output similar to the following:

Let's see what the dwell times look like as a histogram.

Visualizing the dwell times


We can plot a histogram of dwell times by simply extracting the :dwell-time column with i/$:

(defn ex-2-2 []
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 50)
      (i/view)))

The earlier code generates the following histogram:

This is clearly not normally distributed data, nor even a very skewed normal distribution. There is no tail to the left of the peak (a visitor clearly can't be on our site for less than zero seconds). While the data tails off steeply to the right at first, it extends much further along the x axis than we would expect from normally distributed data.

When confronted with distributions like this, where values are mostly small but occasionally extreme, it can be useful to plot the y axis as a log scale. Log scales are used to represent events that cover a very large range. Chart axes are ordinarily linear and they partition a range into equally sized steps like...
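
As a sketch of what such a plot might look like in Incanter (an assumption rather than the book's exact listing, using incanter.charts' log-axis and set-axis):

(defn ex-log-histogram []
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 20)
      ;; swap the default linear y axis for a logarithmic one
      (c/set-axis :y (c/log-axis :label "Log Frequency"))
      (i/view)))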

The exponential distribution


The exponential distribution occurs frequently when considering situations where there are many small positive quantities and far fewer large quantities. Given what we have learned about the Richter scale, it won't be a surprise to learn that the magnitude of earthquakes follows an exponential distribution.

The distribution also frequently occurs in waiting times—the time until the next earthquake of any magnitude roughly follows an exponential distribution as well. The distribution is often used to model failure rates, which is essentially the waiting time until a machine breaks down. Our exponential distribution models a process similar to failure—the waiting time until a visitor gets bored and leaves our site.

The exponential distribution has a number of interesting properties. One relates to the mean and standard deviation:

(defn ex-2-4 []
  (let [dwell-times (->> (load-data "dwell-times.tsv")
                         (i/$ :dwell-time))]
    (println...
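
The listing above is truncated in this excerpt. A sketch of the comparison it describes (an assumption, not necessarily the book's exact code): for exponentially distributed data, the mean and the standard deviation are approximately equal.

(defn ex-2-4-sketch []
  (let [dwell-times (->> (load-data "dwell-times.tsv")
                         (i/$ :dwell-time))]
    ;; for an exponential distribution, the mean and standard deviation
    ;; are both 1/rate, so these two numbers should be close
    (println "Mean:" (s/mean dwell-times))
    (println "SD:  " (s/sd dwell-times))))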

The central limit theorem


We encountered the central limit theorem in the previous chapter when we took samples from a uniform distribution and averaged them. In fact, the central limit theorem works for any distribution of values, provided the distribution has a finite standard deviation.

Note

The central limit theorem states that the distribution of sample means will be normally distributed irrespective of the distribution from which they were calculated.

It doesn't matter that the underlying distribution is exponential—the central limit theorem shows that the mean of random samples taken from any distribution will closely approximate a normal distribution. Let's plot a normal curve over our histogram to see how closely it matches.

To plot a normal curve over our histogram, we have to plot our histogram as a density histogram. This plots the proportion of all the points that have been put in each bucket rather than the frequency. We can then overlay the normal probability density with the...
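
A minimal sketch of this idea (an assumption: the book groups dwell times by calendar date, whereas here we simply batch them into fixed-size groups to keep the example self-contained):

(defn ex-clt-sketch []
  (let [dwell-times  (->> (load-data "dwell-times.tsv")
                          (i/$ :dwell-time))
        sample-means (->> (partition 30 dwell-times)
                          (map s/mean))
        mu           (s/mean sample-means)
        sigma        (s/sd sample-means)]
    (-> (c/histogram sample-means
                     :x-label "Mean dwell time (s)"
                     :nbins 20
                     :density true)
        ;; overlay the normal probability density with the same mean and SD
        (c/add-function #(s/pdf-normal % :mean mu :sd sigma)
                        (apply min sample-means)
                        (apply max sample-means))
        (i/view))))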

Standard error


While the standard deviation measures the amount of variation there is within a sample, the standard error measures the amount of variation there is between the means of samples taken from the same population.

Note

The standard error is the standard deviation of the distribution of the sample means.

We have calculated the standard error of dwell time empirically by looking at the previous 6 months of data. But there is an equation that allows us to calculate it from only a single sample:

SE = σx / √n

Here, σx is the standard deviation and n is the sample size. This is unlike the descriptive statistics that we studied in the previous chapter. While they described a single sample, the standard error attempts to describe a property of samples in general—the amount of variation in the sample means that can be expected for samples of a given size:

(defn standard-deviation [xs]
  (Math/sqrt (variance xs)))

(defn standard-error [xs]
  (/ (standard-deviation xs)
     (Math/sqrt (count xs))))

Samples and populations


The words "sample" and "population" mean something very particular to statisticians. A population is the entire collection of entities that a researcher wishes to understand or draw conclusions about. For example, in the second half of the 19th century, Gregor Johann Mendel, the originator of genetics, recorded observations about pea plants. Although he was studying specific plants in a laboratory, his objective was to understand the underlying mechanisms behind heredity in all possible pea plants.

Note

Statisticians refer to the group of entities from which a sample is drawn as the population, whether or not the objects being studied are people.

Since populations may be large—or in the case of Mendel's pea plants, infinite—we must study representative samples and draw inferences about the population from them. To distinguish the measurable attributes of our samples from the inaccessible attributes of the population, we use the word statistics to refer to the sample...

Confidence intervals


Since the standard error of our sample measures how closely we expect our sample mean to match the population mean, we could also consider the inverse—the standard error measures how closely we expect the population mean to match our measured sample mean. In other words, based on our standard error, we can infer that the population mean lies within some expected range of the sample mean with a certain degree of confidence.

Taken together, the "degree of confidence" and the "expected range" define a confidence interval. While stating confidence intervals, it is fairly standard to state the 95 percent interval—we are 95 percent sure that the population parameter lies within the interval. Of course, there remains a 5 percent possibility that it does not.

Whatever the standard error, 95 percent of the time the population mean will lie within 1.96 standard errors of the sample mean. 1.96 is therefore the critical z-value for a 95 percent confidence interval.

Note

The name...
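
Putting the last two sections together, a minimal sketch of a confidence interval function (an assumption, not necessarily the book's exact definition) multiplies the standard error by the critical z-value and adds and subtracts the result from the sample mean:

(defn confidence-interval [critical-z xs]
  (let [mu (s/mean xs)
        x  (* critical-z (standard-error xs))]
    [(- mu x) (+ mu x)]))

;; e.g. a 95 percent interval for the dwell times:
;; (confidence-interval 1.96 dwell-times)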

Visualizing different populations


Let's remove the filter for weekdays and plot the daily mean dwell time for both weekdays and weekends:

(defn ex-2-12 []
  (let [means (->> (load-data "dwell-times.tsv")
                   (with-parsed-date)
                   (mean-dwell-times-by-date)
                   (i/$ :dwell-time))]
    (-> (c/histogram means
                     :x-label "Daily mean dwell time unfiltered (s)"
                     :nbins 20)
        (i/view))))

The code generates the following histogram:

The distribution is no longer a normal distribution. In fact, the distribution is bimodal—there are two peaks. The second smaller peak, which corresponds to the newly added weekend data, is lower both because there are not as many weekend days as weekdays and because the distribution has a larger standard error.

Note

In general, distributions with more than one peak are referred to as multimodal. They can be an indicator that two or more normal distributions have been combined...

Hypothesis testing


Hypothesis testing is a formal process for statisticians and data scientists. The standard approach to hypothesis testing is to define an area of research, decide which variables are necessary to measure what is being studied, and then to set out two competing hypotheses. In order to avoid only looking at the data that confirms our biases, researchers will state their hypothesis clearly ahead of time. Statistics can then be used to confirm or refute this hypothesis, based on the data.

In order to help retain our visitors, designers go to work on a variation of our home page that uses all the latest techniques to keep the attention of our audience. We'd like to be sure that our effort isn't in vain, so we will look for an increase in dwell time on the new site.

Therefore, our research question is "Does the new site cause the visitor's dwell time to increase?" We decide that this should be tested with reference to the mean dwell time. Now, we need to set out our two hypotheses...

Testing a new site design


The web team at AcmeContent have been hard at work, developing a new site to encourage visitors to stick around for an extended period of time. They've used all the latest techniques and, as a result, we're pretty confident that the site will show a marked improvement in dwell time.

Rather than launching it to all users at once, AcmeContent would like to test the site on a small sample of visitors first. We've educated them about sample bias, and as a result, the web team diverts a random 5 percent of the site traffic to the new site for one day. The result is provided to us as a single text file containing all the day's traffic. Each row shows the dwell time for a visitor, together with a value of either "0" if they used the original site design or "1" if they saw the new (and hopefully improved) site.

Performing a z-test

When testing with confidence intervals previously, we had a single population mean to compare against.

With z-testing, we have the option of comparing...
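
A minimal sketch of a two-sample z-test (an assumption, not necessarily the book's exact code): the z-statistic is the difference between the sample means divided by the pooled standard error, and the one-sided p-value is the probability of observing at least that large a z-statistic under the null hypothesis.

(defn pooled-standard-error [a b]
  (Math/sqrt (+ (Math/pow (standard-error a) 2)
                (Math/pow (standard-error b) 2))))

(defn z-stat [a b]
  (/ (- (s/mean b) (s/mean a))
     (pooled-standard-error a b)))

(defn z-test [a b]
  ;; one-sided p-value from the standard normal CDF
  (- 1 (s/cdf-normal (z-stat a b))))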

The t-statistic


While using the t-distribution, we look up the t-statistic. Like the z-statistic, this value quantifies how unlikely a particular observed deviation is. For a dual sample t-test, the t-statistic is calculated in the following way:

t = (x̄a − x̄b) / Sab

Here, Sab is the pooled standard error. We could calculate the pooled standard error in the same way as we did earlier:

Sab = √(σa²/na + σb²/nb)

However, the equation assumes knowledge of the population parameters σa and σb, which can only be approximated from large samples. The t-test is designed for small samples and does not require us to make assumptions about population variance.

As a result, for the t-test, we write the pooled standard error as the square root of the sum of the squared standard errors:

Sab = √(SEa² + SEb²)

In practice, the earlier two equations for the pooled standard error yield identical results, given the same input sequences. The difference in notation just serves to illustrate that with the t-test, we depend only on sample statistics as input. The pooled standard error can be...
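
A sketch of the t-statistic described above, reusing the pooled-standard-error sketch from the z-test (an assumption; the book may define these helpers differently). The t-test listing in the next section expects a t-stat function of this shape:

(defn t-stat [a b]
  (/ (- (s/mean a) (s/mean b))
     (pooled-standard-error a b)))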

Performing the t-test


The difference in the way the t-test works stems from the probability distribution from which our p-value is calculated. Having calculated our t-statistic, we need to look up the value in the t-distribution parameterized by the degrees of freedom of our data:

(defn t-test [a b]
  (let [df (+ (count a) (count b) -2)]
    (- 1 (s/cdf-t (i/abs (t-stat a b)) :df df))))

The degrees of freedom are two less than the sizes of the samples combined, which is 298 for our samples.

Recall that we are performing a hypothesis test. So, let's state our null and alternate hypotheses:

  • H0: This sample is drawn from a population with a supplied mean

  • H1: This sample is drawn from a population with a greater mean

Let's run the example:

(defn ex-2-16 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        a (get data 0)
        b (get data 1)]
    (t-test a b)))

;; 0.0503

This...
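
One note on the preceding listing: map-vals is not part of clojure.core. A minimal definition consistent with how it is used here (an assumption; the book may define it differently or take it from a utility library) is:

(defn map-vals [f m]
  ;; apply f to every value in the map m, preserving the keys
  (into {} (for [[k v] m] [k (f v)])))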

One-sample t-test


Independent-samples t-tests are the most common sort of statistical analysis, providing a very flexible and generic way of comparing whether two samples represent the same or different populations. However, in cases where the population mean is already known, there is an even simpler test represented by s/simple-t-test.

We pass a sample and a population mean to test against with the :mu keyword. So, if we simply want to see whether our new site is significantly different from the previous population mean dwell time of 90s, we can run a test like this:

(defn ex-2-18 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        b (get data 1)]
    (clojure.pprint/pprint (s/t-test b :mu 90))))

;; {:p-value 0.13789520958229406,
;;  :df 15,
;;  :n2 nil,
;;  :x-mean 122.0,
;;  :y-mean nil,
;;  :x-var 6669.866666666667,
;;  :conf-int [78.48152745280898 165...

Resampling


To develop an intuition as to how the t-test can confirm and calculate these statistics from so little data, we can apply an approach called resampling. Resampling is based on the premise that each sample is just one of an infinite number of possible samples from a population. We can gain an insight into the nature of what these other samples could have been, and therefore have a better understanding of the underlying population, by taking many new samples from our existing sample.

There are actually several resampling techniques, and we'll discuss one of the simplest—bootstrapping. In bootstrapping, we generate a new sample by repeatedly taking a random value from the original sample with replacement until we generate a sample that is of the same size as the original. Because these values are replaced between each random selection, the same source value can appear multiple times in the new sample. It is as if we were drawing a random card from a deck of playing cards repeatedly...
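
A minimal sketch of bootstrapping the mean (an assumption, not the book's exact code): draw values with replacement until the resample is as large as the original, compute its mean, and repeat many times to approximate the sampling distribution of the mean.

(defn bootstrap-mean [xs]
  (let [xs (vec xs)]
    ;; resample with replacement, then take the mean of the resample
    (s/mean (repeatedly (count xs) #(rand-nth xs)))))

(defn bootstrap-means [xs iterations]
  (repeatedly iterations #(bootstrap-mean xs)))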

Testing multiple designs


It's been disappointing to discover that there is no statistical significance behind the increased dwell time of users on the new site design. Better that we discovered this on a small sample of users before we rolled it out to the world though.

Not to be discouraged, AcmeContent's web team works overtime and devises a suite of alternative site designs. Taking the best elements from the other designs, they devise 19 variations to be tested. Together with our original site, which will act as a control, there are 20 different sites to direct visitors to.

Calculating sample means

The web team deploys the 19 new site designs alongside the original site. As mentioned earlier, each receives a random 5 percent of the visitors. We let the test run for 24 hours.

The next day, we receive a file that shows the dwell times for visitors to each of the site designs. Each has been labeled with a number, with site 0 corresponding to the original unaltered design, and numbers 1 to 19...

Multiple comparisons


The fact that with repeated trials we increase the probability of discovering a significant effect is called the multiple comparisons problem. In general, the solution to the problem is to demand more significant effects when comparing many samples. There is no straightforward solution to this issue though; even with an α of 0.01, we will make a Type I error on average 1 percent of the time.

To develop our intuition about how multiple comparisons and statistical significance relate to each other, let's build an interactive web page to simulate the effect of taking multiple samples. It's one of the advantages of using a powerful and general-purpose programming language like Clojure for data analysis that we can run our data processing code in a diverse array of environments.

The code we've written and run so far for this chapter has been compiled for the Java Virtual Machine. But since 2013, there has been an alternative target environment for our compiled code:...

The browser simulation


An HTML page has been supplied in the resources directory of the sample project. Open the page in any modern browser and you should see something similar to the following image:

The left of the page shows a dual histogram with the distribution of two samples, both taken from an exponential distribution. The means of the populations from which the samples are generated are controlled by the sliders at the top right corner of the web page in the box marked as Parameters. Underneath the histogram is a plot showing the two probability densities for the population means based on the samples. These are calculated using the t-distribution, parameterized by the degrees of freedom of the sample. Below these sliders, in a box marked as Settings, are another pair of sliders that set the sample size and confidence intervals for the test. Adjusting the confidence intervals will crop the tails of the t-distributions; at the 95 percent confidence interval, only the central 95 percent...

jStat


As ClojureScript compiles to JavaScript, we can't make use of the libraries that have Java dependencies. Incanter is heavily reliant on several underlying Java libraries, so we have to find an alternative to Incanter for our browser-based statistical analysis.

Note

While building ClojureScript applications, we can't make use of the libraries that depend on Java libraries, as they won't be available in the JavaScript engine which executes our code.

jStat (https://github.com/jstat/jstat) is a JavaScript statistical library. It provides functions to generate sequences according to specific distributions, including the exponential and t-distributions.

To use it, we have to make sure it's available on our webpage. We can do this either by linking to a remote content distribution network (CDN) or by hosting the file ourselves. The advantage of linking to a CDN is that visitors who have previously downloaded jStat for another website can make use of their cached version. However, since our...
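
Once jStat is available as a global object on the page, we can call it from ClojureScript through JavaScript interop. As a sketch (an assumption: the function name is ours, and we rely on jStat's exponential.sample taking a rate parameter), drawing samples from an exponential distribution might look like this:

(defn exponential-sample [rate n]
  ;; draw n values from an exponential distribution with the given rate
  (repeatedly n #(js/jStat.exponential.sample rate)))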

B1


Now that we can generate samples of data in ClojureScript, we'd like to be able to plot them on a histogram. We need a pure Clojure alternative to Incanter that will draw histograms in a web-accessible format; the B1 library (https://github.com/henrygarner/b1) provides just this functionality. The name is derived from the fact that it is adapted and simplified from the ClojureScript library C2, which in turn is a simplification of the popular JavaScript data visualization framework D3.

We'll be using B1's simple utility functions in b1.charts to build histograms out of our data in ClojureScript. B1 does not mandate a particular display format; we could use it to draw on a canvas element or even to build diagrams directly out of the HTML elements. However, B1 does contain functions to convert charts to SVG in b1.svg and these can be displayed in all modern web browsers.

Scalable Vector Graphics

SVG stands for Scalable Vector Graphics and defines a set of tags that represent drawing instructions...

Plotting probability densities


In addition to using jStat to generate samples from the exponential distribution, we'll also use it to calculate the probability density for the t-distribution. We can construct a simple function to wrap the jStat.studentt.pdf(t, df) function, providing the correct t-statistic and degrees of freedom to parameterize the distribution:

(defn pdf-t [t & {:keys [df]}]
  (js/jStat.studentt.pdf t df))

An advantage of using ClojureScript is that we have already written the code to calculate the t-statistic from a sample. The code, which worked in Clojure, can be compiled to ClojureScript with no changes whatsoever:

(defn t-statistic [test {:keys [mean n sd]}]
  (/ (- mean test)
     (/ sd (Math/sqrt n))))

To render the probability density, we can use B1's c/function-area-plot. This will generate an area plot from the line described by a function. The provided function simply needs to accept an x and return the corresponding y.

A slight complication is that the value...

State and Reagent


State in ClojureScript is managed in the same way as Clojure applications—through the use of atoms, refs, or agents. Atoms provide uncoordinated, synchronous access to a single identity and are an excellent choice for storing the application state. Using an atom ensures that the application always sees a single, consistent view of the data.

Reagent is a ClojureScript library that provides a mechanism to update the content of a web page in response to changing the value of an atom. Markup and state are bound together, so that markup is regenerated whenever the application state is updated.

Reagent also provides syntax to render HTML in an idiomatic way using Clojure data structures. This means that both the content and the interactivity of the page can be handled in one language.

Updating state

With data held in a Reagent atom, updating the state is achieved by calling the swap! function with two arguments—the atom we wish to update and a function to transform the state of the...
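
A minimal sketch of this pattern (an assumption; the namespace and state keys are hypothetical, not the book's exact code):

(ns inference.app
  (:require [reagent.core :as r]))

;; a single Reagent atom holds the whole application state
(defonce state
  (r/atom {:sample-size 50
           :confidence  0.95}))

;; swap! takes the atom and a function that transforms the current value
(defn set-sample-size! [n]
  (swap! state assoc :sample-size n))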

Simulating multiple tests


Each time the New Sample button is pressed, a pair of new samples is generated from exponential distributions whose population means are taken from the sliders. The samples are plotted on a histogram and, underneath, a probability density function is drawn showing the standard error for each sample. As the confidence intervals are changed, observe how the acceptable deviation of the standard error changes as well.

Each time the button is pressed, we could think of it as a significance test with an alpha set to the complement of the confidence interval. In other words, if the probability distributions for the sample means overlap at the 95 percent confidence interval, we cannot reject the null hypothesis at the 5 percent significance level.

Observe how, even when the population means are identical, occasional large deviations in the means will occur. Where samples differ by more than our standard error, we can accept the alternate hypothesis. With a confidence level of...

The Bonferroni correction


We therefore require an alternative approach while conducting multiple tests that will account for an increased probability of discovering a significant effect through repeated trials. The Bonferroni correction is a very simple adjustment that ensures we are unlikely to make Type I errors. It does this by adjusting the alpha for our tests.

The adjustment is a simple one—the Bonferroni correction simply divides our desired alpha by the number of tests we are performing. For example, if we had k site designs to test and an experimental alpha of 0.05, the Bonferroni correction is expressed as:

α = 0.05 / k

This is a safe way to mitigate the increased probability of making a Type I error in multiple testing. The following example is identical to ex-2-22, except the alpha value has been divided by the number of groups:

(defn ex-2-23 []
  (let [data (->> (load-data "multiple-sites.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial...

Analysis of variance


Analysis of variance, often shortened to ANOVA, is a series of statistical methods used to measure the statistical significance of the difference between groups. It was developed by Ronald Fisher, an extremely gifted statistician, who also popularized significance testing through his work on biological testing.

Our tests, using the z-statistic and t-statistic, have focused on sample means as the primary mechanism to draw a distinction between the two samples. In each case, we looked for a difference in the means divided by the level of difference we could reasonably expect, as quantified by the standard error.

The mean isn't the only statistic that might indicate a difference between samples. In fact, it is also possible to use the sample variance as an indicator of statistical difference.

To illustrate how this might work, consider the preceding diagram. Each of the three groups on the left could represent samples of dwell times for a specific page with its own mean and...

Description

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations and understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work so that you can continue to be productive as the field evolves into the future.

What you will learn

  • Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
  • Implement the core machine learning techniques of regression, classification, clustering and recommendation
  • Understand the value of simple statistics and distributions in exploratory data analysis
  • Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark
  • Apply suitable analytic approaches for text, graph, and time series data
  • Interpret the terminology that you will encounter in technical papers
  • Import libraries from other JVM languages such as Java and Scala
  • Communicate your findings clearly and convincingly to nontechnical colleagues
Product Details

Publication date: Sep 03, 2015
Length: 608 pages
Edition: 1st
Language: English
ISBN-13: 9781784397180

Table of Contents

11 Chapters
1. Statistics
2. Inference
3. Correlation
4. Classification
5. Big Data
6. Clustering
7. Recommender Systems
8. Network Analysis
9. Time Series
10. Visualization
Index

