Clojure Data Analysis Cookbook - Second Edition

Dive into data analysis with Clojure through over 100 practical recipes for every stage of the analysis and collection process

Product type Paperback

Published in Jan 2015

ISBN-13 9781784390297

Length 372 pages

Edition 2nd Edition

Languages

Clojure

Tools

Leiningen

Concepts

Data Analysis

Author (1):

Eric Richard Rochester

Reading JSON data into Incanter datasets

Another data format that's becoming increasingly popular is JavaScript Object Notation (JSON, http://json.org/). Like CSV, this is a plain text format, so it's easy for programs to work with. It provides more information about the data than CSV does, but at the cost of being more verbose. It also allows the data to be structured in more complicated ways, such as hierarchies or sequences of hierarchies.

Because JSON is a much richer data model than CSV, we might need to transform the data. In that case, we can just pull out the information we're interested in and flatten the nested maps before we pass it to Incanter. In this recipe, however, we'll just work with fairly simple data structures.
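
For data that does need flattening, the step might look something like the following sketch. The function name flatten-map and the choice of joining nested keys with a dot are assumptions made for illustration; they are not part of this recipe or of Incanter.

```clojure
;; Hypothetical helper: collapse nested maps into a single-level map.
;; Nested keys are joined with "." (an arbitrary choice for this sketch).
(defn flatten-map
  ([m] (flatten-map m nil))
  ([m prefix]
   (reduce-kv
     (fn [acc k v]
       (let [full-key (if prefix (str prefix "." k) (str k))]
         (if (map? v)
           (merge acc (flatten-map v full-key))  ; recurse into nested maps
           (assoc acc full-key v))))             ; keep leaf values as they are
     {}
     m)))

(flatten-map {"author" "mccarrd4"
              "title_detail" {"type" "text/plain"
                              "language" nil}})
;; => {"author" "mccarrd4"
;;     "title_detail.type" "text/plain"
;;     "title_detail.language" nil}
```

Each flattened map then has only scalar values, which is the shape a tabular dataset expects.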

Getting ready

First, here are the contents of the Leiningen project.clj file:

```
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.json "0.2.5"]])
```

Load these libraries in your REPL (in a program, put the equivalent :require and :import clauses inside the ns form):

```
(require '[incanter.core :as i]
         '[clojure.data.json :as json]
         '[clojure.java.io :as io])
(import '[java.io EOFException])
```
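
In a source file, the same dependencies would normally be declared in the ns form. Here is a minimal sketch; the namespace name getting-data.core is an assumption, so use whatever matches your own file path.

```clojure
;; Assumed namespace name; adjust it to match src/<path>/<file>.clj.
(ns getting-data.core
  (:require [incanter.core :as i]
            [clojure.data.json :as json]
            [clojure.java.io :as io])
  (:import [java.io EOFException]))
```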

You'll also need some data. For this recipe, I'm using a file named delicious-rss-214k.json, which I've placed in a folder named data. It contains a number of top-level JSON objects. For example, the first one starts like this:

```
{
    "guidislink": false,
    "link": "http://designreviver.com/tips/a-collection-of-wordpress-tutorials-tips-and-themes/",
    "title_detail": {
        "base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100",
        "value": "A Collection of Wordpress Tutorials, Tips and Themes | Design Reviver",
        "language": null,
        "type": "text/plain"
    },
    "author": "mccarrd4",
…
```

You can download this data file from Infochimps at http://www.ericrochester.com/clj-data-analysis/data/delicious-rss-214k.json.xz. The file is xz-compressed, so you'll need to decompress it into the data directory.

How to do it…

Once everything's in place, we'll need a couple of functions to make it easier to handle the multiple JSON objects at the top level of the file:

We'll need a function that calls another function on an instance of java.io.Reader and returns nil if an EOFException is thrown. By default, this exception is how json/read signals that it has run out of input:
```
(defn test-eof
  "Calls f on reader; returns nil instead of throwing on EOFException."
  [reader f]
  (try
    (f reader)
    (catch EOFException e
      nil)))
```
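
To see what test-eof does on its own, it can be tried against in-memory readers. This is only an illustrative sketch: the StringReader inputs are stand-ins invented here, and the behavior shown assumes the data.json version pinned in project.clj.

```clojure
(require '[clojure.data.json :as json])
(import '[java.io EOFException StringReader])

;; test-eof as defined in the recipe, repeated so the snippet runs on its own.
(defn test-eof [reader f]
  (try
    (f reader)
    (catch EOFException e
      nil)))

;; One complete JSON object: the parsed map comes back (string keys by default).
(test-eof (StringReader. "{\"author\": \"mccarrd4\"}") json/read)
;; => {"author" "mccarrd4"}

;; Nothing left to read: json/read throws EOFException, so we get nil.
(test-eof (StringReader. "") json/read)
;; => nil
```
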
Now, we'll build on this to parse every JSON document available from an instance of java.io.Reader. We do this by calling test-eof in a loop until it returns nil at the end of the input, accumulating the parsed values as we go:
```
(defn read-all-json
  "Reads JSON documents from reader until test-eof returns nil;
  returns them as a vector."
  [reader]
  (loop [accum []]
    (if-let [record (test-eof reader json/read)]
      (recur (conj accum record))
      accum)))
```
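
The loop can be exercised without the data file by feeding it a string. The sketch below also illustrates one caveat: because if-let treats nil as "stop", a top-level JSON null ends the loop early. The helper string-reader and the variant read-all-json-sentinel are hypothetical names introduced for illustration; the variant assumes json/read accepts the :eof-error? and :eof-value options, as its docstring in data.json 0.2.x describes.

```clojure
(require '[clojure.data.json :as json])
(import '[java.io EOFException PushbackReader StringReader])

;; The two recipe functions, repeated so the snippet runs on its own.
(defn test-eof [reader f]
  (try (f reader)
       (catch EOFException e nil)))

(defn read-all-json [reader]
  (loop [accum []]
    (if-let [record (test-eof reader json/read)]
      (recur (conj accum record))
      accum)))

;; Hypothetical helper: an in-memory stand-in for the data file.
;; Assumption: the PushbackReader wrapper is a precaution of this sketch,
;; not something the recipe itself does.
(defn string-reader [s]
  (PushbackReader. (StringReader. s) 64))

(read-all-json (string-reader "{\"a\": 1}\n{\"a\": 2}"))
;; => [{"a" 1} {"a" 2}]

;; Caveat: a top-level null parses to nil, which if-let treats as the end.
(read-all-json (string-reader "{\"a\": 1} null {\"a\": 2}"))
;; => [{"a" 1}]

;; Hypothetical variant: ask json/read for a unique sentinel at end of input,
;; so that nil and false can be kept as ordinary values.
(defn read-all-json-sentinel [reader]
  (let [eof (Object.)]
    (loop [accum []]
      (let [record (json/read reader :eof-error? false :eof-value eof)]
        (if (identical? record eof)
          accum
          (recur (conj accum record)))))))

(read-all-json-sentinel (string-reader "{\"a\": 1} null {\"a\": 2}"))
;; => [{"a" 1} nil {"a" 2}]
```

For the file used in this recipe, which holds only JSON objects at the top level, the original read-all-json is sufficient.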

Finally, we'll use these two functions to read the data from the file:

```
(def d
  (i/to-dataset
    (with-open [r (io/reader "data/delicious-rss-214k.json")]
      (read-all-json r))))
```

This binds d to a new dataset that contains the information read in from the JSON documents.

How it works…

Like all Lisps (the name comes from LISt Processing), Clojure is usually read from the inside out and from right to left. Let's break it down. clojure.java.io/reader opens the file for reading, and with-open makes sure that the reader is closed once its body has finished. read-all-json parses all of the JSON documents in the file into a sequence. In this case, it returns a vector of maps.

incanter.core/to-dataset takes a sequence of maps and returns an Incanter dataset. This dataset uses the keys in the maps as column names, and it converts the data values into a matrix. to-dataset can also accept many other data structures. Try (doc i/to-dataset) in the REPL (doc shows the documentation string attached to the function), or see the Incanter documentation at http://data-sorcery.org/contents/ for more information.
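
The way map keys become column names can be checked on a tiny hand-built example. This is a sketch only: the two records are invented stand-ins shaped like parsed JSON objects (the second author value is made up), and it assumes Incanter 1.5.5 as pinned in project.clj.

```clojure
(require '[incanter.core :as i])

;; Two hand-written records with string keys, as json/read would produce.
;; "someone-else" is a made-up value for illustration.
(def small
  (i/to-dataset [{"author" "mccarrd4" "guidislink" false}
                 {"author" "someone-else" "guidislink" false}]))

;; One row per map.
(i/nrow small)
;; => 2

;; The map keys are used as the column names (compared as a set here,
;; since this sketch makes no claim about column order).
(set (i/col-names small))
;; => #{"author" "guidislink"}
```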