Reading JSON data into Incanter datasets
Another data format that's becoming increasingly popular is JavaScript Object Notation (JSON, http://json.org/). Like CSV, this is a plain text format, so it's easy for programs to work with. It provides more information about the data than CSV does, but at the cost of being more verbose. It also allows the data to be structured in more complicated ways, such as hierarchies or sequences of hierarchies.
Because JSON is a much richer data model than CSV, we might need to transform the data. In that case, we can just pull out the information we're interested in and flatten the nested maps before we pass it to Incanter. In this recipe, however, we'll just work with fairly simple data structures.
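As an illustration of the flattening step mentioned above, here is a minimal sketch. The helper name flatten-map, the underscore-joined key scheme, and the sample record are all assumptions made for this example; they are not part of Incanter or data.json.

```clojure
;; Hypothetical helper: flattens nested maps into a single-level map.
;; Nested keys are joined with an underscore, so {"a" {"b" 1}}
;; becomes {:a_b 1}. Keys may be strings or keywords.
(defn flatten-map
  ([m] (flatten-map nil m))
  ([prefix m]
   (reduce-kv
     (fn [acc k v]
       (let [k' (if prefix
                  (keyword (str (name prefix) "_" (name k)))
                  (keyword (name k)))]
         (if (map? v)
           (merge acc (flatten-map k' v)) ; recurse into nested maps
           (assoc acc k' v))))
     {}
     m)))

;; Made-up record shaped like a parsed JSON object (string keys).
(def record
  {"link" "http://example.com/"
   "author" "someone"
   "title_detail" {"value" "An example title"
                   "type" "text/plain"}})

(def flat-record (flatten-map record))
```

Each flattened map then has only scalar values, which is the shape that works best as a row in a dataset.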
Getting ready
First, here are the contents of the Leiningen project.clj file:
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.json "0.2.5"]])
Use these libraries in your REPL or program (inside an ns form):
(require '[incanter.core :as i]
         '[clojure.data.json :as json]
         '[clojure.java.io :as io])
(import '[java.io EOFException])
Moreover, you need some data. For this, I have a file named delicious-rss-214k.json, which I placed in a folder named data. It contains a number of top-level JSON objects. For example, the first one starts like this:
{
  "guidislink": false,
  "link": "http://designreviver.com/tips/a-collection-of-wordpress-tutorials-tips-and-themes/",
  "title_detail": {
    "base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100",
    "value": "A Collection of Wordpress Tutorials, Tips and Themes | Design Reviver",
    "language": null,
    "type": "text/plain"
  },
  "author": "mccarrd4",
  …
You can download this data file, which originally comes from Infochimps, at http://www.ericrochester.com/clj-data-analysis/data/delicious-rss-214k.json.xz. The file is xz-compressed, so you'll need to decompress it into the data directory.
How to do it…
Once everything's in place, we'll need a couple of functions to make it easier to handle the multiple JSON objects at the top level of the file:
- We'll need a function that attempts to call a function on an instance of java.io.Reader and returns nil if there's an EOFException, in case there's a problem reading the file:
(defn test-eof [reader f]
  (try
    (f reader)
    (catch EOFException e
      nil)))
- Now, we'll build on this to repeatedly parse a JSON document from an instance of java.io.Reader. We do this by repeatedly calling test-eof until it returns nil, which signals the end of the input, accumulating the returned values as we go:
(defn read-all-json [reader]
  (loop [accum []]
    (if-let [record (test-eof reader json/read)]
      (recur (conj accum record))
      accum)))
- Finally, we'll use these two functions to read the data from the file:
(def d (i/to-dataset
         (with-open [r (io/reader "data/delicious-rss-214k.json")]
           (read-all-json r))))
This binds d to a new dataset that contains the information read in from the JSON documents.
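One caveat with the if-let loop above is that it stops at the first nil or false, so a top-level JSON null or false would end the read early. The following sketch shows one possible variant, not the recipe's method: it relies on the :eof-error? and :eof-value options of json/read and a unique sentinel object. The function name read-all-json-sentinel and the sample string are made up for this example, and the reader is wrapped in a PushbackReader on the assumption that this keeps repeated reads safe across data.json versions.

```clojure
(require '[clojure.data.json :as json])
(import '[java.io PushbackReader StringReader])

;; Hypothetical variant of read-all-json: instead of catching
;; EOFException, ask json/read to return a sentinel at end of input.
(defn read-all-json-sentinel [reader]
  (let [eof (Object.)] ; unique object, never equal to parsed data
    (loop [accum []]
      (let [record (json/read reader :eof-error? false :eof-value eof)]
        (if (identical? record eof)
          accum
          (recur (conj accum record)))))))

;; Made-up input: three top-level JSON values, one of them null.
(def sample "{\"a\": 1}\nnull\n{\"a\": 2}")

(def parsed
  (with-open [r (PushbackReader. (StringReader. sample) 64)]
    (read-all-json-sentinel r)))
```

For the recipe's data file, where every top-level value is an object, the simpler if-let version behaves the same way.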
How it works…
Like all Lisps (the name is short for List Processing), Clojure is usually read from the inside out and from right to left. Let's break it down. clojure.java.io/reader opens the file for reading. read-all-json parses all of the JSON documents in the file into a sequence. In this case, it returns a vector of maps. incanter.core/to-dataset takes a sequence of maps and returns an Incanter dataset. This dataset will use the keys in the maps as column names, and it will convert the data values into a matrix. Actually, to-dataset can accept many different data structures. Try (doc i/to-dataset) in the REPL (doc shows the documentation string attached to the function), or see the Incanter documentation at http://data-sorcery.org/contents/ for more information.
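To see how to-dataset treats a sequence of maps without loading the full data file, a small sketch along these lines may help. The rows are made-up sample data, and it assumes the Incanter dependency from project.clj is on the classpath.

```clojure
(require '[incanter.core :as i])

;; Made-up rows standing in for parsed and flattened JSON objects.
(def rows
  [{:title "First bookmark" :author "alice"}
   {:title "Second bookmark" :author "bob"}])

;; Each map becomes a row; the map keys become the column names.
(def ds (i/to-dataset rows))

(i/nrow ds)       ; number of rows in the dataset
(i/col-names ds)  ; column names taken from the map keys
(i/$ :author ds)  ; values of the :author column
```

Inspecting the column names this way is a quick check that the keys in the parsed data line up with the columns you expect.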