Clojure Data Analysis Cookbook - Second Edition
Dive into data analysis with Clojure through over 100 practical recipes for every stage of the analysis and collection process

Product type Paperback

Published in Jan 2015

ISBN-13 9781784390297

Length 372 pages

Edition 2nd Edition

Languages

Clojure

Tools

Leiningen

Concepts

Data Analysis

Author (1):

Eric Richard Rochester

Reading XML data into Incanter datasets

One of the most popular formats for data is XML. Some people love it, while some hate it. However, almost everyone has to deal with it at some point. While Clojure can use Java's XML libraries, it also ships with its own library, clojure.xml, which provides a more natural way to work with XML in Clojure.

Getting ready

First, include these dependencies in your Leiningen project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

Use these libraries in your REPL or program:

(require '[incanter.core :as i]
         '[clojure.xml :as xml]
         '[clojure.zip :as zip])

Then, find a data file. I visited the website for the Open Data Catalog for Washington, D.C. (http://data.octo.dc.gov/), and downloaded the data for the 2013 crime incidents. I moved this file to data/crime_incidents_2013_plain.xml. This is how the contents of the file look:

<?xml version="1.0" encoding="iso-8859-1"?>
<dcst:ReportedCrimes 
    xmlns:dcst="http://dc.gov/dcstat/types/1.0/">
  <dcst:ReportedCrime 
     xmlns:dcst="http://dc.gov/dcstat/types/1.0/">
        <dcst:ccn><![CDATA[04104147]]></dcst:ccn>
        <dcst:reportdatetime>
          2013-04-16T00:00:00-04:00
        </dcst:reportdatetime>
  …

How to do it…

Now, let's see how to load this file into an Incanter dataset:

The solution for this recipe is a little more complicated, so we'll wrap it into a function:

(defn load-xml-data [xml-file first-data next-data]
  ;; data-map turns one child element into a [tag first-content] pair.
  (let [data-map (fn [node]
                   [(:tag node) (first (:content node))])]
    (->>
      (xml/parse xml-file)
      zip/xml-zip
      first-data
      (iterate next-data)
      (take-while #(not (nil? %)))
      (map zip/children)
      (map #(mapcat data-map %))
      (map #(apply array-map %))
      i/to-dataset)))

We can call the function like this. Because there are so many columns, we'll just verify the data that is loaded by looking at the column names and the row count:

user=> (def d
         (load-xml-data "data/crime_incidents_2013_plain.xml"
                        zip/down zip/right))
user=> (i/col-names d)
[:dcst:ccn :dcst:reportdatetime :dcst:shift :dcst:offense :dcst:method :dcst:lastmodifieddate :dcst:blocksiteaddress :dcst:blockxcoord :dcst:blockycoord :dcst:ward :dcst:anc :dcst:district :dcst:psa :dcst:neighborhoodcluster :dcst:businessimprovementdistrict :dcst:block_group :dcst:census_tract :dcst:voting_precinct :dcst:start_date :dcst:end_date]
user=> (i/nrow d)
35826

This looks good. The row count is the number of crimes reported in the dataset.

How it works…

This recipe follows a typical pipeline for working with XML:

Parsing an XML data file
Extracting the data nodes
Converting the data nodes into a sequence of maps representing the data
Converting the data into an Incanter dataset

load-xml-data implements this process. It takes three parameters:

The input filename
A function that takes the root node of the parsed XML and returns the first data node
A function that takes a data node and returns the next data node or nil, if there are no more nodes

First, the function parses the XML file and wraps it in a zipper (we'll talk more about zippers in the next section). Then, it uses the two functions that are passed in to extract all of the data nodes as a sequence. For each data node, the function retrieves that node's child nodes and converts them into a series of tag name / content pairs. The pairs for each data node are converted into a map, and the sequence of maps is converted into an Incanter dataset.
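
As a rough sketch of the same pipeline without Incanter, the following runs the steps on a small in-memory document using only clojure.xml and clojure.zip. The sample XML and the names sample-xml, xml->maps, and rows are illustrative assumptions, not part of the recipe. The final i/to-dataset step is left out so the example has no external dependencies, and it builds plain maps with into rather than array-map.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import '[java.io ByteArrayInputStream])

;; Hypothetical sample data standing in for the crime incidents file.
(def sample-xml
  (str "<crimes>"
       "<crime><ccn>001</ccn><shift>DAY</shift></crime>"
       "<crime><ccn>002</ccn><shift>EVENING</shift></crime>"
       "</crimes>"))

(defn xml->maps
  "Parses XML from an input stream and returns one map per data node.
  first-data and next-data play the same roles as in load-xml-data."
  [input first-data next-data]
  (let [data-map (fn [node] [(:tag node) (first (:content node))])]
    (->> (xml/parse input)
         zip/xml-zip
         first-data                  ; root location -> first data node
         (iterate next-data)         ; walk along the data nodes
         (take-while some?)          ; stop when next-data returns nil
         (map zip/children)          ; child elements of each data node
         (map #(into {} (map data-map %))))))

(def rows
  (xml->maps (ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))
             zip/down zip/right))
```

Here zip/down moves from the root to the first crime element and zip/right steps through its siblings. For a document whose records sit one level deeper, something like (comp zip/down zip/down) could be passed as first-data instead.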

There's more…

We used a couple of interesting data structures or constructs in this recipe. Both are common in functional programming or Lisp, but neither has made its way into more mainstream programming. We should spend a minute with them.

Navigating structures with zippers

The first thing that happens to the parsed XML is that it gets passed to clojure.zip/xml-zip. Zippers are standard data structures that encapsulate the data at a position in a tree structure, as well as the information necessary to navigate back out. xml-zip takes Clojure's native XML data structure and turns it into something that can be navigated quickly using functions such as clojure.zip/down and clojure.zip/right. Being a functional programming language, Clojure encourages you to use immutable data structures, and zippers provide an efficient, natural way to navigate and modify a tree-like structure, such as an XML document.

Zippers are very useful and interesting, and understanding them can help you understand and work better with immutable data structures. For more information on zippers, the Clojure-doc page is helpful (http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html). However, if you would rather dive into the deep end, see Gerard Huet's paper, The Zipper (http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf).
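
To make the navigation concrete, here is a minimal sketch that walks and edits a tiny tree with clojure.zip. The tree is written by hand in the :tag / :attrs / :content shape that clojure.xml produces, and the names tree, root-loc, a-loc, b-loc, and edited are made up for this illustration.

```clojure
(require '[clojure.zip :as zip])

;; A small hand-written tree in the shape clojure.xml produces.
(def tree
  {:tag :root :attrs nil
   :content [{:tag :a :attrs nil :content ["1"]}
             {:tag :b :attrs nil :content ["2"]}]})

(def root-loc (zip/xml-zip tree))

;; Move down to the first child, then right to its sibling.
(def a-loc (zip/down root-loc))
(def b-loc (zip/right a-loc))

(:tag (zip/node a-loc))   ; zip/node returns the node at a location
(:tag (zip/node b-loc))
(zip/right b-loc)         ; nil, because there is no further sibling

;; "Modifying" the tree returns a new tree; the original is untouched.
(def edited
  (-> b-loc
      (zip/edit assoc :content ["changed"])
      zip/root))
```

Each location remembers how to get back to the root, which is why zip/root can rebuild the whole tree from b-loc after the edit.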

Processing in a pipeline

We used the ->> macro to express our process as a pipeline. Where deeply nested function calls have to be read from the inside out, this macro lets you read the steps in order, from left to right or from top to bottom, which makes the process's data flow and series of transformations much clearer.

We can do this in Clojure because of its macro system. ->> simply rewrites the calls into Clojure's native, nested format when the form is macro-expanded, before it is compiled. The first argument of the macro is inserted into the next expression as its last argument. The resulting form is inserted into the third expression as its last argument, and so on, until the end of the form. Let's trace this through a few steps. Say we start off with the expression (->> x first (map count) (apply +)). As Clojure builds the final expression, these are the intermediate steps:

(->> x first (map count) (apply +))
(->> (first x) (map count) (apply +))
(->> (map count (first x)) (apply +))
(apply + (map count (first x)))
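
You can check a rewrite like this at the REPL. The sketch below uses clojure.walk/macroexpand-all to show the final nested form, and then confirms that the threaded and hand-nested versions compute the same value. It uses count, Clojure's built-in function for the length of a collection, so that the form can actually be evaluated. The sample value of x and the names expanded, threaded, and nested are assumptions made for this example.

```clojure
(require '[clojure.walk :as walk])

;; Ask Clojure for the fully expanded form. The form is quoted,
;; so x is not evaluated here.
(def expanded
  (walk/macroexpand-all '(->> x first (map count) (apply +))))

;; The threaded form and the hand-nested form compute the same value.
(def x [["ab" "cde"] ["ignored"]])
(def threaded (->> x first (map count) (apply +)))
(def nested   (apply + (map count (first x))))
```

With this x, the first element is ["ab" "cde"], its string lengths are 2 and 3, and both threaded and nested evaluate to 5.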

Comparing XML and JSON

XML and JSON (from the Reading JSON data into Incanter datasets recipe) are very similar. Arguably, much of the popularity of JSON is driven by disillusionment with XML's verboseness.

When we're dealing with these formats in Clojure, the biggest difference is that JSON is converted directly to native Clojure data structures that mirror the data, such as maps and vectors. Meanwhile, XML is read into element maps that reflect the structure of XML, not the structure of the data.

In other words, the keys of the maps for JSON will come from the domain: first_name or age, for instance. However, the keys of the maps for XML will come from the data format (:tag, :attrs, and :content in clojure.xml), and only the tag and attribute names stored under those keys will come from the domain. This extra level of indirection makes XML more unwieldy.
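
The difference is easy to see side by side. In this sketch, from-json is written by hand to stand in for what a JSON parser would typically return (keyword keys are an assumption, and no JSON library is loaded), while from-xml is really parsed with clojure.xml. The record and the helper child-text are hypothetical, made up for this comparison.

```clojure
(require '[clojure.xml :as xml])
(import '[java.io ByteArrayInputStream])

;; Hand-written stand-in for parsed JSON: {"first_name": "Ada", "age": 36}
(def from-json {:first_name "Ada" :age 36})

;; The same record as XML, parsed with clojure.xml.
(def from-xml
  (xml/parse
    (ByteArrayInputStream.
      (.getBytes "<person><first_name>Ada</first_name><age>36</age></person>"
                 "UTF-8"))))

;; JSON: a single lookup by domain key.
(:first_name from-json)

;; XML: search :content for the element whose :tag matches, then unwrap
;; its own :content to get at the text.
(defn child-text [element tag]
  (->> (:content element)
       (filter #(= tag (:tag %)))
       first
       :content
       first))

(child-text from-xml :first_name)
```

Note also that the XML value for age arrives as the string "36", whereas the JSON value is already a number, so XML data usually needs an extra conversion step.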