Reading XML data into Incanter datasets
One of the most popular formats for data is XML. Some people love it, while some hate it. However, almost everyone has to deal with it at some point. While Clojure can use Java's XML libraries, it also has its own package which provides a more natural way to work with XML in Clojure.
Getting ready
First, include these dependencies in your Leiningen project.clj file:
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])
Use these libraries in your REPL or program:
(require '[incanter.core :as i]
         '[clojure.xml :as xml]
         '[clojure.zip :as zip])
Then, find a data file. I visited the website for the Open Data Catalog for Washington, D.C. (http://data.octo.dc.gov/), and downloaded the data for the 2013 crime incidents. I moved this file to data/crime_incidents_2013_plain.xml. This is how the contents of the file look:
<?xml version="1.0" encoding="iso-8859-1"?>
<dcst:ReportedCrimes xmlns:dcst="http://dc.gov/dcstat/types/1.0/">
  <dcst:ReportedCrime xmlns:dcst="http://dc.gov/dcstat/types/1.0/">
    <dcst:ccn><![CDATA[04104147]]></dcst:ccn>
    <dcst:reportdatetime>
      2013-04-16T00:00:00-04:00
    </dcst:reportdatetime>
    …
How to do it…
Now, let's see how to load this file into an Incanter dataset:
- The solution for this recipe is a little more complicated, so we'll wrap it into a function:
(defn load-xml-data [xml-file first-data next-data]
  (let [;; Turn one child element into a [tag content] pair.
        data-map (fn [node]
                   [(:tag node) (first (:content node))])]
    (->> (xml/parse xml-file)
         zip/xml-zip
         first-data
         (iterate next-data)
         ;; Stop once next-data runs out of nodes and returns nil.
         (take-while #(not (nil? %)))
         (map zip/children)
         (map #(mapcat data-map %))
         (map #(apply array-map %))
         i/to-dataset)))
- We can call the function like this. Because there are so many columns, we'll just verify the data that is loaded by looking at the column names and the row count:
user=> (def d
         (load-xml-data "data/crime_incidents_2013_plain.xml"
                        zip/down zip/right))
user=> (i/col-names d)
[:dcst:ccn :dcst:reportdatetime :dcst:shift :dcst:offense
 :dcst:method :dcst:lastmodifieddate :dcst:blocksiteaddress
 :dcst:blockxcoord :dcst:blockycoord :dcst:ward :dcst:anc
 :dcst:district :dcst:psa :dcst:neighborhoodcluster
 :dcst:businessimprovementdistrict :dcst:block_group
 :dcst:census_tract :dcst:voting_precinct :dcst:start_date
 :dcst:end_date]
user=> (i/nrow d)
35826
This looks good. The row count is the number of crimes reported in the dataset.
How it works…
This recipe follows a typical pipeline for working with XML:
- Parsing an XML data file
- Extracting the data nodes
- Converting the data nodes into a sequence of maps representing the data
- Converting the data into an Incanter dataset
load-xml-data implements this process. It takes three parameters:
- The input filename
- A function that takes the root node of the parsed XML and returns the first data node
- A function that takes a data node and returns the next data node or nil, if there are no more nodes
First, the function parses the XML file and wraps it in a zipper (we'll talk more about zippers in the next section). Then, it uses the two functions that are passed in to extract all of the data nodes as a sequence. For each data node, the function retrieves that node's child nodes and converts them into a series of tag name / content pairs. The pairs for each data node are converted into a map, and the sequence of maps is converted into an Incanter dataset.
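To watch these steps without a large data file, here is a minimal sketch that runs the same pipeline over a tiny in-memory document. The sample XML and the names sample-xml, parse-str, node->pairs, and rows are made up for illustration; the final i/to-dataset step is left out so that the sketch needs only Clojure itself, and some? stands in for #(not (nil? %)).

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import '[java.io ByteArrayInputStream])

;; Hypothetical two-record document, made up for illustration.
(def sample-xml
  (str "<crimes>"
       "<crime><ccn>001</ccn><shift>DAY</shift></crime>"
       "<crime><ccn>002</ccn><shift>EVENING</shift></crime>"
       "</crimes>"))

(defn parse-str
  "Parses an XML string by wrapping it in an input stream."
  [s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn node->pairs
  "Returns [tag first-content] for one child element."
  [node]
  [(:tag node) (first (:content node))])

(def rows
  (->> (parse-str sample-xml)
       zip/xml-zip
       zip/down                       ; first data node
       (iterate zip/right)            ; each following sibling, then nil
       (take-while some?)             ; stop at the first nil
       (map zip/children)             ; child elements of each data node
       (map #(mapcat node->pairs %))  ; flat seq: tag, content, tag, content
       (map #(apply array-map %))))   ; one map per data node

rows
;; => ({:ccn "001", :shift "DAY"} {:ccn "002", :shift "EVENING"})
```

Each map in rows is what would become one row of the dataset, and its keys are what would become the column names.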
There's more…
We used a couple of interesting data structures or constructs in this recipe. Both are common in functional programming or Lisp, but neither has made its way into more mainstream programming. We should spend a minute with them.
Navigating structures with zippers
The first thing that happens to the parsed XML is that it gets passed to clojure.zip/xml-zip. Zippers are standard data structures that encapsulate the data at a position in a tree structure, as well as the information necessary to navigate back out. This takes Clojure's native XML data structure and turns it into something that can be navigated quickly using commands such as clojure.zip/down and clojure.zip/right. Being a functional programming language, Clojure encourages you to use immutable data structures, and zippers provide an efficient, natural way to navigate and modify a tree-like structure, such as an XML document.
Zippers are very useful and interesting, and understanding them can help you understand and work better with immutable data structures. For more information on zippers, the Clojure-doc page is helpful (http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html). However, if you would rather dive into the deep end, see Gerard Huet's paper, The Zipper (http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf).
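As a small sketch of how navigation and modification work, the following uses clojure.zip/vector-zip on a nested vector rather than on XML, since the idea is the same for any tree. The tree and the names loc, at-two, and edited are made up for illustration.

```clojure
(require '[clojure.zip :as zip])

;; A small nested vector standing in for a tree.
(def tree [1 [2 3] 4])

;; A zipper location pointing at the root.
(def loc (zip/vector-zip tree))

;; Down to the first child, right to the nested vector, then down into it.
(def at-two (-> loc zip/down zip/right zip/down))

(zip/node at-two)
;; => 2

;; Editing returns a new location; zip/root rebuilds the whole tree from it.
(def edited (-> at-two (zip/edit * 10) zip/root))

edited
;; => [1 [20 3] 4]

;; The original tree is unchanged.
tree
;; => [1 [2 3] 4]
```

Because every step returns a new location instead of changing anything in place, the original data stays available, which is what makes zippers a good fit for immutable structures.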
Processing in a pipeline
We used the ->> macro to express our process as a pipeline. For deeply nested function calls, this macro lets you read the code from the left-hand side to the right-hand side, which makes the process's data flow and series of transformations much clearer.
We can do this in Clojure because of its macro system. ->> simply rewrites the calls into Clojure's native, nested format when the form is macroexpanded, before it is compiled. The first parameter of the macro is inserted into the next expression as the last parameter. This structure is inserted into the third expression as the last parameter, and so on, until the end of the form. Let's trace this through a few steps. Say, we start off with the expression (->> x first (map count) (apply +)). As Clojure builds the final expression, here's each intermediate step (at each stage, the first two elements after ->> are combined):
(->> x first (map count) (apply +))
(->> (first x) (map count) (apply +))
(->> (map count (first x)) (apply +))
(apply + (map count (first x)))
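You can check a rewrite like this at the REPL. The sketch below uses clojure.walk/macroexpand-all to show the fully nested form, and then evaluates the same pipeline on a made-up value. It uses count, Clojure's built-in function for the length of a collection, and the name expanded is only for illustration.

```clojure
(require '[clojure.walk :as walk])

;; macroexpand-all expands every nested macro call, leaving plain nested calls.
;; The form is quoted, so x is just a symbol here and is never evaluated.
(def expanded
  (walk/macroexpand-all '(->> x first (map count) (apply +))))

expanded
;; => (apply + (map count (first x)))

;; The same pipeline, evaluated with a concrete value in place of x:
(->> [["a" "bb" "ccc"] ["ignored"]]
     first          ; ["a" "bb" "ccc"]
     (map count)    ; (1 2 3)
     (apply +))
;; => 6
```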
Comparing XML and JSON
XML and JSON (from the Reading JSON data into Incanter datasets recipe) are very similar. Arguably, much of the popularity of JSON is driven by disillusionment with XML's verboseness.
When we're dealing with these formats in Clojure, the biggest difference is that JSON is converted directly to native Clojure data structures that mirror the data, such as maps and vectors. Meanwhile, XML is read into element maps that reflect the structure of XML, not the structure of the data.
In other words, the keys of the maps for JSON will come from the domain, first_name or age, for instance. However, the keys of the maps for XML will come from the data format (:tag, :attrs, and :content in the case of clojure.xml), and only the tag and attribute names will come from the domain. This extra level of abstraction makes XML more unwieldy.
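The difference is easy to see on a single record. In this sketch, the person record and the names person-xml, parsed, and person-from-json are made up for illustration. The JSON side is written as a literal map rather than parsed, on the assumption that the JSON library has been set up to return keyword keys.

```clojure
(require '[clojure.xml :as xml])
(import '[java.io ByteArrayInputStream])

;; Hypothetical person record, made up for illustration.
(def person-xml
  "<person id=\"7\"><first_name>Ada</first_name><age>36</age></person>")

(def parsed
  (xml/parse (ByteArrayInputStream. (.getBytes person-xml "UTF-8"))))

;; The top-level keys describe XML itself, not the person.
(keys parsed)
;; => (:tag :attrs :content)

(:tag parsed)
;; => :person

(:attrs parsed)
;; => {:id "7"}

;; Domain names only show up one level down, as tag values.
(map :tag (:content parsed))
;; => (:first_name :age)

;; Stand-in for what a JSON parser would return for
;; {"first_name": "Ada", "age": 36} when asked for keyword keys.
(def person-from-json {:first_name "Ada", :age 36})

(keys person-from-json)
;; => (:first_name :age)
```

With the JSON-style map, (:age person-from-json) gets the value directly. With the parsed XML, you first have to search :content for the element whose :tag is :age, and the value comes back as the string "36".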