Reading RDF data

More and more data is being published on the Internet as linked data, in a variety of formats such as microformats, RDFa, and RDF/XML.

Linked data represents entities as consistent URLs and includes links to other linked data databases. In a sense, it's the computer-readable equivalent of human-readable web pages. These formats are often used for open data, such as the data published by governments in the UK and elsewhere.

Linked data adds a lot of flexibility and power, but it also introduces more complexity. To work effectively with linked data, we often need to set up a triple store of some kind. In this recipe and the next three, we'll use Sesame (http://rdf4j.org/) and the kr Clojure library (https://github.com/drlivingston/kr).

Getting ready

First, we need to make sure that the dependencies are listed in our Leiningen project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                 [org.clojure/tools.logging "0.3.0"]
                 [org.slf4j/slf4j-simple "1.7.7"]])

Then, we'll execute these statements to load the libraries into our script or REPL:

(use 'incanter.core
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb
     'clojure.set)
(import [java.io File])

For this example, we'll get data from the Telegraphis Linked Data assets. We'll pull down the database of currencies at http://telegraphis.net/data/currencies/currencies.ttl. Just to be safe, I've downloaded that file and saved it as data/currencies.ttl, and we'll access it from there.
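
If you'd rather fetch the file from the REPL than a browser, Clojure's built-in slurp and spit are enough. This is just a convenience sketch; it assumes the URL above is still live and that the data directory already exists:

;; Download the Turtle file once and cache it locally.
(spit "data/currencies.ttl"
      (slurp "http://telegraphis.net/data/currencies/currencies.ttl"))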

We'll store the data, at least temporarily, in a Sesame data store (http://notes.3kbo.com/sesame) that allows us to easily store and query linked data.

How to do it…

The longest part of this process will be to define the data. The libraries we're using do all of the heavy lifting, as shown in the following steps:

  1. First, we will create the triple store and register the namespaces that the data uses. We'll bind this triple store to the name t-store:
    (defn kb-memstore
      "This creates a Sesame triple store in memory."
      []
      (kb :sesame-mem))
    (defn init-kb [kb-store]
      (register-namespaces
        kb-store
        '(("geographis"
            "http://telegraphis.net/ontology/geography/geography#")
          ("code"
            "http://telegraphis.net/ontology/measurement/code#")
          ("money"
            "http://telegraphis.net/ontology/money/money#")
          ("owl"
            "http://www.w3.org/2002/07/owl#")
          ("rdf"
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
          ("xsd"
            "http://www.w3.org/2001/XMLSchema#")
          ("currency"
            "http://telegraphis.net/data/currencies/")
          ("dbpedia" "http://dbpedia.org/resource/")
          ("dbpedia-ont" "http://dbpedia.org/ontology/")
          ("dbpedia-prop" "http://dbpedia.org/property/")
          ("err" "http://ericrochester.com/"))))
     
    (def t-store (init-kb (kb-memstore)))
  2. After taking a closer look at the data, we can identify what we want to pull out and start to formulate a query. We'll use the query DSL from the kr library (https://github.com/drlivingston/kr) and bind the query to the name q:
    (def q '((?/c rdf/type money/Currency)
               (?/c money/name ?/full_name)
               (?/c money/shortName ?/name)
               (?/c money/symbol ?/symbol)
               (?/c money/minorName ?/minor_name)
               (?/c money/minorExponent ?/minor_exp)
               (?/c money/isoAlpha ?/iso)
               (?/c money/currencyOf ?/country)))
  3. Now, we need a function that takes a result map and converts the variable names in the query into column names in the output dataset. The header-keyword and fix-headers functions will do this:
    (defn header-keyword
      "This converts a query symbol to a keyword."
      [header-symbol]
      (keyword (.replace (name header-symbol) \_ \-)))
    (defn fix-headers
      "This changes all of the keys in the map to make them
      valid header keywords."
      [coll]
      (into {}
           (map (fn [[k v]] [(header-keyword k) v])
                coll)))
  4. As usual, once all of the pieces are in place, the function that ties everything together is short:
    (defn load-data
      [k rdf-file q]
      (load-rdf-file k rdf-file)
      (to-dataset (map fix-headers (query k q))))
  5. Using this function is just as simple (a quick sanity check after these steps shows the raw results the query produces):
    user=> (def d
             (load-data t-store (File. "data/currencies.ttl") q))
    #'user/d
    user=> (sel d :rows (range 3)
             :cols [:full-name :name :iso :symbol])
    
    |                  :full-name |   :name | :iso | :symbol |
    |-----------------------------+---------+------+---------|
    | United Arab Emirates dirham |  dirham |  AED |       إ.د |
    |              Afghan afghani | afghani |  AFN |       ؋ |
    |                Albanian lek |     lek |  ALL |       L |
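
If the dataset looks off, it can help to bypass load-data and inspect what the query returns directly. The following is only a sanity-check sketch: it reuses t-store and q from the steps above, and the exact shape of the binding maps depends on the version of the kr library:

;; Load the Turtle file and run the query by hand, without Incanter.
(load-rdf-file t-store (File. "data/currencies.ttl"))

;; Each result should be a map from query variables (as symbols,
;; such as full_name) to their bound values; fix-headers converts
;; those symbols into keywords such as :full-name.
(first (query t-store q))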

How it works…

First, here's some background information. Resource Description Framework (RDF) isn't an XML format, although it's often written using XML. (There are other formats as well, such as N3 and Turtle.) RDF sees the world as a set of statements. Each statement has at least three parts (a triple): a subject, a predicate, and an object. The subject and predicate must be URIs. (URIs are like URLs, only more general; for example, uri:7890 is a valid URI.) Objects can be literals or URIs. The URIs form a graph: they link to each other and make statements about each other. This is where the "linked" in linked data comes from.
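
To make the triple model concrete, here's a small sketch written in the kr DSL used throughout this recipe. Note the assumptions: the add function comes from kr's RDF API but isn't used elsewhere in this recipe, and err/example-currency is a made-up resource in the scratch namespace we registered in init-kb:

;; One statement: subject err/example-currency, predicate rdf/type,
;; object money/Currency. The namespace prefixes expand to the full
;; URIs registered in init-kb.
(add t-store '(err/example-currency rdf/type money/Currency))

;; ?/s is a variable; this query binds it to every subject with
;; rdf/type money/Currency, including the statement just added.
(query t-store '((?/s rdf/type money/Currency)))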

If you want more information about linked data, http://linkeddata.org/guides-and-tutorials has some good recommendations.

Now, about our recipe. From a high level, the process we used here is pretty simple:

  1. Create a triple store (kb-memstore and init-kb)
  2. Load the data (load-data)
  3. Query the data to pull out only what you want (q and load-data)
  4. Transform it into a format that Incanter can ingest easily (header-keyword and fix-headers)
  5. Finally, create the Incanter dataset (load-data)

The newest thing here is the query format. kr uses a nice SPARQL-like DSL to express queries. In fact, it's so easy to use that we'll deal with it instead of working with raw RDF. The items starting with ?/ are variables that will be used as keys for the result maps. The other items look like rdf-namespace/value. The namespaces are taken from those registered in init-kb. These are different from Clojure's namespaces, although they serve a similar function for your data: to partition and provide context.
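
As a worked example, here's what the header helpers from step 3 do to a single binding map. The values are illustrative; real ones come from the query results:

;; header-keyword turns one query variable into an Incanter-friendly
;; column keyword.
(header-keyword 'full_name)
;; => :full-name

;; fix-headers applies the same conversion to every key in a map.
(fix-headers '{full_name "Albanian lek", name "lek", iso "ALL"})
;; => {:full-name "Albanian lek", :name "lek", :iso "ALL"}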

See also

The next few recipes, Querying RDF data with SPARQL and Aggregating data from different formats, build on this recipe and will use much of the same setup and techniques.
