Clojure Data Analysis Cookbook - Second Edition

Dive into data analysis with Clojure through over 100 practical recipes for every stage of the analysis and collection process.

Product type: Paperback
Published: January 2015
ISBN-13: 9781784390297
Length: 372 pages
Edition: 2nd Edition
Author: Eric Richard Rochester

Table of Contents (14 chapters)

Preface
1. Importing Data for Analysis
2. Cleaning and Validating Data
3. Managing Complexity with Concurrent Programming
4. Improving Performance with Parallel Programming
5. Distributed Data Processing with Cascalog
6. Working with Incanter Datasets
7. Statistical Data Analysis with Incanter
8. Working with Mathematica and R
9. Clustering, Classifying, and Working with Weka
10. Working with Unstructured and Textual Data
11. Graphing in Incanter
12. Creating Charts for the Web
Index

Reading RDF data

More and more data is being published on the Internet as linked data, in a variety of formats such as microformats, RDFa, and RDF/XML.

Linked data represents entities as consistent URLs and includes links to other linked data databases. In a sense, it's the computer-readable equivalent of human-readable web pages. These formats are often used for open data, such as the data published by some governments, in the UK and elsewhere.

Linked data adds a lot of flexibility and power, but it also introduces more complexity. Often, to work effectively with linked data, we need to start a triple store of some kind. In this recipe and the next three, we'll use Sesame (http://rdf4j.org/) and the kr Clojure library (https://github.com/drlivingston/kr).

Getting ready

First, we need to make sure that the dependencies are listed in our Leiningen project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                 [org.clojure/tools.logging "0.3.0"]
                 [org.slf4j/slf4j-simple "1.7.7"]])

Then, we'll load these libraries into our script or REPL:

(use 'incanter.core
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb
     'clojure.set)
(import [java.io File])

For this example, we'll get data from the Telegraphis Linked Data assets. We'll pull down the database of currencies at http://telegraphis.net/data/currencies/currencies.ttl. Just to be safe, I've downloaded that file and saved it as data/currencies.ttl, and we'll access it from there.
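
If you'd rather fetch the file from the REPL than download it by hand, here's a minimal sketch (this assumes the telegraphis.net URL is still reachable and that the data directory already exists):

;; Download the Turtle file and save it locally as data/currencies.ttl.
(spit "data/currencies.ttl"
      (slurp "http://telegraphis.net/data/currencies/currencies.ttl"))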

We'll store the data, at least temporarily, in a Sesame data store (http://notes.3kbo.com/sesame) that allows us to easily store and query linked data.

How to do it…

The longest part of this process will be defining the data. The libraries we're using do all of the heavy lifting, as shown in the following steps:

  1. First, we will create the triple store and register the namespaces that the data uses. We'll bind this triple store to the name t-store:
    (defn kb-memstore
      "This creates a Sesame triple store in memory."
      []
      (kb :sesame-mem))
    (defn init-kb [kb-store]
      (register-namespaces
        kb-store
        '(("geographis"
            "http://telegraphis.net/ontology/geography/geography#")
          ("code"
            "http://telegraphis.net/ontology/measurement/code#")
          ("money"
            "http://telegraphis.net/ontology/money/money#")
          ("owl"
            "http://www.w3.org/2002/07/owl#")
          ("rdf"
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
          ("xsd"
            "http://www.w3.org/2001/XMLSchema#")
          ("currency"
            "http://telegraphis.net/data/currencies/")
          ("dbpedia" "http://dbpedia.org/resource/")
          ("dbpedia-ont" "http://dbpedia.org/ontology/")
          ("dbpedia-prop" "http://dbpedia.org/property/")
          ("err" "http://ericrochester.com/"))))
     
    (def t-store (init-kb (kb-memstore)))
  2. After taking another look at the data, we can identify what we want to pull out and start to formulate a query. We'll use the kr library's (https://github.com/drlivingston/kr) query DSL and bind the query to the name q:
    (def q '((?/c rdf/type money/Currency)
               (?/c money/name ?/full_name)
               (?/c money/shortName ?/name)
               (?/c money/symbol ?/symbol)
               (?/c money/minorName ?/minor_name)
               (?/c money/minorExponent ?/minor_exp)
               (?/c money/isoAlpha ?/iso)
               (?/c money/currencyOf ?/country)))
  3. Now, we need a function that takes a result map and converts the variable names in the query into column names in the output dataset. The header-keyword and fix-headers functions will do this (there's a short example of what they produce just after these steps):
    (defn header-keyword
      "This converts a query symbol to a keyword."
      [header-symbol]
      (keyword (.replace (name header-symbol) \_ \-)))
    (defn fix-headers
      "This changes all of the keys in the map to make them
      valid header keywords."
      [coll]
      (into {}
           (map (fn [[k v]] [(header-keyword k) v])
                coll)))
  4. As usual, once all of the pieces are in place, the function that ties everything together is short:
    (defn load-data
      [k rdf-file q]
      (load-rdf-file k rdf-file)
      (to-dataset (map fix-headers (query k q))))
  5. Using this function is just as simple. Assuming we've bound the loaded dataset to d, for instance with (def d (load-data t-store (File. "data/currencies.ttl") q)), we can pull out a few rows:
    user=> (sel d :rows (range 3)
             :cols [:full-name :name :iso :symbol])
    
    |                  :full-name |   :name | :iso | :symbol |
    |-----------------------------+---------+------+---------|
    | United Arab Emirates dirham |  dirham |  AED |       إ.د |
    |              Afghan afghani | afghani |  AFN |       ؋ |
    |                Albanian lek |     lek |  ALL |       L |
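
As promised in step 3, here's a quick, hedged sketch (not from the book) of what the header helpers do, assuming the kr query returns one binding map per solution, keyed by the query variables:

(header-keyword '?/full_name)
;; => :full-name

(fix-headers '{?/full_name "Albanian lek", ?/iso "ALL"})
;; => {:full-name "Albanian lek", :iso "ALL"}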

How it works…

First, here's some background information. Resource Description Framework (RDF) isn't an XML format, although it's often written using XML. (There are other formats as well, such as N3 and Turtle.) RDF sees the world as a set of statements. Each statement has at least three parts (a triple): a subject, a predicate, and an object. The subject and predicate must be URIs. (URIs are like URLs, only more general. For example, uri:7890 is a valid URI.) Objects can be either a literal or a URI. The URIs form a graph: they link to one another and make statements about each other. This is where the "linked" in linked data comes from.
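
For example, a single currency might be described with Turtle statements along these lines (the subject URI and values here are only illustrative, not copied from the actual file):

@prefix money: <http://telegraphis.net/ontology/money/money#> .

<http://telegraphis.net/data/currencies/ALL#ALL>
    a money:Currency ;
    money:shortName "lek" ;
    money:isoAlpha "ALL" .

Each predicate-object pair forms one statement about the subject; a in Turtle is shorthand for rdf:type.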

If you want more information about linked data, http://linkeddata.org/guides-and-tutorials has some good recommendations.

Now, about our recipe. At a high level, the process we used here is pretty simple:

  1. Create a triple store (kb-memstore and init-kb)
  2. Load the data (load-data)
  3. Query the data to pull out only what you want (q and load-data)
  4. Transform it into a format that Incanter can ingest easily (header-keyword and fix-headers)
  5. Finally, create the Incanter dataset (load-data)

The newest thing here is the query format. The kr library uses a nice SPARQL-like DSL to express queries. In fact, it's so easy to use that we'll deal with it instead of working with raw RDF. The items starting with ?/ are variables that will be used as keys for the result maps. The other items look like rdf-namespace/value. The namespace is taken from the namespaces registered in init-kb. These are different from Clojure's namespaces, although they serve a similar function for your data: to partition it and provide context.
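
For intuition, here's a rough, hand-written sketch of how a couple of the clauses in q line up with SPARQL triple patterns; the library builds its own queries internally, so this correspondence is only illustrative:

;; kr query DSL clause              ~   SPARQL triple pattern
;; (?/c rdf/type money/Currency)    ~   ?c rdf:type money:Currency .
;; (?/c money/isoAlpha ?/iso)       ~   ?c money:isoAlpha ?iso .
;; Each ?/ variable becomes a key in the result binding maps.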

See also

The next few recipes, Querying RDF data with SPARQL and Aggregating data from different formats, build on this recipe and will use much of the same setup and techniques.
