Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Clojure Data Analysis Cookbook - Second Edition

You're reading from   Clojure Data Analysis Cookbook - Second Edition Dive into data analysis with Clojure through over 100 practical recipes for every stage of the analysis and collection process

Arrow left icon
Product type Paperback
Published in Jan 2015
Publisher
ISBN-13 9781784390297
Length 372 pages
Edition 2nd Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Eric Richard Rochester Eric Richard Rochester
Author Profile Icon Eric Richard Rochester
Eric Richard Rochester
Arrow right icon
View More author details
Toc

Table of Contents (14) Chapters Close

Preface 1. Importing Data for Analysis FREE CHAPTER 2. Cleaning and Validating Data 3. Managing Complexity with Concurrent Programming 4. Improving Performance with Parallel Programming 5. Distributed Data Processing with Cascalog 6. Working with Incanter Datasets 7. Statistical Data Analysis with Incanter 8. Working with Mathematica and R 9. Clustering, Classifying, and Working with Weka 10. Working with Unstructured and Textual Data 11. Graphing in Incanter 12. Creating Charts for the Web Index

Querying RDF data with SPARQL

For the last recipe, Reading RDF data, the embedded domain-specific language (EDSL) used for the query gets converted to SPARQL, which is the query language for many linked data systems. If you squint just right at the query, it looks kind of like a SPARQL WHERE clause. For example, you can query DBPedia to get information about a city, such as its population, location, and other data. It's a simple query, but a query nevertheless.

This worked great when we had access to the raw data in our own triple store. However, if we need to access a remote SPARQL endpoint directly, it's more complicated.

For this recipe, we'll query DBPedia (http://dbpedia.org) for information on the United Arab Emirates currency, which is the Dirham. DBPedia extracts structured information from Wikipedia (the summary boxes) and republishes it as RDF. Just as Wikipedia is a useful first-stop for humans to get information about something, DBPedia is a good starting point for computer programs that want to gather data about a domain.

Getting ready

First, we need to make sure that the dependencies are listed in our Leiningen project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                 [org.clojure/tools.logging "0.3.0"]
                 [org.slf4j/slf4j-simple "1.7.7"]])

Then, load the Clojure and Java libraries we'll use:

(require '[clojure.java.io :as io]
         '[clojure.xml :as xml]
         '[clojure.pprint :as pp]
         '[clojure.zip :as zip])
(use 'incanter.core
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb
     'clojure.set)
(import [java.io File]
        [java.net URL URLEncoder])

How to do it…

As we work through this, we'll define a series of functions. Finally, we'll create one function, load-data, to orchestrate everything, and we'll finish by doing the following:

  1. We have to create a Sesame triple store and initialize it with the namespaces we'll use. For both of these, we'll use the kb-memstore and init-kb functions from Reading RDF data. We define a function that takes a URI for a subject in the triple store and constructs a SPARQL query that returns at most 200 statements about this subject. The function then filters out any statements with non-English strings for objects, but it allows everything else:
    (defn make-query
      "This creates a query that returns all of the 
      triples related to asubject URI. It 
      filters out non-English strings."
      ([subject kb]
       (binding [*kb* kb
                 *select-limit* 200]
         (sparql-select-query
           (list '(~subject ?/p ?/o)
                 '(:or (:not (:isLiteral ?/o))
                       (!= (:datatype ?/o) rdf/langString)
                       (= (:lang ?/o) ["en"])))))))
  2. Now that we have the query, we'll need to encode it into a URL in order to retrieve the results:
    (defn make-query-uri
      "This constructs a URI for the query."
      ([base-uri query]
       (URL. (str base-uri
                  "?format=" 
                  (URLEncoder/encode "text/xml")
                  "&query=" (URLEncoder/encode query)))))
  3. Once we get a result, we'll parse the XML file, wrap it in a zipper, and navigate to the first result. All of this will be in a function that we'll write in a minute. Right now, the next function will take this first result node and return a list of all the results:
    (defn result-seq
      "This takes the first result and returns a sequence 
      of this node, plus all of the nodes to the right of 
      it."
      ([first-result]
       (cons (zip/node first-result)
             (zip/rights first-result))))
  4. The following set of functions takes each result node and returns a key-value pair (result-to-kv). It uses binding-str to pull the results out of the XML. Then, accum-hash pushes the key-value pairs into a map. Keys that occur more than once have their values accumulated in a vector:
    (defn binding-str
      "This takes a binding, pulls out the first tag's 
      content, and concatenates it into a string."
      ([b]
       (apply str (:content (first (:content b))))))
    
    (defn result-to-kv
      "This takes a result node and creates a key-value 
      vector pair from it."
      ([r]
       (let [[p o] (:content r)]
         [(binding-str p) (binding-str o)])))
    
    (defn accum-hash
      ([m [k v]]
       (if-let [current (m k)]
         (assoc m k (str current \space v))
         (assoc m k v))))
  5. For the last utility function, we'll define rekey. This will convert the keys of a map based on another map:
    (defn rekey
      "This just flips the arguments for 
      clojure.set/rename-keys to make it more
      convenient."
      ([k-map map]
       (rename-keys 
         (select-keys map (keys k-map)) k-map)))
  6. Let's now add a function that takes a SPARQL endpoint and subject and returns a sequence of result nodes. This will use several of the functions we've just defined:
    (defn query-sparql-results
      "This queries a SPARQL endpoint and returns a 
      sequence of result nodes."
      ([sparql-uri subject kb]
       (->> kb
         ;; Build the URI query string.
         (make-query subject)
         (make-query-uri sparql-uri)
         ;; Get the results, parse the XML,
         ;; and return the zipper.
         io/input-stream
         xml/parse
         zip/xml-zip
         ;; Find the first child.
         zip/down
         zip/right
         zip/down
         ;; Convert all children into a sequence.
         result-seq)))
  7. Finally, we can pull everything together. Here's load-data:
    (defn load-data
      "This loads the data about a currency for the 
      given URI."
      [sparql-uri subject col-map]
      (->>
        ;; Initialize the triple store.
        (kb-memstore)
        init-kb
        ;; Get the results.
        (query-sparql-results sparql-uri subject)
        ;; Generate a mapping.
        (map result-to-kv)
        (reduce accum-hash {})
        ;; Translate the keys in the map.
        (rekey col-map)
        ;; And create a dataset.
        to-dataset))
  8. Now, let's use this data. We can define a set of variables to make it easier to reference the namespaces we'll use. We'll use these to create the mapping to column names:
    (def rdfs "http://www.w3.org/2000/01/rdf-schema#")
    (def dbpedia "http:///dbpedia.org/resource/")
    (def dbpedia-ont "http://dbpedia.org/ontology/")
    (def dbpedia-prop "http://dbpedia.org/property/")
    
    (def col-map {(str rdfs 'label) :name,
      (str dbpedia-prop 'usingCountries) :country
      (str dbpedia-prop 'peggedWith) :pegged-with
      (str dbpedia-prop 'symbol) :symbol
      (str dbpedia-prop 'usedBanknotes) :used-banknotes
      (str dbpedia-prop 'usedCoins) :used-coins
      (str dbpedia-prop 'inflationRate) :inflation})
  9. We call load-data with the DBPedia SPARQL endpoint, the resource we want information about (as a symbol), and the column map:
    user=> (def d (load-data "http://dbpedia.org/sparql"
                    (symbol (str dbpedia dbpedia "United_Arab_Emirates_dirham"))
                    col-map))
    user=> (sel d :cols [:country :name :symbol])
    
    |             :country |                       :name | :symbol |
    |----------------------+-----------------------------+---------|
    | United Arab Emirates | United Arab Emirates dirham |       إ.د |

How it works…

The only part of this recipe that has to do with SPARQL, really, is the make-query function. It uses the sparql-select-query function to generate a SPARQL query string from the query pattern. This pattern has to be interpreted in the context of the triple store that has the namespaces defined. This context is set using the binding command. We can see how this function works by calling it from the REPL by itself:

user=> (println 
       (make-query 
         (symbol (str dbpedia "/United_Arab_Emirates_dirham"))
         (init-kb (kb-memstore))))
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?p ?o
WHERE {  <http://dbpedia.org/resource/United_Arab_Emirates_dirham> ?p   ?o .
 FILTER (  ( ! isLiteral(?o)
 ||  (  datatype(?o)  !=<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> )
 ||  (  lang(?o)  = "en" )  )
 )
} LIMIT 200

The rest of the recipe is concerned with parsing the XML format of the results, and in many ways, it's similar to the last recipe.

There's more…

For more information on RDF and linked data, see the previous recipe, Reading RDF data.

You have been reading a chapter from
Clojure Data Analysis Cookbook - Second Edition - Second Edition
Published in: Jan 2015
Publisher:
ISBN-13: 9781784390297
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime