Clojure Data Analysis Cookbook - Second Edition

Dive into data analysis with Clojure through over 100 practical recipes for every stage of the analysis and collection process

Product type: Paperback
Published: January 2015
Publisher: Packt Publishing
ISBN-13: 9781784390297
Length: 372 pages
Edition: 2nd Edition
Author: Eric Richard Rochester
Table of Contents

Preface
1. Importing Data for Analysis
2. Cleaning and Validating Data
3. Managing Complexity with Concurrent Programming
4. Improving Performance with Parallel Programming
5. Distributed Data Processing with Cascalog
6. Working with Incanter Datasets
7. Statistical Data Analysis with Incanter
8. Working with Mathematica and R
9. Clustering, Classifying, and Working with Weka
10. Working with Unstructured and Textual Data
11. Graphing in Incanter
12. Creating Charts for the Web
Index

Scraping data from tables in web pages

There's data everywhere on the Internet. Unfortunately, a lot of it is difficult to reach. It's buried in tables, articles, or deeply nested div tags. Web scraping (writing a program that walks over a web page and extracts data from it) is brittle and laborious, but it's often the only way to free this data so it can be used in our analyses. This recipe describes how to load a web page and dig down into its contents so that you can pull the data out.

To do this, we're going to use the Enlive (https://github.com/cgrand/enlive/wiki) library. It uses a domain-specific language (DSL, a set of commands that make a small set of tasks very easy and natural) based on CSS selectors to locate elements within a web page. This library can also be used for templating. In this case, we'll just use it to get data back out of a web page.
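
For instance, here is a minimal sketch (using a made-up HTML fragment, not the recipe's data) of how Enlive's selector vectors mirror CSS selectors:

(require '[net.cgrand.enlive-html :as html])

;; Parse an HTML fragment into Enlive's node representation.
(def snippet
  (html/html-snippet
    "<div id=\"people\"><span class=\"name\">Gomez</span></div>"))

;; The vector [:div#people :span.name] plays the role of the
;; CSS selector "div#people span.name".
(html/select snippet [:div#people :span.name])
;; => ({:tag :span, :attrs {:class "name"}, :content ("Gomez")})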

Getting ready

First, you have to add Enlive to the dependencies in the project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [enlive "1.1.5"]])

Next, use these packages in your REPL or script:

(require '[clojure.string :as string]
         '[net.cgrand.enlive-html :as html]
         '[incanter.core :as i])
(import [java.net URL])

Finally, identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html, which looks like this:

[Figure: the small-sample-table.html page as rendered in a browser]

It's intentionally stripped down, and it makes use of tables for layout (hence the comment about 1999).
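
The screenshot isn't reproduced here, but based on the selectors used in this recipe and its output, the table's markup is roughly as follows (a hypothetical reconstruction, not the file's exact source):

<table id="data">
  <tr><th>Given Name</th><th>Surname</th><th>Relation</th></tr>
  <tr><td>Gomez</td><td>Addams</td><td>father</td></tr>
  <tr><td>Morticia</td><td>Addams</td><td>mother</td></tr>
  <tr><td>Pugsley</td><td>Addams</td><td>brother</td></tr>
  <tr><td>Wednesday</td><td>Addams</td><td>sister</td></tr>
</table>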

How to do it…

  1. Since this task is a little complicated, let's pull out the steps into several functions:
    (defn to-keyword
      "This takes a string and returns a normalized keyword."
      [input]
      (-> input
        string/lower-case
        (string/replace \space \-)
        keyword))
    
    (defn load-data
      "This loads the data from a table at a URL."
      [url]
      (let [page (html/html-resource (URL. url))
            table (html/select page [:table#data])
            headers (->>
                      (html/select table [:tr :th])
                      (map html/text)
                      (map to-keyword)
                      vec)
            rows (->> (html/select table [:tr])
                   (map #(html/select % [:td]))
                   (map #(map html/text %))
                   (filter seq))]
        (i/dataset headers rows)))
  2. Now, call load-data with the URL you want to load data from:
    user=> (load-data (str "http://www.ericrochester.com/"
            "clj-data-analysis/data/small-sample-table.html"))
    | :given-name | :surname |   :relation |
    |-------------+----------+-------------|
    |       Gomez |   Addams |      father |
    |    Morticia |   Addams |      mother |
    |     Pugsley |   Addams |     brother |
    |   Wednesday |   Addams |      sister |
    …

How it works…

The let bindings in load-data tell the story here. Let's talk about them one by one.

The first binding has Enlive download the resource and parse it into Enlive's internal representation:

  (let [page (html/html-resource (URL. url))

The next binding selects the table with the ID data. The selector vector [:table#data] corresponds to the CSS selector table#data:

        table (html/select page [:table#data])

Now, select all of the header cells from the table, extract the text from them, convert each to a keyword, and then convert the entire sequence into a vector. This gives us the headers for the dataset:

        headers (->>
                  (html/select table [:tr :th])
                  (map html/text)
                  (map to-keyword)
                  vec)
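
Assuming the header cells read Given Name, Surname, and Relation (which matches the output shown earlier), this stage works out like this:

(to-keyword "Given Name")
;; => :given-name

;; For the sample table, headers then evaluates to:
;; [:given-name :surname :relation]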

First, select each row individually. The next two steps are wrapped in map so that the cells in each row stay grouped together. In these steps, select the data cells in each row and extract the text from each. Last, use (filter seq), which removes any rows with no data, such as the header row:

        rows (->> (html/select table [:tr])
               (map #(html/select % [:td]))
               (map #(map html/text %))
               (filter seq))]
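
At this point, rows is a sequence of per-row sequences of cell text. For the sample page, it looks something like this (abridged, based on the recipe's output):

;; rows =>
;; (("Gomez" "Addams" "father")
;;  ("Morticia" "Addams" "mother")
;;  ("Pugsley" "Addams" "brother")
;;  ("Wednesday" "Addams" "sister")
;;  ...)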

Here's another view of this data. The following figure shows some of the code from this web page, with the variable names and select expressions placed beside the HTML structures that they match. Hopefully, this makes it clearer how the select expressions correspond to the HTML elements:

[Figure: the page's HTML source annotated with the matching variable names and select expressions]

Finally, convert everything to a dataset. incanter.core/dataset is a lower-level constructor than incanter.core/to-dataset. It requires you to pass in the column names and the data matrix as separate sequences:

    (i/dataset headers rows)))
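
To see the difference in miniature, here is a sketch with literal data (not tied to the web page):

;; dataset: column names and rows are passed separately.
(i/dataset [:given-name :surname]
           [["Gomez" "Addams"]
            ["Morticia" "Addams"]])

;; to-dataset: infers the column names, for example, from maps.
(i/to-dataset [{:given-name "Gomez" :surname "Addams"}
               {:given-name "Morticia" :surname "Addams"}])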

It's important to realize that the code, as presented here, is the result of a lot of trial and error. Screen scraping usually is. Generally, I download the page and save it, so I don't have to keep requesting it from the web server. Next, I start the REPL and parse the web page there. Then, I can take a look at the web page and HTML with the browser's view source function, and I can examine the data from the web page interactively in the REPL. While working, I copy and paste the code back and forth between the REPL and my text editor, as it's convenient. This workflow and environment (sometimes called REPL-driven development) makes screen scraping (a fiddly, difficult task at the best of times) almost enjoyable.
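
One way to set up that workflow is sketched below; the local filename is arbitrary, and html-resource accepts a java.io.File as well as a URL:

(require '[net.cgrand.enlive-html :as html])
(import '[java.io File])

;; Download the page once and cache it on disk, so repeated
;; REPL experiments don't hit the web server.
(spit "small-sample-table.html"
      (slurp (str "http://www.ericrochester.com/"
                  "clj-data-analysis/data/small-sample-table.html")))

;; Parse the cached copy while iterating on the selectors.
(def page (html/html-resource (File. "small-sample-table.html")))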

See also

  • The next recipe, Scraping textual data from web pages, has a more involved example of data scraping on an HTML page
  • The Aggregating data from different formats recipe has a practical, real-life example of data scraping in a table