You're reading from Clojure Data Analysis Cookbook - Second Edition Dive into data analysis with Clojure through over 100 practical recipes for every stage of the analysis and collection process

Product type Paperback

Published in Jan 2015

Publisher

ISBN-13 9781784390297

Length 372 pages

Edition 2nd Edition

Languages

Clojure

Tools

Leiningen

Concepts

Data Analysis

Author (1):

Eric Richard Rochester


Table of Contents (14) Chapters

Preface

1. Importing Data for Analysis

2. Cleaning and Validating Data

3. Managing Complexity with Concurrent Programming

4. Improving Performance with Parallel Programming

5. Distributed Data Processing with Cascalog

6. Working with Incanter Datasets

7. Statistical Data Analysis with Incanter

8. Working with Mathematica and R

9. Clustering, Classifying, and Working with Weka

10. Working with Unstructured and Textual Data

11. Graphing in Incanter

12. Creating Charts for the Web

Index

Scraping data from tables in web pages

There's data everywhere on the Internet. Unfortunately, a lot of it is difficult to reach. It's buried in tables, articles, or deeply nested div tags. Web scraping (writing a program that walks over a web page and extracts data from it) is brittle and laborious, but it's often the only way to free this data so it can be used in our analyses. This recipe describes how to load a web page and dig down into its contents so that you can pull the data out.

To do this, we're going to use the Enlive (https://github.com/cgrand/enlive/wiki) library. It provides a domain-specific language (DSL, a set of commands that makes a small set of tasks very easy and natural) based on CSS selectors to locate elements within a web page. This library can also be used for templating, but in this case, we'll just use it to get data back out of a web page.

Getting ready

First, you have to add Enlive to the dependencies in the project.clj file:

(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [enlive "1.1.5"]])

Next, use these packages in your REPL or script:

(require '[clojure.string :as string]
         '[net.cgrand.enlive-html :as html]
         '[incanter.core :as i])
(import [java.net URL])

Finally, identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html, which looks like this:

It's intentionally stripped down, and it makes use of tables for layout (hence the comment about 1999).

How to do it…

Since this task is a little complicated, let's pull out the steps into several functions:

(defn to-keyword
  "This takes a string and returns a normalized keyword."
  [input]
  (-> input
    string/lower-case
    (string/replace \space \-)
    keyword))

(defn load-data
  "This loads the data from a table at a URL."
  [url]
  (let [page (html/html-resource (URL. url))
        table (html/select page [:table#data])
        headers (->>
                  (html/select table [:tr :th])
                  (map html/text)
                  (map to-keyword)
                  vec)
        rows (->> (html/select table [:tr])
               (map #(html/select % [:td]))
               (map #(map html/text %))
               (filter seq))]
    (i/dataset headers rows)))

Now, call load-data with the URL you want to load data from:

user=> (load-data (str "http://www.ericrochester.com/"
        "clj-data-analysis/data/small-sample-table.html"))
| :given-name | :surname |   :relation |
|-------------+----------+-------------|
|       Gomez |   Addams |      father |
|    Morticia |   Addams |      mother |
|     Pugsley |   Addams |     brother |
|   Wednesday |   Addams |      sister |
…

How it works…

The let bindings in load-data tell the story here. Let's talk about them one by one.

The first binding has Enlive download the resource and parse it into Enlive's internal representation:

  (let [page (html/html-resource (URL. url))

The next binding selects the table with the data ID:

        table (html/select page [:table#data])
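To try the selectors without touching the network, here is a minimal sketch that parses an in-memory page. The HTML string is a hypothetical stand-in (the markup of the real sample page may differ), and it relies on html-resource accepting a java.io.Reader as well as a URL:

```clojure
(require '[net.cgrand.enlive-html :as html])
(import '[java.io StringReader])

;; Hypothetical markup, loosely modeled on the sample page.
(def sample-html
  (str "<html><body>"
       "<table id=\"layout\"><tr><td>Not the data</td></tr></table>"
       "<table id=\"data\">"
       "<tr><th>Given Name</th><th>Surname</th></tr>"
       "<tr><td>Gomez</td><td>Addams</td></tr>"
       "</table>"
       "</body></html>"))

;; Parse from a Reader instead of a URL, so nothing is downloaded.
(def sample-page (html/html-resource (StringReader. sample-html)))

;; [:table#data] matches only the table element whose id is "data".
(def sample-table (html/select sample-page [:table#data]))

;; Selectors applied to sample-table only see nodes inside that table,
;; so the cell in the layout table is ignored.
(def header-cells (map html/text (html/select sample-table [:tr :th])))
(def data-cells (map html/text (html/select sample-table [:tr :td])))
```

Each later select call works on the result of the previous one, which is why the recipe can narrow the search to one table first and then look for rows and cells inside it.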

Now, select all of the header cells from the table, extract the text from them, convert each to a keyword, and then convert the entire sequence into a vector. This gives the headers for the dataset:

        headers (->>
                  (html/select table [:tr :th])
                  (map html/text)
                  (map to-keyword)
                  vec)
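The last two steps of this pipeline can be tried on their own with plain strings. In this sketch, the header strings are an assumption inferred from the column names in the output shown earlier, not text copied from the page:

```clojure
(require '[clojure.string :as string])

(defn to-keyword
  "Lower-cases a string, replaces spaces with hyphens, and returns a keyword."
  [input]
  (-> input
      string/lower-case
      (string/replace \space \-)
      keyword))

;; Assumed header text, standing in for the output of (map html/text ...).
(def header-text ["Given Name" "Surname" "Relation"])

(def sample-headers (->> header-text (map to-keyword) vec))
;; sample-headers is [:given-name :surname :relation]
```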

First, select each row individually. The next two steps are wrapped in map so that the cells in each row stay grouped together. In these steps, select the data cells in each row and extract the text from each. Last, use filter seq, which removes any rows with no data, such as the header row (seq returns nil for an empty sequence, so filter drops it):

        rows (->> (html/select table [:tr])
               (map #(html/select % [:td]))
               (map #(map html/text %))
               (filter seq))]
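The effect of the final filter step can be seen with plain data. The nested sequence below is a hypothetical stand-in for what the two map steps produce: one inner sequence per table row, with an empty one for the header row because it contains no td cells:

```clojure
;; Hypothetical output of the two map steps, one entry per <tr>.
(def cell-texts
  [()                                ; header row: no <td> cells
   ["Gomez" "Addams" "father"]
   ["Morticia" "Addams" "mother"]])

;; (seq coll) is nil for an empty collection and truthy otherwise,
;; so filter keeps only the rows that actually contain cells.
(def data-rows (filter seq cell-texts))
;; data-rows is (["Gomez" "Addams" "father"] ["Morticia" "Addams" "mother"])
```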

Here's another view of this data. In this image, you can see some of the code from this web page. The variable names and select expressions are placed beside the HTML structures that they match. Hopefully, this makes it clearer how the select expressions correspond to the HTML elements:

Finally, convert everything to a dataset. incanter.core/dataset is a lower-level constructor than incanter.core/to-dataset. It requires you to pass in the column names and the data matrix as separate sequences:

    (i/dataset headers rows)))
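As a rough illustration of the difference between the two constructors, the following sketch builds small datasets from hand-written values. The rows are taken from the output shown earlier, and the names sample-ds and map-ds are made up for this example:

```clojure
(require '[incanter.core :as i])

;; dataset: column names and rows are passed as separate sequences.
(def sample-ds
  (i/dataset [:given-name :surname :relation]
             [["Gomez" "Addams" "father"]
              ["Morticia" "Addams" "mother"]]))

(i/col-names sample-ds) ; the column names, in the order given
(i/nrow sample-ds)      ; the number of rows

;; to-dataset: works out the column names itself, here from the map keys.
(def map-ds
  (i/to-dataset [{:given-name "Gomez" :surname "Addams"}
                 {:given-name "Morticia" :surname "Addams"}]))
```

Since the scraper already produces the headers and the rows separately, the lower-level dataset constructor is the more direct fit.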

It's important to realize that the code, as presented here, is the result of a lot of trial and error. Screen scraping usually is. Generally, I download the page and save it, so I don't have to keep requesting it from the web server. Next, I start the REPL and parse the web page there. Then, I can take a look at the web page and HTML with the browser's view source function, and I can examine the data from the web page interactively in the REPL. While working, I copy and paste the code back and forth between the REPL and my text editor, as it's convenient. This workflow and environment (sometimes called REPL-driven development) makes screen scraping (a fiddly, difficult task at the best of times) almost enjoyable.
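The "download the page and save it" step could look something like the following. This is only a sketch: cached-page is a hypothetical helper, not part of the recipe or of Enlive, and it does no error handling:

```clojure
(require '[clojure.java.io :as io])

(defn cached-page
  "Hypothetical helper: returns a java.io.File holding a local copy of
  url, downloading it only if nothing exists at path yet."
  [url path]
  (let [f (io/file path)]
    (when-not (.exists f)
      ;; slurp reads the whole resource; spit writes it to disk.
      (spit f (slurp url)))
    f))
```

Because html-resource also accepts a java.io.File, the result can be parsed with (html/html-resource (cached-page url "small-sample-table.html")), where the filename is an arbitrary choice. Repeated experiments in the REPL then read from disk instead of hitting the web server.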
