Scraping data from tables in web pages
There's data everywhere on the Internet. Unfortunately, a lot of it is difficult to reach. It's buried in tables, articles, or deeply nested div
tags. Web scraping (writing a program that walks over a web page and extracts data from it) is brittle and laborious, but it's often the only way to free this data so it can be used in our analyses. This recipe describes how to load a web page and dig down into its contents so that you can pull the data out.
To do this, we're going to use the Enlive (https://github.com/cgrand/enlive/wiki) library. This uses a domain specific language (DSL, a set of commands that make a small set of tasks very easy and natural) based on CSS selectors to locate elements within a web page. This library can also be used for templating. In this case, we'll just use it to get data back out of a web page.
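As a quick taste of the selector DSL, an Enlive selector is just a Clojure vector of keywords that mirrors a CSS selector. The line below is purely illustrative: parsed-page is a placeholder for an already-parsed page, the selector is hypothetical, and html is the namespace alias we'll set up in the next section.

;; Hypothetical: find every link inside a paragraph with the class "note".
;; The vector [:p.note :a] reads like the CSS selector "p.note a".
(html/select parsed-page [:p.note :a])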
Getting ready
First, you have to add Enlive to the dependencies in the project.clj
file:
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [enlive "1.1.5"]])
Next, use these packages in your REPL or script:
(require '[clojure.string :as string]
         '[net.cgrand.enlive-html :as html]
         '[incanter.core :as i])
(import [java.net URL])
Finally, identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html.
It's intentionally stripped down, and it makes use of tables for layout (hence the comment about 1999).
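The exact markup isn't reproduced here, but based on the selectors and output used later in this recipe, the table is roughly of this shape (an approximation written for illustration, not the verbatim file):

<!-- Approximate structure only; the real page may differ in details. -->
<table id="data">
  <tr><th>Given Name</th><th>Surname</th><th>Relation</th></tr>
  <tr><td>Gomez</td><td>Addams</td><td>father</td></tr>
  <tr><td>Morticia</td><td>Addams</td><td>mother</td></tr>
  <tr><td>Pugsley</td><td>Addams</td><td>brother</td></tr>
  <tr><td>Wednesday</td><td>Addams</td><td>sister</td></tr>
</table>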
How to do it…
- Since this task is a little complicated, let's pull out the steps into several functions:
(defn to-keyword
  "This takes a string and returns a normalized keyword."
  [input]
  (-> input
      string/lower-case
      (string/replace \space \-)
      keyword))

(defn load-data
  "This loads the data from a table at a URL."
  [url]
  (let [page (html/html-resource (URL. url))
        table (html/select page [:table#data])
        headers (->> (html/select table [:tr :th])
                     (map html/text)
                     (map to-keyword)
                     vec)
        rows (->> (html/select table [:tr])
                  (map #(html/select % [:td]))
                  (map #(map html/text %))
                  (filter seq))]
    (i/dataset headers rows)))
- Now, call load-data with the URL you want to load data from:

user=> (load-data (str "http://www.ericrochester.com/"
                       "clj-data-analysis/data/small-sample-table.html"))

| :given-name | :surname | :relation |
|-------------+----------+-----------|
|       Gomez |   Addams |    father |
|    Morticia |   Addams |    mother |
|     Pugsley |   Addams |   brother |
|   Wednesday |   Addams |    sister |
…
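The result is an ordinary Incanter dataset, so the usual accessors apply. For example, something along these lines pulls out a single column (the ds var is just for illustration, and the exact printing of the result may differ):

user=> (def ds (load-data (str "http://www.ericrochester.com/"
                               "clj-data-analysis/data/small-sample-table.html")))
#'user/ds
user=> (i/sel ds :cols :surname)
("Addams" "Addams" "Addams" "Addams" …)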
How it works…
The let bindings in load-data tell the story here. Let's talk about them one by one.
The first binding has Enlive download the resource and parse it into Enlive's internal representation:
(let [page (html/html-resource (URL. url))
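Enlive's internal representation is the same nested node-map structure that clojure.xml uses: each element is a map with :tag, :attrs, and :content keys. Evaluating this binding in the REPL produces something roughly like the following sketch (abbreviated, and the exact nesting depends on the page):

;; Abbreviated sketch of the parsed tree, not the literal output.
({:tag :html,
  :attrs nil,
  :content
  [{:tag :head, :attrs nil, :content [...]}
   {:tag :body, :attrs nil,
    :content [{:tag :table, :attrs {:id "data"}, :content [...]}]}]})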
The next binding selects the table with the data ID:

table (html/select page [:table#data])
Now, select all of the header cells from the table, extract the text from them, convert each to a keyword, and then convert the entire sequence into a vector. This gives the headers for the dataset:

headers (->> (html/select table [:tr :th])
             (map html/text)
             (map to-keyword)
             vec)
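If you evaluate this binding on its own in the REPL, you should end up with a vector of keywords matching the column names in the output shown earlier, along the lines of:

;; What the headers binding should evaluate to for this page.
[:given-name :surname :relation]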
First, select each row individually. The next two steps are wrapped in map so that the cells in each row stay grouped together. In these steps, select the data cells in each row and extract the text from each. Last, use (filter seq), which removes any rows with no data, such as the header row:

rows (->> (html/select table [:tr])
          (map #(html/select % [:td]))
          (map #(map html/text %))
          (filter seq))]
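Evaluated on its own, this binding yields a sequence of sequences of cell text, one inner sequence per data row, roughly:

;; Roughly what the rows binding evaluates to (header row filtered out).
(("Gomez" "Addams" "father")
 ("Morticia" "Addams" "mother")
 ("Pugsley" "Addams" "brother")
 ("Wednesday" "Addams" "sister")
 …)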
Here's another view of this data. In this image, you can see some of the code from this web page. The variable names and select expressions are placed beside the HTML structures that they match. Hopefully, this makes it more clear how the select expressions correspond to the HTML elements:
Finally, convert everything to a dataset. incanter.core/dataset is a lower-level constructor than incanter.core/to-dataset. It requires you to pass in the column names and the data matrix as separate sequences:

(i/dataset headers rows)))
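To make the difference between the two constructors concrete, here is a small comparison; the values are made up for illustration:

;; i/dataset: column names and rows are passed as separate sequences.
(i/dataset [:given-name :surname]
           [["Gomez" "Addams"]
            ["Morticia" "Addams"]])

;; i/to-dataset: a higher-level constructor that infers the columns,
;; for example from a sequence of maps.
(i/to-dataset [{:given-name "Gomez" :surname "Addams"}
               {:given-name "Morticia" :surname "Addams"}])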
It's important to realize that the code, as presented here, is the result of a lot of trial and error. Screen scraping usually is. Generally, I download the page and save it so that I don't have to keep requesting it from the web server. Next, I start the REPL and parse the web page there. Then, I can take a look at the web page and its HTML with the browser's view-source function, and I can examine the data from the web page interactively in the REPL. While working, I copy and paste the code back and forth between the REPL and my text editor, as convenient. This workflow and environment (sometimes called REPL-driven development) makes screen scraping (a fiddly, difficult task at the best of times) almost enjoyable.
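A minimal sketch of that workflow, assuming an arbitrary local filename such as small-sample-table.html (the filename and the page var are just for illustration):

;; Fetch the page once and save it, so repeated experiments don't hit
;; the web server.
(spit "small-sample-table.html"
      (slurp (str "http://www.ericrochester.com/"
                  "clj-data-analysis/data/small-sample-table.html")))

;; Parse the saved copy; html-resource also accepts a java.io.File.
(def page (html/html-resource (java.io.File. "small-sample-table.html")))

;; Poke at it interactively, for example to confirm that the selector
;; actually finds the table before building the full pipeline.
(html/select page [:table#data])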
See also
- The next recipe, Scraping textual data from web pages, has a more involved example of data scraping on an HTML page
- The Aggregating data from different formats recipe has a practical, real-life example of scraping data from a table