First, you have to add Enlive to the dependencies in the project.clj file:
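Here is a sketch of what that entry might look like. The project name and the version numbers are illustrative, not prescriptive; use whatever recent releases of Clojure, Incanter, and Enlive you have available:

```clojure
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [incanter "1.4.1"]
                 [enlive "1.1.1"]])
```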
Next, use these packages in your REPL or script:
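Something along these lines works; the aliases are my own choice, and the string alias for clojure.string is used by a helper function shown later:

```clojure
(use 'incanter.core)
(require '[net.cgrand.enlive-html :as html]
         '[clojure.string :as string])
(import '[java.net URL])
```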
Finally, identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html, which looks like this:
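I won't vouch for the exact markup or data, but the page is structured roughly like this hypothetical reconstruction: an outer table used for layout wraps the table that carries an ID of data, and a comment pokes fun at the dated approach:

```html
<html>
  <head>
    <title>Small Sample Table</title>
  </head>
  <body>
    <!-- Tables for layout? It's 1999 all over again. -->
    <table>
      <tr>
        <td>
          <table id="data">
            <tr><th>Given Name</th><th>Surname</th><th>Relation</th></tr>
            <tr><td>Gomez</td><td>Addams</td><td>father</td></tr>
            <tr><td>Morticia</td><td>Addams</td><td>mother</td></tr>
            <tr><td>Pugsley</td><td>Addams</td><td>son</td></tr>
          </table>
        </td>
      </tr>
    </table>
  </body>
</html>
```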
It's intentionally stripped down, and it makes use of tables for layout (hence the comment about 1999).
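To make the walkthrough below concrete, here is a sketch of the scraping code being described. It's reconstructed from the steps that follow, so treat it as one plausible rendering rather than the recipe's exact listing; in particular, to-keyword is a helper name I'm assuming for the header-normalization step:

```clojure
(defn to-keyword
  "Normalizes a header string (\"Given Name\") into a keyword (:given-name)."
  [input]
  (-> input
      string/lower-case
      (string/replace \space \-)
      keyword))

(defn load-data
  "Scrapes the table with the ID of data from the page at url into a dataset."
  [url]
  (let [page (html/html-resource (URL. url))
        table (html/select page [:table#data])
        headers (->> (html/select table [:th])
                     (map html/text)
                     (map to-keyword)
                     vec)
        rows (->> (html/select table [:tr])
                  (map #(html/select % [:td]))
                  (map #(map html/text %))
                  (filter seq))]
    (dataset headers rows)))
```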
The let bindings in load-data tell the story here. Let's talk about them one by one.
The first binding has Enlive download the resource and parse it into Enlive's internal representation:
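From the load-data sketch above:

```clojure
page (html/html-resource (URL. url))
```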
The next binding selects the table with the data ID:
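In the sketch, this is an Enlive selector that works like the CSS selector table#data:

```clojure
table (html/select page [:table#data])
```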
Now, select all of the header cells from the table, extract the text from them, convert each to a keyword, and then convert the entire sequence into a vector. This gives the headers for the dataset:
Next, pull out the rows of data. First, select each row individually. The next two steps are wrapped in map so that the cells in each row stay grouped together. In these steps, select the data cells in each row and extract the text from each. Last, use filter with seq, which removes any rows with no data cells, such as the header row:
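From the sketch above. The header row contains th cells but no td cells, so it maps to an empty sequence, and (filter seq ...) drops it:

```clojure
rows (->> (html/select table [:tr])
          (map #(html/select % [:td]))
          (map #(map html/text %))
          (filter seq))
```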
Here's another view of this data. In this image, you can see some of the code alongside this web page's HTML. The variable names and select expressions are placed beside the HTML structures that they match. Hopefully, this makes it clearer how the select expressions correspond to the HTML elements:
Finally, convert everything to a dataset. incanter.core/dataset is a lower-level constructor than incanter.core/to-dataset. It requires you to pass in the column names and the data matrix as separate sequences:
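In the sketch, this is the final expression of load-data:

```clojure
(dataset headers rows)
```

Calling it at the REPL might then look like this, with the shape of the result depending on the actual contents of the page:

```clojure
(load-data
  "http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html")
```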
It's important to realize that the code presented here is the result of a lot of trial and error. Screen scraping usually is. Generally, I download the page and save it so that I don't have to keep requesting it from the web server. Next, I start the REPL and parse the web page there. Then, I can look at the web page and its HTML with the browser's view-source function, and I can examine the data from the web page interactively in the REPL. While working, I copy and paste the code back and forth between the REPL and my text editor, as convenient. This workflow and environment (sometimes called REPL-driven development) makes screen scraping, a fiddly, difficult task at the best of times, almost enjoyable.