Acquiring data from the Web – web scraping tasks
Given the advances in the Internet of Things (IoT) and the progress of cloud computing, we can safely say that in the future a huge part of our data will be available through the Internet, which, on the other hand, doesn't mean it will be public.
It is, therefore, crucial to know how to take that data from the Web and load it into your analytical environment.
You can find data on the Web either in the form of data statically stored on websites (that is, tables on Wikipedia or similar websites) or in the form of data stored on the cloud, which is accessible via APIs.
While API-based acquisition is covered in the API recipes, here we will go through all the steps you need to get data statically exposed on websites in the form of tabular and nontabular data.
This specific example will show you how to get data from a specific Wikipedia page, the one about the R programming language: https://en.wikipedia.org/wiki/R_(programming_language).
Getting ready
Data statically exposed on web pages is actually part of the web page's code. Getting it from the Web into our R environment requires us to read that code and find exactly where the data is.
Dealing with complex web pages can become a really challenging task, but luckily, SelectorGadget was developed to help you with this job. SelectorGadget is a bookmarklet, developed by Andrew Cantino and Kyle Maxwell, that lets you easily figure out the CSS selector of your data on the web page you are looking at. Basically, the CSS selector can be seen as the address of your data on the web page, and you will need it within the R code that you are going to write to scrape your data from the Web (refer to the next paragraph).
Note
The CSS selector is the token that is used within the CSS code to identify elements of the HTML code based on their name.
CSS selectors are used within the CSS code to identify which elements are to be styled by a given piece of CSS code. For instance, the following script will give all elements (CSS selector *) a margin of 0 and a padding of 0:
* { margin: 0; padding: 0; }
SelectorGadget currently works only with the Chrome browser, so you will need to install Chrome before carrying on with this recipe. You can download and install the latest version of Chrome from https://www.google.com/chrome/.
SelectorGadget is available both as a Chrome extension and as a bookmarklet; for the latter, navigate to the following URL while already on the page showing the data you need:
javascript:(function(){ var%20s=document.createElement('div'); s.innerHTML='Loading…' ;s.style.color='black'; s.style.padding='20px'; s.style.position='fixed'; s.style.zIndex='9999'; s.style.fontSize='3.0em'; s.style.border='2px%20solid%20black'; s.style.right='40px'; s.style.top='40px'; s.setAttribute('class','selector_gadget_loading'); s.style.background='white'; document.body.appendChild(s); s=document.createElement('script'); s.setAttribute('type','text/javascript'); s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s); })();
This long URL shows that SelectorGadget is provided as JavaScript; you can make this out from the javascript: token at the very beginning.
We can further analyze the URL by decomposing it into three main parts, which are as follows:
- The creation on the page of a new div element with the document.createElement('div') statement
- The setting of aesthetic attributes, composed of all the s.style… tokens
- The retrieval of the content of the .js file at https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js
The .js file is where SelectorGadget's core functionality is actually defined, and it is from there that it is loaded to be made available to users.
That being said, I'm not suggesting that you try to use this link to employ SelectorGadget for your web scraping purposes, but I would rather suggest that you look for the Chrome extension or at the official SelectorGadget page, http://selectorgadget.com. Once you find the link on the official page, save it as a bookmark so that it is easily available when you need it.
The other tool we are going to use in this recipe is the rvest package, which offers great web scraping functionality within the R environment.
To make it available, you first have to install it and load it into the global environment by running the following:
install.packages("rvest")
library(rvest)
How to do it...
- Run SelectorGadget. To do so, after navigating to the web page you are interested in, activate SelectorGadget by running the Chrome extension or clicking on the bookmark that we previously saved.
In both cases, after activating the gadget, a Loading… message will appear, and then you will find a bar in the bottom-right corner of your web browser, as shown in the following screenshot:
You are now ready to select the data you are interested in.
- Select the data you are interested in. After clicking on the data you are going to scrape, you will note that besides the data you've selected, some other parts of the page turn yellow:
This is because SelectorGadget is trying to guess what you are looking for by highlighting all the elements included in the CSS selector that it considers to be most useful for you.
If it is guessing wrong, you just have to click on the wrongly highlighted parts and those will turn red:
When you are done with this fine-tuning process, SelectorGadget will have identified the proper selector, and you can move on to the next step.
- Find your data location on the page. To do this, all you have to do is copy the CSS selector that you will find in the bar at the bottom-right corner:
This piece of text will be all you need in order to scrape the web page from R.
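If you like, you can keep the copied selector in a character variable right away, so that it is ready to be passed to the scraping functions shown in the next steps (the selector below is the one used later in this recipe for the version-history table; yours may differ):
# store the CSS selector copied from SelectorGadget for later reuse
css_selector <- ".wikitable th , .wikitable td"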
- The next step is to read data from the Web with the rvest package. The rvest package by Hadley Wickham is one of the most comprehensive packages for web scraping activities in R. Take a look at the There's more... section for further information on package objectives and functionalities. For now, it is enough to know that the rvest package lets you download HTML code and easily read the data stored within that code. Now, we need to import the HTML code from the web page. First of all, we need to define an object storing all the HTML code of the web page you are looking at:
page_source <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')
This code leverages the read_html() function, which retrieves the source code residing at the given URL directly from the Web.
- Next, we will select the defined blocks. Once you have got your HTML code, it is time to extract the part of the code you are interested in. This is done using the html_nodes() function, to which the CSS selector retrieved using SelectorGadget is passed as an argument. This will result in a line of code similar to the following:
version_block <- html_nodes(page_source, ".wikitable th , .wikitable td")
As you can imagine, this code extracts all the content of the selected nodes, including HTML tags.
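As an aside, when the data you are after sits in a regular HTML table, rvest can also convert it straight into a data frame with the html_table() function. The following is a minimal sketch, under the assumption that the version-history table is the first node on the page carrying the wikitable class:
# alternative: turn the whole version-history table into a data frame
# (fill = TRUE pads rows with missing cells in older rvest versions; newer versions ignore it)
version_table <- html_table(html_nodes(page_source, ".wikitable")[[1]], fill = TRUE)
head(version_table)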
Note
The HTML language
HyperText Markup Language (HTML) is a markup language that is used to define the format of web pages.
The basic idea behind HTML is to structure the web page into a format with a head and body, each of which contains a variable number of tags, which can be considered as subcomponents of the structure.
The head is used to store information and components that will not be seen by the user but will affect the web page's behavior, for instance, a Google Analytics script used for tracking page visits. The body contains all the content that will be shown to the reader.
Since HTML code has a nested structure, it is common to compare it to a tree, where the different components are also referred to as nodes.
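To make the head/body/node idea concrete, here is a toy sketch that parses a minimal HTML document written as a plain string and inspects its nodes with the same rvest functions used in this recipe:
# parse a tiny, self-contained HTML document from a string
toy_page <- read_html("<html>
  <head><title>A toy page</title></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>")

html_nodes(toy_page, "p")              # the two <p> nodes living in the body
html_text(html_nodes(toy_page, "p"))   # their text content only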
Printing out the version_block object, you will obtain a result similar to the following:
print(version_block)
{xml_nodeset (45)}
[1] <th>Release</th>
[2] <th>Date</th>
[3] <th>Description</th>
[4] <th>0.16</th>
[5] <td/>
[6] <td>This is the last <a href="/wiki/Alpha_test" title="Alpha test" class="mw-redirect">alp ...
[7] <th>0.49</th>
[8] <td style="white-space:nowrap;">1997-04-23</td>
[9] <td>This is the oldest available <a href="/wiki/Source_code" title="Source code">source</a ...
[10] <th>0.60</th>
[11] <td>1997-12-05</td>
[12] <td>R becomes an official part of the <a href="/wiki/GNU_Project" title="GNU Project">GNU ...
[13] <th>1.0</th>
[14] <td>2000-02-29</td>
[15] <td>Considered by its developers stable enough for production use.<sup id="cite_ref-35" cl ...
[16] <th>1.4</th>
[17] <td>2001-12-19</td>
[18] <td>S4 methods are introduced and the first version for <a href="/wiki/Mac_OS_X" title="Ma ...
[19] <th>2.0</th>
[20] <td>2004-10-04</td>
This result is not exactly what you are looking for if you are going to work with this data. However, you don't have to worry about that since we are going to give your text a better shape in the very next step.
- In order to obtain a readable and actionable format, we need one more step: extracting text from HTML tags.
This can be done using the html_text() function, which will result in a character vector containing all the text present within the HTML tags:
content <- html_text(version_block)
The final result will be a perfectly workable chunk of text containing the data needed for our analysis:
[1] "Release" [2] "Date" [3] "Description" [4] "0.16" [5] "" [6] "This is the last alpha version developed primarily by Ihaka and Gentleman. Much of the basic functionality from the \"White Book\" (see S history) was implemented. The mailing lists commenced on April 1, 1997." [7] "0.49" [8] "1997-04-23" [9] "This is the oldest available source release, and compiles on a limited number of Unix-like platforms. CRAN is started on this date, with 3 mirrors that initially hosted 12 packages. Alpha versions of R for Microsoft Windows and Mac OS are made available shortly after this version." [10] "0.60" [11] "1997-12-05" [12] "R becomes an official part of the GNU Project. The code is hosted and maintained on CVS." [13] "1.0" [14] "2000-02-29" [15] "Considered by its developers stable enough for production use.[35]" [16] "1.4" [17] "2001-12-19" [18] "S4 methods are introduced and the first version for Mac OS X is made available soon after." [19] "2.0" [20] "2004-10-04" [21] "Introduced lazy loading, which enables fast loading of data with minimal expense of system memory." [22] "2.1" [23] "2005-04-18" [24] "Support for UTF-8 encoding, and the beginnings of internationalization and localization for different languages." [25] "2.11" [26] "2010-04-22" [27] "Support for Windows 64 bit systems." [28] "2.13" [29] "2011-04-14" [30] "Adding a new compiler function that allows speeding up functions by converting them to byte-code." [31] "2.14" [32] "2011-10-31" [33] "Added mandatory namespaces for packages. Added a new parallel package." [34] "2.15" [35] "2012-03-30" [36] "New load balancing functions. Improved serialization speed for long vectors." [37] "3.0" [38] "2013-04-03" [39] "Support for numeric index values 231 and larger on 64 bit systems." [40] "3.1" [41] "2014-04-10" [42] "" [43] "3.2" [44] "2015-04-16" [45] ""
There's more...
The following are a few useful resources that will help you get the most out of this recipe:
- A useful list of HTML tags, to show you how HTML files are structured and how to identify code that you need to get from these files, is provided at http://www.w3schools.com/tags/tag_code.asp
- The blog post from the RStudio guys introducing the rvest package and highlighting some package functionalities can be found at http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/