RStudio for R Statistical Computing Cookbook

You're reading from RStudio for R Statistical Computing Cookbook: Over 50 practical and useful recipes to help you perform data analysis with R by unleashing every native RStudio feature, by Andrea Cirillo.

Product type: Paperback
Published in: Apr 2016
ISBN-13: 9781784391034
Length: 246 pages
Edition: 1st Edition

Table of Contents (10 chapters)

Preface
1. Acquiring Data for Your Project
2. Preparing for Analysis – Data Cleansing and Manipulation
3. Basic Visualization Techniques
4. Advanced and Interactive Visualization
5. Power Programming with R
6. Domain-specific Applications
7. Developing Static Reports
8. Dynamic Reporting and Web Application Development
Index

Acquiring data from the Web – web scraping tasks

Given the advances in the Internet of Things (IoT) and the progress of cloud computing, we can safely affirm that in the future, a huge part of our data will be available through the Internet, which doesn't necessarily mean it will be public.

It is, therefore, crucial to know how to take that data from the Web and load it into your analytical environment.

You can find data on the Web either in the form of data statically stored on websites (that is, tables on Wikipedia or similar websites) or in the form of data stored on the cloud, which is accessible via APIs.

Leaving API-based access for later recipes, here we will go through all the steps you need to get data statically exposed on websites in the form of tabular and nontabular data.

This specific example will show you how to get data from a specific Wikipedia page, the one about the R programming language: https://en.wikipedia.org/wiki/R_(programming_language).

Getting ready

Data statically exposed on a web page is actually part of the page's code. Getting it from the Web into our R environment requires us to read that code and find exactly where the data is.

Dealing with complex web pages can become a really challenging task, but luckily, SelectorGadget was developed to help you with this job. SelectorGadget is a bookmarklet, developed by Andrew Cantino and Kyle Maxwell, that lets you easily figure out the CSS selector of your data on the web page you are looking at. Basically, the CSS selector can be seen as the address of your data on the web page, and you will need it within the R code that you are going to write to scrape your data from the Web (refer to the next paragraph).

Note

The CSS selector is the token that is used within the CSS code to identify elements of the HTML code based on their name.

CSS selectors are used within the CSS code to identify which elements are to be styled using a given piece of CSS code. For instance, the following script will align all elements (CSS selector *) with 0 margin and 0 padding:

* {
margin: 0;
padding: 0;
}
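For instance, a class selector (a hypothetical one, shown here only for illustration) targets just the elements carrying a given class attribute; this is exactly the kind of selector SelectorGadget will produce later in this recipe:

```css
/* Style only table cells inside tables with class="wikitable" */
.wikitable td {
  padding: 4px;
  border: 1px solid grey;
}
```

The same selector syntax that CSS uses for styling is what we will reuse for scraping: instead of styling the matched elements, we will extract them.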

SelectorGadget currently works only in the Chrome browser, so you will need to install Chrome before carrying on with this recipe. You can download and install the latest version of Chrome from https://www.google.com/chrome/.

SelectorGadget is available as a Chrome extension and also as a bookmarklet; to use the latter, run the following URL from the address bar (or a saved bookmark) while already on the page showing the data you need:

javascript:(function(){
  var%20s=document.createElement('div');
  s.innerHTML='Loading…'
  ;s.style.color='black';
  s.style.padding='20px';
  s.style.position='fixed';
  s.style.zIndex='9999';
  s.style.fontSize='3.0em';
  s.style.border='2px%20solid%20black';
  s.style.right='40px';
  s.style.top='40px';
  s.setAttribute('class','selector_gadget_loading');
  s.style.background='white';
    document.body.appendChild(s);
    s=document.createElement('script');
    s.setAttribute('type','text/javascript');
    s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s);
})();

This long URL shows that the bookmarklet is actually JavaScript code; you can make this out from the javascript: token at the very beginning.

We can further analyze the URL by decomposing it into three main parts, which are as follows:

- The javascript: token at the beginning, which tells the browser that what follows is code to execute rather than an address to navigate to
- The inline script, which builds and styles the Loading… box shown while the tool starts up
- The link to the selectorgadget.js file, which is where SelectorGadget's core functionalities are actually defined and the place they are taken from to make them available to users

That being said, I'm not suggesting that you type this URL out to employ SelectorGadget for your web scraping purposes; I would rather suggest that you look for the Chrome extension or for the link on the official SelectorGadget page, http://selectorgadget.com. Once you find the link on the official page, save it as a bookmark so that it is easily available when you need it.

The other tool we are going to use in this recipe is the rvest package, which offers great web scraping functionalities within the R environment.

To make it available, first install it and then load it into your global environment by running the following:

install.packages("rvest")
library(rvest)

How to do it...

  1. Run SelectorGadget. To do so, after navigating to the web page you are interested in, activate SelectorGadget by running the Chrome extension or clicking on the bookmark that we previously saved.

    In both cases, after activating the gadget, a Loading… message will appear, and then, you will find a bar on the bottom-right corner of your web browser, as shown in the following screenshot:

    (Screenshot: the SelectorGadget bar appearing at the bottom-right corner of the browser)

    You are now ready to select the data you are interested in.

  2. Select the data you are interested in. After clicking on the data you are going to scrape, you will notice that, besides the data you've selected, some other parts of the page turn yellow:
    (Screenshot: the selected data and related page elements highlighted in yellow)

    This is because SelectorGadget is trying to guess what you are looking at by highlighting all the elements included in the CSS selector that it considers to be most useful for you.

    If it is guessing wrong, you just have to click on the wrongly highlighted parts and those will turn red:

    (Screenshot: wrongly highlighted parts turning red after being clicked)

    When you are done with this fine-tuning process, SelectorGadget will have identified the proper selector, and you can move on to the next step.

  3. Find your data location on the page. To do this, all you have to do is copy the CSS selector that you will find in the bar at the bottom-right corner:
    (Screenshot: the CSS selector displayed in the SelectorGadget bar)

    This piece of text will be all you need in order to scrape the web page from R.

  4. The next step is to read data from the Web with the rvest package. The rvest package by Hadley Wickham is one of the most comprehensive packages for web scraping activities in R. Take a look at the There's more... section for further information on package objectives and functionalities.

    For now, it is enough to know that the rvest package lets you download HTML code and read the data stored within the code easily.

    Now, we need to import the HTML code from the web page. First of all, we need to define an object storing all the HTML code of the web page you are looking at:

    page_source <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')

    This code leverages the read_html() function, which retrieves the source code residing at the given URL directly from the Web.
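    If you want to experiment with rvest offline, read_html() also accepts a literal HTML string, so the rest of the pipeline can be tried out without a network connection. The following is a minimal sketch using a made-up two-row table (not the real Wikipedia markup):

```r
library(rvest)

# Parse an inline HTML string instead of fetching a URL (no network needed)
offline_source <- read_html('
  <table class="wikitable">
    <tr><th>Release</th><th>Date</th></tr>
    <tr><td>1.0</td><td>2000-02-29</td></tr>
  </table>')

# The same node-selection step used later in the recipe works here too
cells <- html_nodes(offline_source, ".wikitable th , .wikitable td")
html_text(cells)
# The resulting character vector holds "Release", "Date", "1.0", "2000-02-29"
```

    This makes it easy to check that a selector behaves as expected before pointing it at a live page.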

  5. Next, we will select the defined blocks. Once you have got your HTML code, it is time to extract the part of the code you are interested in. This is done using the html_nodes() function, to which you pass, as an argument, the CSS selector retrieved with SelectorGadget. This will result in a line of code similar to the following:
    version_block <- html_nodes(page_source, ".wikitable th , .wikitable td")

    As you can imagine, this code extracts all the content of the selected nodes, including HTML tags.

    Note

    The HTML language

    HyperText Markup Language (HTML) is a markup language that is used to define the format of web pages.

    The basic idea behind HTML is to structure the web page into a format with a head and body, each of which contains a variable number of tags, which can be considered as subcomponents of the structure.

    The head is used to store information and components that will not be seen by the user but will affect the web page's behavior (for instance, a Google Analytics script used for tracking page visits), while the body contains all the content that will be shown to the reader.

    Since the HTML code is composed of a nested structure, it is common to compare this structure to a tree, and here, different components are also referred to as nodes.
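    As an illustration, here is a minimal, made-up HTML document showing the head/body structure and the nesting that makes the tree metaphor natural:

```html
<html>
  <head>
    <!-- Not rendered: metadata and scripts live here -->
    <title>A minimal page</title>
  </head>
  <body>
    <!-- Rendered content: each nested tag is a node of the tree -->
    <table class="wikitable">
      <tr><th>Release</th><th>Date</th></tr>
    </table>
  </body>
</html>
```

    In this tree, the table is a node, each row is a child node of the table, and each cell is a child node of its row; CSS selectors address nodes by walking this structure.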

    Printing out the version_block object, you will obtain a result similar to the following:

    print(version_block)  
    
    {xml_nodeset (45)}
     [1] <th>Release</th>
     [2] <th>Date</th>
     [3] <th>Description</th>
     [4] <th>0.16</th>
     [5] <td/>
     [6] <td>This is the last <a href="/wiki/Alpha_test" title="Alpha test" class="mw-redirect">alp ...
     [7] <th>0.49</th>
     [8] <td style="white-space:nowrap;">1997-04-23</td>
     [9] <td>This is the oldest available <a href="/wiki/Source_code" title="Source code">source</a ...
    [10] <th>0.60</th>
    [11] <td>1997-12-05</td>
    [12] <td>R becomes an official part of the <a href="/wiki/GNU_Project" title="GNU Project">GNU  ...
    [13] <th>1.0</th>
    [14] <td>2000-02-29</td>
    [15] <td>Considered by its developers stable enough for production use.<sup id="cite_ref-35" cl ...
    [16] <th>1.4</th>
    [17] <td>2001-12-19</td>
    [18] <td>S4 methods are introduced and the first version for <a href="/wiki/Mac_OS_X" title="Ma ...
    [19] <th>2.0</th>
    [20] <td>2004-10-04</td>

    This result is not exactly what you are looking for if you are going to work with this data. However, you don't have to worry about that since we are going to give your text a better shape in the very next step.

  6. In order to obtain a readable and actionable format, we need one more step: extracting text from HTML tags.

    This can be done using the html_text() function, which will return a character vector containing all the text present within the HTML tags:

    content <- html_text(version_block)

    The final result will be a perfectly workable chunk of text containing the data needed for our analysis:

    [1] "Release"                                                                                                                                                                                                                                                                                  
     [2] "Date"                                                                                                                                                                                                                                                                                     
     [3] "Description"                                                                                                                                                                                                                                                                              
     [4] "0.16"                                                                                                                                                                                                                                                                                     
     [5] ""                                                                                                                                                                                                                                                                                         
     [6] "This is the last alpha version developed primarily by Ihaka and Gentleman. Much of the basic functionality from the \"White Book\" (see S history) was implemented. The mailing lists commenced on April 1, 1997."                                                                        
     [7] "0.49"                                                                                                                                                                                                                                                                                     
     [8] "1997-04-23"                                                                                                                                                                                                                                                                               
     [9] "This is the oldest available source release, and compiles on a limited number of Unix-like platforms. CRAN is started on this date, with 3 mirrors that initially hosted 12 packages. Alpha versions of R for Microsoft Windows and Mac OS are made available shortly after this version."
    [10] "0.60"                                                                                                                                                                                                                                                                                     
    [11] "1997-12-05"                                                                                                                                                                                                                                                                               
    [12] "R becomes an official part of the GNU Project. The code is hosted and maintained on CVS."                                                                                                                                                                                                 
    [13] "1.0"                                                                                                                                                                                                                                                                                      
    [14] "2000-02-29"                                                                                                                                                                                                                                                                               
    [15] "Considered by its developers stable enough for production use.[35]"                                                                                                                                                                                                                       
    [16] "1.4"                                                                                                                                                                                                                                                                                      
    [17] "2001-12-19"                                                                                                                                                                                                                                                                               
    [18] "S4 methods are introduced and the first version for Mac OS X is made available soon after."                                                                                                                                                                                               
    [19] "2.0"                                                                                                                                                                                                                                                                                      
    [20] "2004-10-04"                                                                                                                                                                                                                                                                               
    [21] "Introduced lazy loading, which enables fast loading of data with minimal expense of system memory."                                                                                                                                                                                       
    [22] "2.1"                                                                                                                                                                                                                                                                                      
    [23] "2005-04-18"                                                                                                                                                                                                                                                                               
    [24] "Support for UTF-8 encoding, and the beginnings of internationalization and localization for different languages."                                                                                                                                                                         
    [25] "2.11"                                                                                                                                                                                                                                                                                     
    [26] "2010-04-22"                                                                                                                                                                                                                                                                               
    [27] "Support for Windows 64 bit systems."                                                                                                                                                                                                                                                      
    [28] "2.13"                                                                                                                                                                                                                                                                                     
    [29] "2011-04-14"                                                                                                                                                                                                                                                                               
    [30] "Adding a new compiler function that allows speeding up functions by converting them to byte-code."                                                                                                                                                                                        
    [31] "2.14"                                                                                                                                                                                                                                                                                     
    [32] "2011-10-31"                                                                                                                                                                                                                                                                               
    [33] "Added mandatory namespaces for packages. Added a new parallel package."                                                                                                                                                                                                                   
    [34] "2.15"                                                                                                                                                                                                                                                                                     
    [35] "2012-03-30"                                                                                                                                                                                                                                                                               
    [36] "New load balancing functions. Improved serialization speed for long vectors."                                                                                                                                                                                                             
    [37] "3.0"                                                                                                                                                                                                                                                                                      
    [38] "2013-04-03"                                                                                                                                                                                                                                                                               
    [39] "Support for numeric index values 231 and larger on 64 bit systems."                                                                                                                                                                                                                       
    [40] "3.1"                                                                                                                                                                                                                                                                                      
    [41] "2014-04-10"                                                                                                                                                                                                                                                                               
    [42] ""                                                                                                                                                                                                                                                                                         
    [43] "3.2"                                                                                                                                                                                                                                                                                      
    [44] "2015-04-16"                                                                                                                                                                                                                                                                               
    [45] ""    
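    Since the selector returned three cells per table row, one natural follow-up (not part of the original recipe, and assuming every row has all three cells, even if empty) is to fold the flat character vector into a three-column data frame using base R:

```r
# A made-up excerpt of the scraped vector: a header row plus two data rows
content <- c("Release", "Date", "Description",
             "0.49", "1997-04-23", "Oldest available source release.",
             "1.0", "2000-02-29", "Considered stable enough for production use.")

# Drop the three header cells and fold the rest into rows of three cells each
cells <- matrix(content[-(1:3)], ncol = 3, byrow = TRUE)
releases <- as.data.frame(cells, stringsAsFactors = FALSE)
names(releases) <- content[1:3]

releases$Release
# The Release column now holds "0.49" and "1.0"
```

    From here, the usual data-cleaning tools apply, for instance converting the Date column with as.Date().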

There's more...

The following are a few useful resources that will help you get the most out of this recipe:
