Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Mastering Data analysis with R
Mastering Data analysis with R

Mastering Data analysis with R: Gain sharp insights into your data and solve real-world data science problems with R—from data munging to modeling and visualization

eBook
$32.99 $47.99
Paperback
$60.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Table of content icon View table of contents Preview book icon Preview Book

Mastering Data analysis with R

Chapter 2. Getting Data from the Web

It happens pretty often that we want to use data in a project that is not yet available in our databases or on our disks, but can be found on the Internet. In such situations, one option might be to get the IT department or a data engineer at our company to extend our data warehouse to scrape, process, and load the data into our database as shown in the following diagram:

Getting Data from the Web

On the other hand, if we have no ETL system (to Extract, Transform, and Load data) or simply just cannot wait a few weeks for the IT department to implement our request, we are on our own. This is pretty standard for the data scientist, as most of the time we are developing prototypes that can be later transformed into products by software developers. To this end, a variety of skills are required in the daily round, including the following topics that we will cover in this chapter:

  • Downloading data programmatically from the Web
  • Processing XML and JSON formats
  • Scraping and parsing...

Loading datasets from the Internet

The most obvious task is to download datasets from the Web and load those into our R session in two manual steps:

  1. Save the datasets to disk.
  2. Read those with standard functions, such as read.table or for example foreign::read.spss, to import sav files.

But we can often save some time by skipping the first step and loading the flat text data files directly from the URL. The following example fetches a comma-separated file from the Americas Open Geocode (AOG) database at http://opengeocode.org, which contains the government, national statistics, geological information, and post office websites for the countries of the world:

> str(read.csv('http://opengeocode.org/download/CCurls.txt'))
'data.frame':  249 obs. of  5 variables:
 $ ISO.3166.1.A2                  : Factor w/ 248 levels "AD" ...
 $ Government.URL                 : Factor w/ 232 levels ""  ...
 $ National.Statistics.Census..URL: Factor w/ 213 levels &quot...

Other popular online data formats

Structured data is often available in XML or JSON formats on the Web. The high popularity of these two formats is due to the fact that both are human-readable, easy to handle from a programmatic point of view, and can manage any type of hierarchical data structure, not just a simple tabular design, as CSV files are.

Note

JSON is originally derived from JavaScript Object Notation, which recently became one of the top, most-used standards for human-readable data exchange format. JSON is considered to be a low-overhead alternative to XML with attribute-value pairs, although it also supports a wide variety of object types such as number, string, boolean, ordered lists, and associative arrays. JSON is highly used in Web applications, services, and APIs.

Of course, R also supports loading (and saving) data in JSON. Let's demonstrate that by fetching some data from the previous example via the Socrata API (more on that later in the R packages to interact with...

Reading data from HTML tables

According to the traditional document formats on the World Wide Web, most texts and data are served in HTML pages. We can often find interesting pieces of information in for example HTML tables, from which it's pretty easy to copy and paste data into an Excel spreadsheet, save that to disk, and load it to R afterwards. But it takes time, it's boring, and can be automated anyway.

Such HTML tables can be easily generated with the help of the aforementioned API of the Customer Compliant Database. If we do not set the required output format for which we used XML or JSON earlier, then the browser returns a HTML table instead, as you should be able to see in the following screenshot:

Reading data from HTML tables

Well, in the R console it's a bit more complicated as the browser sends some non-default HTTP headers while using curl, so the preceding URL would simply return a JSON list. To get HTML, let the server know that we expect HTML output. To do so, simply set the appropriate...

Scraping data from other online sources

Although the readHTMLTable function is very useful, sometimes the data is not structured in tables, but rather it's available only as HTML lists. Let's demonstrate such a data format by checking all the R packages listed in the relevant CRAN Task View at http://cran.r-project.org/web/views/WebTechnologies.html, as you can see in the following screenshot:

Scraping data from other online sources

So we see a HTML list of the package names along with a URL pointing to the CRAN, or in some cases to the GitHub repositories. To proceed, first we have to get acquainted a bit with the HTML sources to see how we can parse them. You can do that easily either in Chrome or Firefox: just right-click on the CRAN packages heading at the top of the list, and choose Inspect Element, as you can see in the following screenshot:

Scraping data from other online sources

So we have the list of related R packages in an ul (unordered list) HTML tag, just after the h3 (level 3 heading) tag holding the CRAN packages string.

In short:

  • We have to parse...

R packages to interact with data source APIs

Although it's great that we can read HTML tables, CSV files and JSON and XML data, and even parse raw HTML documents to store some parts of those in a dataset, there is no sense in spending too much time developing custom tools until we have no other option. First, always start with a quick look on the Web Technologies and Services CRAN Task View; also search R-bloggers, StackOverflow, and GitHub for any possible solution before getting your hands dirty with custom XPath selectors and JSON list magic.

Socrata Open Data API

Let's do this for our previous examples by searching for Socrata, the Open Data Application Program Interface of the Consumer Financial Protection Bureau. Yes, there is a package for that:

> library(RSocrata)
Loading required package: httr
Loading required package: RJSONIO

Attaching package: 'RJSONIO'

The following objects are masked from 'package:rjson':

    fromJSON, toJSON

As a matter of...

Loading datasets from the Internet


The most obvious task is to download datasets from the Web and load those into our R session in two manual steps:

  1. Save the datasets to disk.

  2. Read those with standard functions, such as read.table or for example foreign::read.spss, to import sav files.

But we can often save some time by skipping the first step and loading the flat text data files directly from the URL. The following example fetches a comma-separated file from the Americas Open Geocode (AOG) database at http://opengeocode.org, which contains the government, national statistics, geological information, and post office websites for the countries of the world:

> str(read.csv('http://opengeocode.org/download/CCurls.txt'))
'data.frame':  249 obs. of  5 variables:
 $ ISO.3166.1.A2                  : Factor w/ 248 levels "AD" ...
 $ Government.URL                 : Factor w/ 232 levels ""  ...
 $ National.Statistics.Census..URL: Factor w/ 213 levels ""  ...
 $ Geological.Information.URL     : Factor...
Left arrow icon Right arrow icon

Description

Gain sharp insights into your data and solve real-world data science problems with R—from data munging to modeling and visualization About This Book Handle your data with precision and care for optimal business intelligence Restructure and transform your data to inform decision-making Packed with practical advice and tips to help you get to grips with data mining Who This Book Is For If you are a data scientist or R developer who wants to explore and optimize your use of R’s advanced features and tools, this is the book for you. A basic knowledge of R is required, along with an understanding of database logic. What You Will Learn Connect to and load data from R’s range of powerful databases Successfully fetch and parse structured and unstructured data Transform and restructure your data with efficient R packages Define and build complex statistical models with glm Develop and train machine learning algorithms Visualize social networks and graph data Deploy supervised and unsupervised classification algorithms Discover how to visualize spatial data with R In Detail R is an essential language for sharp and successful data analysis. Its numerous features and ease of use make it a powerful way of mining, managing, and interpreting large sets of data. In a world where understanding big data has become key, by mastering R you will be able to deal with your data effectively and efficiently. This book will give you the guidance you need to build and develop your knowledge and expertise. Bridging the gap between theory and practice, this book will help you to understand and use data for a competitive advantage. Beginning with taking you through essential data mining and management tasks such as munging, fetching, cleaning, and restructuring, the book then explores different model designs and the core components of effective analysis. You will then discover how to optimize your use of machine learning algorithms for classification and recommendation systems beside the traditional and more recent statistical methods. Style and approach Covering the essential tasks and skills within data science, Mastering Data Analysis provides you with solutions to the challenges of data science. Each section gives you a theoretical overview before demonstrating how to put the theory to work with real-world use cases and hands-on examples.

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Sep 30, 2015
Length: 396 pages
Edition : 1st
Language : English
ISBN-13 : 9781783982035
Category :
Languages :
Concepts :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want

Product Details

Publication date : Sep 30, 2015
Length: 396 pages
Edition : 1st
Language : English
ISBN-13 : 9781783982035
Category :
Languages :
Concepts :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 170.97
Machine Learning with R
$54.99
R for Data Science
$54.99
Mastering Data analysis with R
$60.99
Total $ 170.97 Stars icon

Table of Contents

16 Chapters
1. Hello, Data! Chevron down icon Chevron up icon
2. Getting Data from the Web Chevron down icon Chevron up icon
3. Filtering and Summarizing Data Chevron down icon Chevron up icon
4. Restructuring Data Chevron down icon Chevron up icon
5. Building Models (authored by Renata Nemeth and Gergely Toth) Chevron down icon Chevron up icon
6. Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth) Chevron down icon Chevron up icon
7. Unstructured Data Chevron down icon Chevron up icon
8. Polishing Data Chevron down icon Chevron up icon
9. From Big to Small Data Chevron down icon Chevron up icon
10. Classification and Clustering Chevron down icon Chevron up icon
11. Social Network Analysis of the R Ecosystem Chevron down icon Chevron up icon
12. Analyzing Time-series Chevron down icon Chevron up icon
13. Data Around Us Chevron down icon Chevron up icon
14. Analyzing the R Community Chevron down icon Chevron up icon
A. References Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5
(2 Ratings)
5 star 50%
4 star 50%
3 star 0%
2 star 0%
1 star 0%
Fabien Deneuville Mar 27, 2016
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Ce livre est parfait pour les gens qui connaissent déjà bien R et veulent aller plus loin.J'ai découvert des choses que je ne soupçonnais pas, j'ai particulièrement apprécié le chapitre sur les graphes, celui sur les séries temporelles, le clustering... en fait il s'agit de zooms très utiles pour qui veut aller plus loin dans l'utilisation de R pour le data mining.
Amazon Verified review Amazon
Duncan W. Robinson Oct 29, 2015
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
I found "Mastering Data Analysis with R" useful & very readable. I especially liked the sections dealing with obtaining, filtering & manipulating data. Though some of this information can be found in various R package vignettes, much of it, including useful tips, is all in one place here. For example, I found the section on using functions from the R package "dplyr" useful for merging data sets. The reader of this book might do well to follow-up this volume with Hadley Wickham’s "Advanced R". I also enjoyed the treatment given to importing the corpus for text analysis. Python generally seems to be a bit easier to navigate for the pre-processing step when performing text analytics. This book made conducting the process in R much easier for me.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.