RStudio for R Statistical Computing Cookbook
RStudio for R Statistical Computing Cookbook: Over 50 practical and useful recipes to help you perform data analysis with R by unleashing every native RStudio feature

Chapter 2. Preparing for Analysis – Data Cleansing and Manipulation

In this chapter, we will cover the following topics:

  • Getting a sense of your data structure with R
  • Preparing your data for analysis with the tidyr package
  • Detecting missing values
  • Substituting missing values by interpolation
  • Detecting and removing outliers
  • Performing data filtering activities

Introduction

Some studies estimate that data preparation activities account for 80 percent of the time invested in data science projects.

I know you will not be surprised reading this number. Data preparation is the phase in data science projects where you take your data from the chaotic world around you and fit it into some precise structures and standards.

This is far from a simple task: it involves a large number of techniques that let you change the structure of your data and ensure you can work with it.

This chapter will show you recipes that should give you the ability to prepare the data you got from the previous chapter, no matter how it was structured when you acquired it in R.

We will look at the two main activities performed during the data preparation phase:

  • Data cleansing: This involves identification and treatment of outliers and missing values
  • Data manipulation: Here, the main aim is to make the data structure fit some specific rule, which will let the user employ...

Getting a sense of your data structure with R

By following the recipes given in the previous chapter, you got your data. Everything went smoothly, and you may also already have the data as a data frame object.

However, do you know what your data looks like?

Getting to know your data structure is a crucial step within a data analysis project. It will suggest the appropriate treatment and analysis, and will help you avoid error and redundancy in the coding activity that follows.

In this recipe, we will look at a dataset structure by leveraging the describe() function from the Hmisc package. For further preliminary analysis on your data structure, you can also refer to the data visualization recipes in Chapter 3, Basic Visualization Techniques.
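
As a minimal sketch of this kind of first look, the snippet below inspects a small data frame with base R and then, only if the Hmisc package happens to be installed, calls describe() on it. The gdp_sample data frame and its values are made up for illustration and are not the dataset used in the book.

```r
# Hypothetical toy data frame (invented values, not the book's dataset)
gdp_sample <- data.frame(
  country = c("A", "B", "C", "A"),
  year    = c(2013, 2013, 2013, 2014),
  gdp     = c(100, 120, NA, 105),
  stringsAsFactors = FALSE
)

# Base R overview: column types and a short preview of each column
str(gdp_sample)

# Per-column summary statistics, including NA counts for numeric columns
summary(gdp_sample)

# Number of missing values in each column
missing_per_column <- colSums(is.na(gdp_sample))
print(missing_per_column)

# Richer description from Hmisc, run only when the package is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(gdp_sample))
}
```

The requireNamespace() guard is only there so that the sketch also runs on a machine without Hmisc; in a project that already loads the package, calling describe() directly is enough.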

Getting ready

This example will be built around a dataset provided in the RStudio project related to this book.

You can download it by authenticating your account at http://packtpub.com.

This dataset is named world_gdp_data.csv and stores GDP values for 248...

Preparing your data for analysis with the tidyr package

The tidyr package is another gift from Hadley Wickham. This package provides functions to make your data tidy.

This means that after applying the tidyr package's functions, your data will be arranged as per the following rules:

  • Each column will contain an attribute
  • Each row will contain an observation
  • Each cell will contain a value

These rules will produce a dataset similar to the following one:

[Figure: example of a dataset arranged according to the rules above]

This structure, besides giving you a clearer understanding of your data, will let you work with it more easily.

Furthermore, this structure will let you take full advantage of R's inherently vectorized structure. This recipe will show you how to apply the gather function to a dataset in order to make it comply with the cited rules.

The employed data frame is in the so-called wide format, where each period of observation is stored in columns, with each column representing a year, as follows:

[Figure: the data frame in wide format, with one column per year]
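
The following sketch shows what moving from the wide format to the tidy layout can look like. The wide_gdp data frame, its column names, and its values are hypothetical stand-ins, not the book's data. The gather() call is used when tidyr is installed; otherwise an equivalent base R reshaping is used so that the example still runs.

```r
# Hypothetical wide-format data: one row per country, one column per year
wide_gdp <- data.frame(
  country = c("A", "B"),
  `2013`  = c(100, 120),
  `2014`  = c(105, 118),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

if (requireNamespace("tidyr", quietly = TRUE)) {
  # Stack every column except country into year/gdp pairs
  tidy_gdp <- tidyr::gather(wide_gdp, key = "year", value = "gdp", -country)
} else {
  # Base R fallback producing the same long layout
  year_cols <- setdiff(names(wide_gdp), "country")
  tidy_gdp <- data.frame(
    country = rep(wide_gdp$country, times = length(year_cols)),
    year    = rep(year_cols, each = nrow(wide_gdp)),
    gdp     = unlist(wide_gdp[year_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# One row per country-year observation, one column per attribute
print(tidy_gdp)
```

Each row of tidy_gdp now holds a single observation (a country in a given year), which is the arrangement described by the three rules.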

Getting ready

In order to let...

Detecting and removing missing values

Missing values are values that should have been recorded but, for some reason, were not. They are different from values without meaning, which R represents with NaN (not a number).

Most of us first came to understand missing values through circumstances such as the following one:

> x <- c(1,2,3,NA,4)
> mean(x)
[1] NA

"Oh come on, I know you can do it. Just ignore that useless NA" was probably your reaction, or at least it was mine.

Fortunately, R comes packed with good functions for missing value detection and handling.
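
A short sketch of some of these base R functions, reusing the vector x from the example above. The scores data frame is a hypothetical addition for illustration.

```r
x <- c(1, 2, 3, NA, 4)

# Detect missing values: logical mask, positions, and a quick yes/no check
na_mask      <- is.na(x)
na_positions <- which(na_mask)
has_missing  <- anyNA(x)

# Ask mean() to skip missing values instead of returning NA
mean_without_na <- mean(x, na.rm = TRUE)

# Drop the missing values from the vector
x_clean <- x[!na_mask]

# NA (missing) and NaN (not a number) are distinct:
# is.na() is TRUE for both, is.nan() only for NaN
na_vs_nan <- c(is.na(NaN), is.nan(NA))

# For data frames, complete.cases() flags rows without any NA
scores <- data.frame(id = 1:4, value = c(10, NA, 30, 40))
complete_rows <- scores[complete.cases(scores), ]

print(mean_without_na)
print(complete_rows)
```

Passing na.rm = TRUE leaves the original vector untouched, which is often a safer choice than deleting values up front.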

In this recipe and the following one, we will see two opposite approaches to missing value handling:

  • Removing missing values
  • Simulating missing values by interpolation

I have to warn you that removing missing values is appropriate in only a small number of cases, since it compromises the integrity of your data sources and can greatly reduce the reliability of your results.

Nevertheless, if you are...

Substituting missing values using the mice package

Finding and removing missing values in your dataset is not always a viable alternative, for either operative or methodological reasons. It is often preferable to simulate possible values for missing data and integrate those values within the observed data.

This recipe is based on the mice package by Stef van Buuren. It provides an efficient algorithm for missing value substitution based on the multiple imputation technique.

Note

Multiple imputation technique

The multiple imputation technique is a statistical solution to the problem of missing values.

The main idea behind this technique is to draw possible alternative values for each missing value and then, after a proper analysis of the simulated values, populate the original dataset with synthetic data.

Getting ready

This recipe requires that you install and load the mice package:

install.packages("mice")
library(mice)

For illustrative purposes, we will use the tidy_gdp data frame created...
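
Since the tidy_gdp data frame is not reproduced here, the sketch below illustrates the general workflow on R's built-in airquality dataset instead, which contains missing values in two columns. The number of imputations (m = 5), the predictive mean matching method ("pmm"), and the seed are assumptions chosen for illustration, not settings taken from the recipe. The imputation only runs when mice is installed.

```r
# Count the missing values in the built-in airquality dataset
missing_before <- sum(is.na(airquality))
print(missing_before)

if (requireNamespace("mice", quietly = TRUE)) {
  # Build 5 imputed datasets using predictive mean matching
  imputed <- mice::mice(airquality, m = 5, method = "pmm",
                        seed = 123, printFlag = FALSE)

  # Extract the first completed dataset (observed plus imputed values)
  completed <- mice::complete(imputed, 1)

  # Check how many missing values remain after imputation
  print(sum(is.na(completed)))
}
```

In a real analysis you would usually fit your model on each of the imputed datasets and pool the results, rather than keep a single completed dataset.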

Detecting and removing outliers

Outliers are usually dangerous values for data science activities, since they produce heavy distortions within models and algorithms.

Their detection and exclusion are, therefore, really crucial tasks.

This recipe will show you how to easily perform this task.

We will compute the first and third quartiles of a given population and detect values that lie more than 1.5 times the interquartile range beyond these limits.

You should note that this recipe is feasible only for a univariate quantitative population; different kinds of data will require you to use other outlier-detection methods.

How to do it...

  1. Compute the quantiles using the quantile() function:
    quantiles <- quantile(tidy_gdp_complete$gdp, probs = c(.25, .75))
    
  2. Compute the range value using the IQR() function:
    range <- 1.5 * IQR(tidy_gdp_complete$gdp)
    
  3. Subset the original data by excluding the outliers:
    normal_gdp <- subset(tidy_gdp_complete,
    tidy_gdp_complete$gdp > (quantiles[1] - range) & tidy_gdp_complete$gdp < (quantiles[2] + range...
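
The steps above can be sketched end to end on a small, self-contained example. The toy_gdp data frame and its values are hypothetical and stand in for tidy_gdp_complete. The variable is named fence rather than range here only to avoid masking base R's range() function.

```r
# Hypothetical data with one obviously extreme value
toy_gdp <- data.frame(
  country = letters[1:8],
  gdp     = c(10, 12, 11, 13, 12, 14, 11, 250)
)

# First and third quartiles
quartiles <- quantile(toy_gdp$gdp, probs = c(0.25, 0.75))

# 1.5 times the interquartile range
fence <- 1.5 * IQR(toy_gdp$gdp)

# Lower and upper limits beyond which a value is treated as an outlier
lower_limit <- quartiles[[1]] - fence
upper_limit <- quartiles[[2]] + fence

# Keep only the rows whose gdp lies strictly inside the limits
normal_gdp <- subset(toy_gdp, gdp > lower_limit & gdp < upper_limit)

# Rows flagged as outliers, kept for inspection rather than silently dropped
outliers <- subset(toy_gdp, gdp <= lower_limit | gdp >= upper_limit)

print(normal_gdp)
print(outliers)
```

Keeping the excluded rows in a separate object makes it easier to review what was removed before you trust the filtered data.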


Key benefits

  • 54 useful and practical tasks to improve working systems
  • Includes optimizing performance and reliability or uptime, reporting, system management tools, interfacing to standard data ports, and so on
  • Offers 10-15 real-life, practical improvements for each user type

Description

The requirement of handling complex datasets, performing unprecedented statistical analysis, and providing real-time visualizations to businesses has concerned statisticians and analysts across the globe. RStudio is a useful and powerful tool for statistical analysis that harnesses the power of R for computational statistics, visualization, and data science, in an integrated development environment. This book is a collection of recipes that will help you learn and understand RStudio features so that you can effectively perform statistical analysis and reporting, code editing, and R development. The first few chapters will teach you how to set up your own data analysis project in RStudio, acquire data from different data sources, and manipulate and clean data for analysis and visualization purposes. You'll get hands-on with various data visualization methods using ggplot2, and you will create interactive and multidimensional visualizations with D3.js. Additional recipes will help you optimize your code; implement various statistical models to manage large datasets; perform text analysis and predictive analysis; and master time series analysis, machine learning, forecasting; and so on. In the final few chapters, you'll learn how to create reports from your analytical application with the full range of static and dynamic reporting tools that are available in RStudio so that you can effectively communicate results and even transform them into interactive web applications.

Who is this book for?

This book is targeted at R statisticians, data scientists, and R programmers. Readers with R experience who are looking to take the plunge into statistical computing will find this Cookbook particularly indispensable.

What you will learn

  • Familiarize yourself with the latest advanced R console features
  • Create advanced and interactive graphics
  • Manage your R project and project files effectively
  • Perform reproducible statistical analyses in your R projects
  • Use RStudio to design predictive models for a specific domain-based application
  • Use RStudio to effectively communicate the results of your analyses and even publish them to a blog
  • Put yourself on the frontiers of data science and data monetization in R with all the tools that are needed to effectively communicate your results and even transform your work into a data product

Product Details

Publication date: Apr 29, 2016
Length: 246 pages
Edition: 1st
Language: English
ISBN-13: 9781784396947


Table of Contents

9 Chapters
1. Acquiring Data for Your Project
2. Preparing for Analysis – Data Cleansing and Manipulation
3. Basic Visualization Techniques
4. Advanced and Interactive Visualization
5. Power Programming with R
6. Domain-specific Applications
7. Developing Static Reports
8. Dynamic Reporting and Web Application Development
Index

Customer reviews

Overall rating: 4 out of 5 (2 ratings; 4 star 100%)

Mr. T., Sep 03, 2016, 4 out of 5 stars (Amazon verified review)
Andrea's book provides some useful and well explained recipes on how to produce effective and clear data analyses. Personally I particularly appreciated the Dynamic reporting techniques as well as the interactive visualisation opportunities which were unknown to me so far.

Amazon Customer, Jul 20, 2016, 4 out of 5 stars (Amazon verified review)
The subheading “quick answers to common problem” is the more appropriate description of this book. I am using R for more than fourteen years, in academic and professional environments as well. Even if I’m a kind of “old school” command line coder, I’m continuously intrigued and delighted by the new and fresh packages made available for R. As many of us, from time to time, I felt lost among the packages and in their usage. Andrea’s book can provide some help. It is a real cookbook made up by simple, maybe trivial, ready-to-use recipes. It is the collection of code chunks that everybody should collect: Andrea shared their own. I do not have time to check any update in the visualization techniques, so I appreciated the extensive care on how to communicate the results, with nice graphs, and with some “novelty” as the Sankey or the wordcloud, or via a Shiny app. On the other side, some arguments are just barely and sadly mentioned, in fact they are so extended that deserve entire books; I’m referring to the outliers detection, the parallel computation or the sentiment analysis… This book will not teach you statistics or R. To read and enjoy this book, you need to have at least an average knowledge of R and moreover to had faced some troubles with data analysis and visualization.