RStudio for R Statistical Computing Cookbook: Over 50 practical and useful recipes to help you perform data analysis with R by unleashing every native RStudio feature

By Andrea Cirillo
Paperback, Apr 2016, 246 pages, 1st Edition

Chapter 2. Preparing for Analysis – Data Cleansing and Manipulation

In this chapter, we will cover the following topics:

  • Getting a sense of your data structure with R
  • Preparing your data for analysis with the tidyr package
  • Detecting missing values
  • Substituting missing values by interpolation
  • Detecting and removing outliers
  • Performing data filtering activities

Introduction

Some studies estimate that data preparation activities account for 80 percent of the time invested in data science projects.

I know you will not be surprised reading this number. Data preparation is the phase in data science projects where you take your data from the chaotic world around you and fit it into some precise structures and standards.

This is far from a simple task: it involves a large number of techniques that let you change the structure of your data and make sure you can work with it.

The recipes in this chapter should give you the ability to prepare the data you acquired in the previous chapter, no matter how it was structured when you loaded it into R.

We will look at the two main activities performed during the data preparation phase:

  • Data cleansing: This involves identification and treatment of outliers and missing values
  • Data manipulation: Here, the main aim is to make the data structure fit some specific rule, which will let the user employ...

Getting a sense of your data structure with R

By following the recipes given in the previous chapter, you got your data. Everything went smoothly, and you may also already have the data as a data frame object.

However, do you know what your data looks like?

Getting to know your data structure is a crucial step within a data analysis project. It will suggest the appropriate treatment and analysis, and will help you avoid error and redundancy in the coding activity that follows.

In this recipe, we will look at a dataset structure by leveraging the describe() function from the Hmisc package. For further preliminary analysis on your data structure, you can also refer to the data visualization recipes in Chapter 3, Basic Visualization Techniques.
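
As a minimal sketch of this kind of first inspection, the snippet below builds a small, made-up data frame and looks at it with base R's str() and summary(). The gdp_sample object and its values are assumptions for illustration only, not the book's dataset. If the Hmisc package happens to be installed, describe() is called as well.

```r
# Hypothetical toy data; the book's recipe uses world_gdp_data.csv instead
gdp_sample <- data.frame(
  country = c("Italy", "France", "Spain", "Italy"),
  year    = c(2013, 2013, 2013, 2014),
  gdp     = c(2.13, 2.81, NA, 2.15),
  stringsAsFactors = FALSE
)

# Column types and the first few values of each column
str(gdp_sample)

# Per-column summaries; numeric columns also report the number of NAs
summary(gdp_sample)

# Count missing values per column
na_counts <- colSums(is.na(gdp_sample))
print(na_counts)

# describe() reports, among other things, the number of missing and
# distinct values per variable; it is only called if Hmisc is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(gdp_sample))
}
```

Even on a toy example like this, the output shows at a glance which columns are numeric, which are text, and where values are missing.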

Getting ready

This example will be built around a dataset provided in the RStudio project related to this book.

You can download it by authenticating your account at http://packtpub.com.

This dataset is named world_gdp_data.csv and stores GDP values for 248...

Preparing your data for analysis with the tidyr package

The tidyr package is another gift from Hadley Wickham. This package provides functions to make your data tidy.

This means that after applying the tidyr package's functions, your data will be arranged as per the following rules:

  • Each column will contain an attribute
  • Each row will contain an observation
  • Each cell will contain a value

These rules will produce a dataset similar to the following one:

[Image: example of a dataset arranged according to these rules]

This structure, besides giving you a clearer understanding of your data, will let you work with it more easily.

Furthermore, this structure will let you take full advantage of R's vectorized nature. This recipe will show you how to apply the gather() function to a dataset in order to make it comply with the cited rules.

The employed data frame is in the so-called wide format, where each period of observation is stored in columns, with each column representing a year, as follows:

[Image: the data frame in wide format, with one column per year]
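
A minimal sketch of the wide-to-long transformation described above, assuming the tidyr package is installed. The wide_gdp data frame, its country names, and its values are made up for illustration and are not the book's data.

```r
library(tidyr)

# Hypothetical wide-format data: one column per year
wide_gdp <- data.frame(
  country = c("Italy", "France"),
  `2013`  = c(2.13, 2.81),
  `2014`  = c(2.15, 2.85),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse every column except country into key/value pairs:
# the former column names go into "year", the cell values into "gdp"
tidy_sample <- gather(wide_gdp, key = "year", value = "gdp", -country)

# The year labels come from column names, so they arrive as text
tidy_sample$year <- as.integer(tidy_sample$year)

print(tidy_sample)
```

The result has one row per country and year, which matches the three rules listed earlier. In more recent tidyr releases, pivot_longer() covers the same reshaping, while gather() remains available.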

Getting ready

In order to let...

Detecting and removing missing values

Missing values are values that should have been recorded but, for some reason, were not. They are different from values without meaning, which R represents with NaN (not a number).

Most of us first came across missing values in circumstances such as the following one:

> x <- c(1,2,3,NA,4)
> mean(x)
[1] NA

"Oh come on, I know you can do it. Just ignore that useless NA" was probably your reaction, or at least it was mine.

Fortunately, R comes packed with good functions for missing value detection and handling.
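
As a small sketch of those functions, the following base R calls detect the NA in the vector shown above and work around it. The variable names are chosen here for illustration only.

```r
x <- c(1, 2, 3, NA, 4)

# Flag missing entries element by element
is.na(x)

# Position and count of the missing entries
missing_pos   <- which(is.na(x))
missing_count <- sum(is.na(x))

# Ask mean() to skip missing values instead of returning NA
mean_without_na <- mean(x, na.rm = TRUE)

# Drop the missing entries altogether
x_clean <- x[!is.na(x)]

# For data frames, complete.cases() marks rows with no missing values
df <- data.frame(id = 1:5, value = x)
df_complete <- df[complete.cases(df), ]
```

Note that na.rm = TRUE only changes how this one call treats the NA, while subsetting with !is.na() or complete.cases() actually discards data.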

In this recipe and the following one, we will see two opposite approaches to missing value handling:

  • Removing missing values
  • Simulating missing values by interpolation

I have to warn you that removing missing values is appropriate only in a small number of cases, since it compromises the integrity of your data sources and can greatly reduce the reliability of your results.

Nevertheless, if you are...

Substituting missing values using the mice package

Finding and removing missing values in your dataset is not always a viable alternative, for either operative or methodological reasons. It is often preferable to simulate possible values for missing data and integrate those values within the observed data.

This recipe is based on the mice package by Stef van Buuren. It provides an efficient algorithm for missing value substitution based on the multiple imputation technique.

Note

Multiple imputation technique

The multiple imputation technique is a statistical solution to the problem of missing values.

The main idea behind this technique is to draw possible alternative values for each missing value and then, after a proper analysis of the simulated values, to populate the original dataset with synthetic data.

Getting ready

This recipe requires that you install and load the mice package:

install.packages("mice")
library(mice)
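
To get a feel for the package before applying it to your own data, here is a minimal sketch that runs on nhanes, a small example dataset shipped with mice. It is not the book's tidy_gdp data, and the choices of m = 5 and seed = 123 are arbitrary assumptions made for this illustration.

```r
library(mice)

# nhanes is a small example dataset shipped with mice; it contains NAs
missing_before <- sum(is.na(mice::nhanes))

# Draw m = 5 alternative sets of imputed values;
# seed makes the draws reproducible, printFlag = FALSE silences the log
imputed <- mice(mice::nhanes, m = 5, seed = 123, printFlag = FALSE)

# Extract the first of the five completed datasets
nhanes_filled <- mice::complete(imputed, 1)
missing_after <- sum(is.na(nhanes_filled))

print(c(before = missing_before, after = missing_after))
```

The call to complete() is written with the mice:: prefix because tidyr exports a function with the same name.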

For illustrative purposes, we will use the tidy_gdp data frame created...

Detecting and removing outliers

Outliers are usually dangerous values for data science activities, since they produce heavy distortions within models and algorithms.

Their detection and exclusion is, therefore, a really crucial task.

This recipe will show you how to easily perform this task.

We will compute the first and third quartiles of a given population and detect values that lie far from these fixed limits.

You should note that this recipe is feasible only for a univariate quantitative population, while different kinds of data will require you to use other outlier-detection methods.

How to do it...

  1. Compute the quantiles using the quantile() function:
    quantiles <- quantile(tidy_gdp_complete$gdp, probs = c(.25, .75))
    
  2. Compute the range value using the IQR() function:
    range <- 1.5 * IQR(tidy_gdp_complete$gdp)
    
  3. Subset the original data by excluding the outliers:
    normal_gdp <- subset(tidy_gdp_complete,
    tidy_gdp_complete$gdp > (quantiles[1] - range) & tidy_gdp_complete$gdp < (quantiles[2] + range...
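
The three steps can be tried end to end on a small, made-up data frame. This is only a sketch: gdp_sample and its values are hypothetical stand-ins for tidy_gdp_complete, and the outliers object is added here for illustration.

```r
# Hypothetical stand-in for tidy_gdp_complete, with one extreme value
gdp_sample <- data.frame(
  country = c("A", "B", "C", "D", "E", "F", "G"),
  gdp     = c(10, 12, 11, 13, 12, 11, 100)
)

# Step 1: first and third quartiles
quantiles <- quantile(gdp_sample$gdp, probs = c(.25, .75))

# Step 2: tolerance around the quartiles, 1.5 times the interquartile range
range <- 1.5 * IQR(gdp_sample$gdp)

# Step 3: keep only rows whose gdp lies strictly inside the two limits
normal_gdp <- subset(
  gdp_sample,
  gdp > (quantiles[1] - range) & gdp < (quantiles[2] + range)
)

# Rows that were excluded as outliers
outliers <- gdp_sample[!(gdp_sample$country %in% normal_gdp$country), ]

print(normal_gdp)
print(outliers)
```

Here the quartiles are 11 and 12.5, so the limits are 8.75 and 14.75 and only the row with a gdp of 100 is excluded.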

Key benefits

  • 54 useful and practical tasks to improve working systems
  • Includes optimizing performance and reliability or uptime, reporting, system management tools, interfacing to standard data ports, and so on
  • Offers 10-15 real-life, practical improvements for each user type

Description

The requirement of handling complex datasets, performing unprecedented statistical analysis, and providing real-time visualizations to businesses has concerned statisticians and analysts across the globe. RStudio is a useful and powerful tool for statistical analysis that harnesses the power of R for computational statistics, visualization, and data science, in an integrated development environment. This book is a collection of recipes that will help you learn and understand RStudio features so that you can effectively perform statistical analysis and reporting, code editing, and R development. The first few chapters will teach you how to set up your own data analysis project in RStudio, acquire data from different data sources, and manipulate and clean data for analysis and visualization purposes. You'll get hands-on with various data visualization methods using ggplot2, and you will create interactive and multidimensional visualizations with D3.js. Additional recipes will help you optimize your code; implement various statistical models to manage large datasets; perform text analysis and predictive analysis; and master time series analysis, machine learning, forecasting; and so on. In the final few chapters, you'll learn how to create reports from your analytical application with the full range of static and dynamic reporting tools that are available in RStudio so that you can effectively communicate results and even transform them into interactive web applications.

Who is this book for?

This book is targeted at R statisticians, data scientists, and R programmers. Readers with R experience who are looking to take the plunge into statistical computing will find this Cookbook particularly indispensable.

What you will learn

  • Familiarize yourself with the latest advanced R console features
  • Create advanced and interactive graphics
  • Manage your R project and project files effectively
  • Perform reproducible statistical analyses in your R projects
  • Use RStudio to design predictive models for a specific domain-based application
  • Use RStudio to effectively communicate your analysis results and even publish them to a blog
  • Put yourself on the frontiers of data science and data monetization in R with all the tools that are needed to effectively communicate your results and even transform your work into a data product

Product Details

Publication date: Apr 29, 2016
Length: 246 pages
Edition: 1st
Language: English
ISBN-13: 9781784391034

Table of Contents

1. Acquiring Data for Your Project
2. Preparing for Analysis – Data Cleansing and Manipulation
3. Basic Visualization Techniques
4. Advanced and Interactive Visualization
5. Power Programming with R
6. Domain-specific Applications
7. Developing Static Reports
8. Dynamic Reporting and Web Application Development
Index

Customer reviews

Rating: 4 out of 5 (2 Ratings; 4 star: 100%)

Mr. T., Sep 03, 2016, 4 out of 5 (Amazon Verified review)
Andrea's book provides some useful and well explained recipes on how to produce effective and clear data analyses. Personally I particularly appreciated the Dynamic reporting techniques as well as the interactive visualisation opportunities which were unknown to me so far.

Amazon Customer, Jul 20, 2016, 4 out of 5 (Amazon Verified review)
The subheading “quick answers to common problem” is the more appropriate description of this book.

I am using R for more than fourteen years, in academic and professional environments as well. Even if I’m a kind of “old school” command line coder, I’m continuously intrigued and delighted by the new and fresh packages made available for R. As many of us, from time to time, I felt lost among the packages and in their usage. Andrea’s book can provide some help. It is a real cookbook made up by simple, maybe trivial, ready-to-use recipes. It is the collection of code chunks that everybody should collect: Andrea shared their own.

I do not have time to check any update in the visualization techniques, so I appreciated the extensive care on how to communicate the results, with nice graphs, and with some “novelty” as the Sankey or the wordcloud, or via a Shiny app. On the other side, some arguments are just barely and sadly mentioned, in fact they are so extended that deserve entire books; I’m referring to the outliers detection, the parallel computation or the sentiment analysis…

This book will not teach you statistics or R. To read and enjoy this book, you need to have at least an average knowledge of R and moreover to had faced some troubles with data analysis and visualization.