RStudio for R Statistical Computing Cookbook: Over 50 practical and useful recipes to help you perform data analysis with R by unleashing every native RStudio feature

Andrea Cirillo
Paperback Apr 2016 246 pages 1st Edition

Chapter 2. Preparing for Analysis – Data Cleansing and Manipulation

In this chapter, we will cover the following topics:

  • Getting a sense of your data structure with R
  • Preparing your data for analysis with the tidyr package
  • Detecting missing values
  • Substituting missing values by interpolation
  • Detecting and removing outliers
  • Performing data filtering activities

Introduction

Some studies estimate that data preparation activities account for 80 percent of the time invested in data science projects.

I know you will not be surprised reading this number. Data preparation is the phase in data science projects where you take your data from the chaotic world around you and fit it into some precise structures and standards.

This is no simple task: it involves a great number of techniques that let you change the structure of your data and ensure you can work with it.

This chapter will show you recipes that should give you the ability to prepare the data you got from the previous chapter, no matter how it was structured when you acquired it in R.

We will look at the two main activities performed during the data preparation phase:

  • Data cleansing: This involves identification and treatment of outliers and missing values
  • Data manipulation: Here, the main aim is to make the data structure fit some specific rule, which will let the user employ...

Getting a sense of your data structure with R

By following the recipes given in the previous chapter, you got your data. Everything went smoothly, and you may also already have the data as a data frame object.

However, do you know what your data looks like?

Getting to know your data structure is a crucial step within a data analysis project. It will suggest the appropriate treatment and analysis, and will help you avoid error and redundancy in the coding activity that follows.

In this recipe, we will look at a dataset structure by leveraging the describe() function from the Hmisc package. For further preliminary analysis on your data structure, you can also refer to the data visualization recipes in Chapter 3, Basic Visualization Techniques.
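
As a minimal sketch of this kind of first inspection, the following uses a small made-up data frame (gdp_sample and its values are hypothetical, not the book's dataset). It relies on base R's str() and summary(), and calls describe() only if the Hmisc package happens to be installed:

```r
# Hypothetical toy data, used only to illustrate the inspection functions
gdp_sample <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(100, 200, NA),
  stringsAsFactors = FALSE
)

# Compact overview: column types and the first values of each column
str(gdp_sample)

# Per-column summary statistics, including a count of NA values
summary(gdp_sample)

# describe() from Hmisc gives a richer report; run it only when available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(gdp_sample))
}
```

Starting with str() is a cheap way to confirm that each column was read with the type you expected before moving on to more detailed reports.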

Getting ready

This example will be built around a dataset provided in the RStudio project related to this book.

You can download it by authenticating your account at http://packtpub.com.

This dataset is named world_gdp_data.csv and stores GDP values for 248...

Preparing your data for analysis with the tidyr package

The tidyr package is another gift from Hadley Wickham. This package provides functions to make your data tidy.

This means that after applying the tidyr package's functions, your data will be arranged according to the following rules:

  • Each column will contain an attribute
  • Each row will contain an observation
  • Each cell will contain a value

These rules will produce a dataset similar to the following one:

[Figure: example of a tidy dataset]

This structure, besides giving you a clearer understanding of your data, will let you work with it more easily.

Furthermore, this structure will let you take full advantage of R's vectorized operations. This recipe will show you how to apply the gather function to a dataset in order to make it comply with the cited rules.

The employed data frame is in the so-called wide format, where each period of observation is stored in columns, with each column representing a year, as follows:

[Figure: the data frame in wide format, one column per year]
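
The wide-to-tidy transformation can be sketched as follows. The wide_gdp data frame, its values, and the name tidy_gdp_sketch are made up for illustration and are not the book's dataset; the sketch assumes the tidyr package is installed:

```r
library(tidyr)

# Hypothetical wide-format data: one row per country, one column per year
wide_gdp <- data.frame(
  country = c("A", "B"),
  `2013`  = c(100, 200),
  `2014`  = c(110, 210),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse every column except country into key/value pairs:
# the former column names go to "year", the cell values go to "gdp"
tidy_gdp_sketch <- gather(wide_gdp, key = "year", value = "gdp", -country)

print(tidy_gdp_sketch)
```

Each row of the result now holds one observation (a country in a given year). In recent tidyr releases gather() is superseded by pivot_longer(), although gather() remains available.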

Getting ready

In order to let...

Detecting and removing missing values

Missing values are values that should have been recorded but, for some reason, weren't. R represents them as NA. They differ from values without meaning, which R represents as NaN (not a number).

Most of us first met missing values in circumstances such as the following one:

> x <- c(1,2,3,NA,4)
> mean(x)
[1] NA

"Oh come on, I know you can do it. Just ignore that useless NA" was probably your reaction, or at least it was mine.

Fortunately, R comes packed with good functions for missing value detection and handling.
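
A minimal sketch of some of these base R functions, applied to the vector from the example above (the data frame df is a hypothetical addition for illustration):

```r
x <- c(1, 2, 3, NA, 4)

# Detect missing values: TRUE marks each NA position
is.na(x)

# Count the missing values
sum(is.na(x))

# Ask mean() to skip NA values instead of returning NA
mean(x, na.rm = TRUE)

# NA and NaN are distinct: is.nan() is TRUE only for NaN
is.nan(c(NA, NaN))

# In a data frame, complete.cases() flags rows without any NA
df <- data.frame(id = 1:5, value = x)
complete_rows <- df[complete.cases(df), ]
print(complete_rows)
```

Passing na.rm = TRUE keeps the original data untouched, whereas subsetting with complete.cases() drops whole rows, so the two approaches are not interchangeable.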

In this recipe and the following one, we will see two opposite approaches to missing value handling:

  • Removing missing values
  • Simulating missing values by interpolation

I have to warn you that removing missing values is appropriate in only a small number of cases, since it compromises the integrity of your data sources and can greatly reduce the reliability of your results.

Nevertheless, if you are...

Substituting missing values using the mice package

Finding and removing missing values in your dataset is not always a viable alternative, for either operative or methodological reasons. It is often preferable to simulate possible values for missing data and integrate those values within the observed data.

This recipe is based on the mice package by Stef van Buuren. It provides an efficient algorithm for missing value substitution based on the multiple imputation technique.

Note

Multiple imputation technique

The multiple imputation technique is a statistical solution to the problem of missing values.

The main idea behind this technique is to draw possible alternative values for each missing value and then, after a proper analysis of the simulated values, populate the original dataset with synthetic data.

Getting ready

This recipe requires that you install and load the mice package:

install.packages("mice")
library(mice)

For illustrative purposes, we will use the tidy_gdp data frame created...
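
As a rough sketch of the multiple imputation workflow, the following uses the nhanes example dataset that ships with mice rather than the tidy_gdp data frame, which is not reproduced here. The number of imputations and the seed are arbitrary choices made for illustration:

```r
library(mice)

# nhanes is a small example dataset bundled with mice; it contains NA values
anyNA(nhanes)

# Generate m = 5 alternative sets of imputed values
# (printFlag = FALSE silences the iteration log, seed makes the run repeatable)
imputed <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)

# Fill the original data with the first set of imputed values
completed <- mice::complete(imputed, 1)

# The completed data frame should no longer contain NA values
anyNA(completed)
```

In a real analysis you would usually fit your model on all imputed datasets and pool the results rather than keep a single completed dataset.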

Detecting and removing outliers

Outliers are usually dangerous values for data science activities, since they produce heavy distortions within models and algorithms.

Their detection and exclusion is, therefore, a really crucial task.

This recipe will show you how to easily perform this task.

We will compute the first and third quartiles of a given population and detect values that fall far from these limits.

You should note that this recipe is feasible only for a univariate quantitative population; other kinds of data will require you to use other outlier-detection methods.

How to do it...

  1. Compute the quantiles using the quantile() function:
    quantiles <- quantile(tidy_gdp_complete$gdp, probs = c(.25, .75))
    
  2. Compute the range value using the IQR() function:
    range <- 1.5 * IQR(tidy_gdp_complete$gdp)
    
  3. Subset the original data by excluding the outliers:
    normal_gdp <- subset(tidy_gdp_complete,
    tidy_gdp_complete$gdp > (quantiles[1] - range) & tidy_gdp_complete$gdp < (quantiles[2] + range...
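
The same steps can be sketched on a small self-contained vector. The gdp values and the names quartiles, fence, is_outlier, and normal_values below are made up for illustration and are not taken from tidy_gdp_complete:

```r
# Hypothetical values with one obvious outlier
gdp <- c(10, 12, 11, 13, 12, 14, 11, 95)

# First and third quartiles
quartiles <- unname(quantile(gdp, probs = c(.25, .75)))

# 1.5 times the interquartile range, the usual distance for the fences
fence <- 1.5 * IQR(gdp)

# Flag values lying below the lower fence or above the upper fence
is_outlier <- gdp < (quartiles[1] - fence) | gdp > (quartiles[2] + fence)

# Keep only the values inside the fences
normal_values <- gdp[!is_outlier]

print(gdp[is_outlier])
print(normal_values)
```

Naming the distance fence rather than range avoids masking base R's range() function in the session.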


Key benefits

  • 54 useful and practical tasks to improve working systems
  • Includes optimizing performance and reliability or uptime, reporting, system management tools, interfacing to standard data ports, and so on
  • Offers 10-15 real-life, practical improvements for each user type

Description

The requirement of handling complex datasets, performing unprecedented statistical analysis, and providing real-time visualizations to businesses has concerned statisticians and analysts across the globe. RStudio is a useful and powerful tool for statistical analysis that harnesses the power of R for computational statistics, visualization, and data science, in an integrated development environment. This book is a collection of recipes that will help you learn and understand RStudio features so that you can effectively perform statistical analysis and reporting, code editing, and R development. The first few chapters will teach you how to set up your own data analysis project in RStudio, acquire data from different data sources, and manipulate and clean data for analysis and visualization purposes. You'll get hands-on with various data visualization methods using ggplot2, and you will create interactive and multidimensional visualizations with D3.js. Additional recipes will help you optimize your code, implement various statistical models to manage large datasets, perform text analysis and predictive analysis, and master time series analysis, machine learning, and forecasting. In the final few chapters, you'll learn how to create reports from your analytical application with the full range of static and dynamic reporting tools that are available in RStudio so that you can effectively communicate results and even transform them into interactive web applications.

Who is this book for?

This book is targeted at R statisticians, data scientists, and R programmers. Readers with R experience who are looking to take the plunge into statistical computing will find this Cookbook particularly indispensable.

What you will learn

  • Familiarize yourself with the latest advanced R console features
  • Create advanced and interactive graphics
  • Manage your R project and project files effectively
  • Perform reproducible statistical analyses in your R projects
  • Use RStudio to design predictive models for a specific domain-based application
  • Use RStudio to effectively communicate your analysis results and even publish them to a blog
  • Put yourself on the frontiers of data science and data monetization in R with all the tools that are needed to effectively communicate your results and even transform your work into a data product

Product Details

Publication date: Apr 29, 2016
Length: 246 pages
Edition: 1st
Language: English
ISBN-13: 9781784391034

Table of Contents

9 Chapters
1. Acquiring Data for Your Project
2. Preparing for Analysis – Data Cleansing and Manipulation
3. Basic Visualization Techniques
4. Advanced and Interactive Visualization
5. Power Programming with R
6. Domain-specific Applications
7. Developing Static Reports
8. Dynamic Reporting and Web Application Development
Index

Customer reviews

Rating distribution
4 out of 5 stars (2 ratings)
5 star 0%
4 star 100%
3 star 0%
2 star 0%
1 star 0%
Mr. T., Sep 03, 2016
4 out of 5 stars
Andrea's book provides some useful and well explained recipes on how to produce effective and clear data analyses. Personally I particularly appreciated the Dynamic reporting techniques as well as the interactive visualisation opportunities which were unknown to me so far.
Amazon Verified review
Amazon Customer, Jul 20, 2016
4 out of 5 stars
The subheading “quick answers to common problem” is the more appropriate description of this book. I am using R for more than fourteen years, in academic and professional environments as well. Even if I’m a kind of “old school” command line coder, I’m continuously intrigued and delighted by the new and fresh packages made available for R. As many of us, from time to time, I felt lost among the packages and in their usage. Andrea’s book can provide some help. It is a real cookbook made up by simple, maybe trivial, ready-to-use recipes. It is the collection of code chunks that everybody should collect: Andrea shared their own. I do not have time to check any update in the visualization techniques, so I appreciated the extensive care on how to communicate the results, with nice graphs, and with some “novelty” as the Sankey or the wordcloud, or via a Shiny app. On the other side, some arguments are just barely and sadly mentioned, in fact they are so extended that deserve entire books; I’m referring to the outliers detection, the parallel computation or the sentiment analysis… This book will not teach you statistics or R. To read and enjoy this book, you need to have at least an average knowledge of R and moreover to had faced some troubles with data analysis and visualization.
Amazon Verified review