R Data Analysis Cookbook, Second Edition

Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback
Published in Sep 2017
Publisher Packt
ISBN-13 9781787124479
Length 560 pages
Edition 2nd Edition
Authors (3): Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan

Imputing data

Missing values are often the first obstacle in data analysis and predictive modeling. In most statistical analysis methods, list-wise deletion is the default way of handling missing values, as shown in the earlier recipe. However, neither deletion nor simple replacement is quite good enough: deletion can lead to information loss, and replacement with a simple mean or median doesn't take into account the uncertainty in the missing values.

Hence, this recipe shows you multivariate imputation techniques that handle missing values by predicting them.
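To make the two drawbacks of deletion and mean replacement concrete, here is a small sketch in base R. The data frame toy and its values are invented for illustration and are not part of the housing dataset.

```r
# Toy data (hypothetical values, not from the housing dataset)
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(10, NA, 30, 40, 50, 60)
)

# List-wise deletion: every row with at least one NA is dropped
deleted <- na.omit(toy)
rows_lost <- nrow(toy) - nrow(deleted)

# Mean imputation: every NA in x gets the same value, so the spread shrinks
x_mean_imputed <- ifelse(is.na(toy$x), mean(toy$x, na.rm = TRUE), toy$x)
sd_observed <- sd(toy$x, na.rm = TRUE)
sd_imputed  <- sd(x_mean_imputed)

print(rows_lost)
print(c(observed = sd_observed, mean_imputed = sd_imputed))
```

Deletion discards rows that still hold usable values in other columns, and mean imputation makes the column look less variable than the observed data suggests. These are the two problems that prediction-based imputation tries to reduce.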

Getting ready

Make sure that the housing-with-missing-value.csv file from the code files of this chapter is in your R working directory.

You should also install and load the mice package, and then read the data, using the following commands:

> install.packages("mice")
> library(mice)
> housingData <- read.csv("housing-with-missing-value.csv", header = TRUE, stringsAsFactors = FALSE)

How to do it...

Follow these steps to impute data:

  1. Perform multivariate imputation:
# impute only the two columns that have missing values
> columns <- c("ptratio", "rad")

> imputed_Data <- mice(housingData[, names(housingData) %in% columns], m = 5, maxit = 50, method = 'pmm', seed = 500)

> summary(imputed_Data)
  2. Generate complete data:
> completeData <- complete(imputed_Data)
  3. Replace the original columns in housingData with the imputed values:
> housingData$ptratio <- completeData$ptratio
> housingData$rad <- completeData$rad
  4. Check for missing values:
> anyNA(housingData)

How it works...

As we already know from our earlier recipe, the housing dataset contains two columns, ptratio and rad, with missing values.
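If you want to confirm which columns contain missing values before imputing, a count of NA values per column is enough. The sketch below uses a small stand-in data frame named df with invented values; with the recipe's data you would pass housingData instead.

```r
# Stand-in for housingData; the values are made up for illustration
df <- data.frame(
  ptratio = c(15.3, NA, 17.8, 18.7, NA),
  rad     = c(1, 2, NA, 3, 3),
  tax     = c(296, 242, 242, 222, 222)
)

# Number of missing values in each column
na_counts <- colSums(is.na(df))
print(na_counts)

# Names of the columns that have at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

The resulting character vector could be used in place of a hand-typed list of column names when selecting the columns to pass to mice().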

The mice package in R takes a predictive approach. It assumes that the data is Missing at Random (MAR), meaning that the probability of a value being missing depends only on the observed data, and it creates multivariate imputations via chained equations to account for the uncertainty in the missing values. Imputation takes just two steps: mice() builds the imputation model and complete() generates the completed data.
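The two-step workflow can be tried without the housing file. The following sketch assumes the mice package is installed and uses nhanes, a small example dataset that ships with it. The parameter values are chosen for illustration only.

```r
library(mice)

# nhanes is a small example dataset shipped with mice (25 rows, 4 columns)
imp <- mice(nhanes, m = 5, maxit = 5, method = "pmm",
            seed = 500, printFlag = FALSE)

# complete() returns the first completed dataset by default
first <- complete(imp)

# Any of the m datasets can be requested by number
third <- complete(imp, 3)

# "long" stacks all m completed datasets, adding .imp and .id columns
stacked <- complete(imp, "long")

print(anyNA(first))
print(dim(stacked))
```

Because mice() keeps all m imputed datasets, comparing a few of them is a quick way to see how much the imputed values vary from one imputation to the next.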

The mice() function takes the following parameters:

  • m: The number of imputed datasets that mice() creates internally. The default is five.
  • maxit: The number of iterations of the chained-equations algorithm used to impute the missing values. The default is five.
  • method: The imputation method. When no method is specified, the default depends on the measurement level of the target column and is taken from the defaultMethod argument, where defaultMethod = c("pmm", "logreg", "polyreg", "polr"):
      • pmm: Predictive mean matching (numeric column).
      • logreg: Logistic regression (factor column, two levels).
      • polyreg: Polytomous logistic regression (unordered factor column, more than two levels).
      • polr: Proportional odds model (ordered factor column, more than two levels).
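To see which default method mice() would pick for each column, you can run it with maxit = 0, which sets up the imputation without iterating, and inspect the method element of the result. This sketch assumes the mice package is installed and uses its bundled nhanes2 dataset rather than the housing data.

```r
library(mice)

# nhanes2 ships with mice: bmi and chl are numeric, hyp is a two-level factor,
# and age is a factor with no missing values
init <- mice(nhanes2, maxit = 0, printFlag = FALSE)

# One entry per column; columns without missing values get an empty string
print(init$method)
```

Checking the chosen methods this way before a long run helps catch columns that were read with the wrong type, for example a categorical variable stored as a number.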

We have used predictive mean matching (pmm) for this recipe to impute the missing values in the dataset.
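The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the algorithm as implemented in mice: the data is invented, the donor count k is chosen arbitrarily, and the sketch omits the random draw of regression parameters that a full implementation uses to reflect uncertainty.

```r
set.seed(500)

# Toy data (invented): y has missing values, x is fully observed
n <- 40
x <- seq_len(n)
y <- 2 * x + rnorm(n)
y[c(5, 12, 27, 33)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)            # regression on observed cases
pred <- predict(fit, newdata = data.frame(x = x))

k <- 5  # number of candidate donors per missing case
y_imputed <- y
for (i in which(!obs)) {
  # Observed cases whose predicted value is closest to this case's prediction
  donors <- order(abs(pred[obs] - pred[i]))[seq_len(k)]
  # Borrow the observed y of one randomly chosen donor
  y_imputed[i] <- y[obs][donors[sample.int(k, 1)]]
}

print(y_imputed[!obs])
```

Every imputed value is copied from an observed case, so the imputations always stay within the range of values that actually occur in the data.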

The anyNA() function returns TRUE if the dataset contains at least one missing value (NA) and FALSE otherwise.

There's more...

Previously, we used the impute() function from the Hmisc package to impute missing values using simple statistics (mean, median, and mode). However, Hmisc also has the aregImpute() function, which performs multiple imputation using additive regression, bootstrapping, and predictive mean matching. Run it on the data as originally loaded, before the imputed columns are written back:

> library(Hmisc)
> impute_arg <- aregImpute(~ ptratio + rad, data = housingData, n.impute = 5)

> impute_arg

aregImpute() automatically identifies the type of each variable and treats it accordingly. The n.impute parameter sets the number of multiple imputations, where five is recommended.

Printing impute_arg shows the R² values with which the missing values of each variable were predicted. The higher the value, the better the predictions.

Check imputed variable values using the following command:

> impute_arg$imputed$rad