You're reading from R Data Analysis Cookbook, Second Edition Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787124479

Length 560 pages

Edition 2nd Edition

Languages

Tools

MongoDB

Concepts

Data Analysis

Authors (3):

Kuntal Ganguly

Shanthi Viswanathan

Viswa Viswanathan

Replacing missing values with the mean

When you disregard cases with any missing values, you lose the useful information that the non-missing values in those cases convey. You may sometimes want to impute reasonable values (those that will not skew the results of the analysis very much) for the missing values.

Getting ready

Download the missing-data.csv file and store it in your R environment's working directory.

How to do it...

Read data and replace missing values:

> dat <- read.csv("missing-data.csv", na.strings = "")
> dat$Income.imp.mean <- ifelse(is.na(dat$Income), mean(dat$Income, na.rm = TRUE), dat$Income)

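After this, the new Income.imp.mean column holds the mean of the non-missing Income values wherever Income was NA, and the original value everywhere else. The Income column itself is left unchanged.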

How it works...

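The preceding ifelse() function works element by element. Its first argument is the test, is.na(dat$Income). Where the test is TRUE, ifelse() returns the second argument (the mean of the non-missing values); where it is FALSE, it returns the corresponding element of the third argument (the original Income value).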
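The element-by-element behavior is easier to see on a small vector. The following is a minimal sketch, assuming a made-up vector named income that stands in for dat$Income; the values are illustrative and do not come from missing-data.csv.

```r
# Toy vector standing in for dat$Income (values are made up for illustration)
income <- c(100, NA, 300, NA, 200)

# Mean of the non-missing values: (100 + 300 + 200) / 3 = 200
income.mean <- mean(income, na.rm = TRUE)

# ifelse() evaluates the test for each element:
# TRUE  -> take income.mean, FALSE -> keep the original element
income.imp <- ifelse(is.na(income), income.mean, income)

print(income.imp)  # the two NA positions now hold 200
```

Because the mean is computed once, before any replacement, every missing position receives the same value.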

There's more...

You cannot impute the mean when a categorical variable has missing values, so you need a different approach. Even for numeric variables, we might sometimes not want to impute the mean for missing values. We discuss an often-used approach here.

Imputing random values sampled from non-missing values

If you want to impute random values sampled from the non-missing values of the variable, you can use the following two functions:

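rand.impute <- function(a) {
  missing <- is.na(a)                 # TRUE where a value is missing
  n.missing <- sum(missing)
  a.obs <- a[!missing]                # observed (non-missing) values
  imputed <- a
  # Sample by index: if a.obs were a single number >= 1,
  # sample(a.obs, ...) would draw from 1:a.obs instead
  idx <- sample.int(length(a.obs), n.missing, replace = TRUE)
  imputed[missing] <- a.obs[idx]
  return(imputed)
}

random.impute.data.frame <- function(dat, cols) {
  nms <- names(dat)
  for (col in cols) {                 # cols holds column indices
    name <- paste(nms[col], ".imputed", sep = "")
    dat[name] <- rand.impute(dat[, col])
  }
  dat                                 # return the modified copy
}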

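With these two functions in place, you can use the following to impute random values for both Income and Phone_type (columns 1 and 2). The function returns a modified copy of the data frame, so assign the result to keep the new columns:

> dat <- read.csv("missing-data.csv", na.strings = "")
> dat <- random.impute.data.frame(dat, c(1, 2))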
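To try the idea without the CSV file, the following sketch applies the same sampling approach to a small made-up data frame. The names toy and impute.sketch are hypothetical, the values are invented for illustration, and set.seed() is added only so that the random draws are repeatable.

```r
set.seed(42)  # fix the random draws so the run is repeatable

# Toy data frame standing in for missing-data.csv (values are made up)
toy <- data.frame(
  Income     = c(89800, NA, 47500, NA, 45100),
  Phone_type = c("Android", "iPhone", NA, "Android", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: compact restatement of the sampling idea for one vector
impute.sketch <- function(a) {
  missing <- is.na(a)
  a.obs <- a[!missing]
  # Draw positions, then look up values, so a single observed number is safe
  a[missing] <- a.obs[sample.int(length(a.obs), sum(missing), replace = TRUE)]
  a
}

# Add one ".imputed" column per source column, leaving the originals intact
for (nm in c("Income", "Phone_type")) {
  toy[[paste0(nm, ".imputed")]] <- impute.sketch(toy[[nm]])
}

print(toy)
```

The same helper handles the numeric and the categorical column, because it only copies observed values into the missing positions and never computes anything from them.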
