When you discard cases that have a missing value for any variable, you also lose the useful information carried by the non-missing values in those cases. Instead, you may sometimes want to impute reasonable values (ones that will not skew the results of the analysis very much) for the missing values.
Replacing missing values with the mean
Getting ready
Download the missing-data.csv file and store it in your R environment's working directory.
How to do it...
Read data and replace missing values:
> dat <- read.csv("missing-data.csv", na.strings = "")
> dat$Income.imp.mean <- ifelse(is.na(dat$Income), mean(dat$Income, na.rm=TRUE), dat$Income)
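After this, the new Income.imp.mean column holds the mean of the observed Income values wherever Income was NA, and the original value everywhere else. The Income column itself is left unchanged.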
How it works...
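The ifelse() function works element by element. Its first argument is a logical test, here is.na(dat$Income). Where the test is TRUE, it returns the second argument (the mean of the non-missing Income values); where the test is FALSE, it returns the corresponding element of the third argument (the original Income value).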
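As a small illustration of this element-by-element behavior, the following sketch applies the same pattern to a made-up vector. The values and the names income, obs.mean, and income.imp are hypothetical and are not taken from missing-data.csv.

```r
# Hypothetical income values; NA marks a missing entry
income <- c(100, NA, 300, NA, 200)

# Mean of the observed (non-missing) values only
obs.mean <- mean(income, na.rm = TRUE)

# Where the test is TRUE, take obs.mean; otherwise keep the original value
income.imp <- ifelse(is.na(income), obs.mean, income)

print(income.imp)
# [1] 100 200 300 200 200
```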
There's more...
You cannot impute the mean when a categorical variable has missing values, so you need a different approach. Even for numeric variables, we might sometimes not want to impute the mean for missing values. We discuss an often-used approach here.
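One reason to avoid the mean for numeric variables, offered here as an illustration rather than something the recipe states, is that filling every gap with the same value shrinks the variable's spread. A minimal sketch with made-up numbers (x and x.imp are hypothetical names):

```r
# Hypothetical values; not taken from missing-data.csv
x <- c(10, 20, NA, 40, NA, 50)

# Mean imputation, using the recipe's ifelse() pattern
x.imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The mean is preserved by construction
print(c(observed = mean(x, na.rm = TRUE), imputed = mean(x.imp)))

# The standard deviation of the imputed vector is smaller
print(c(observed = sd(x, na.rm = TRUE), imputed = sd(x.imp)))
```

Sampling from the observed values instead keeps the imputed values as varied as the observed ones, at the cost of adding some randomness.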
Imputing random values sampled from non-missing values
If you want to impute random values sampled from the non-missing values of the variable, you can use the following two functions:
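rand.impute <- function(a) {
  missing <- is.na(a)        # TRUE where a value is missing
  n.missing <- sum(missing)
  a.obs <- a[!missing]       # observed (non-missing) values
  imputed <- a
  # Draw positions rather than values: sample(a.obs, ...) would sample
  # from 1:a.obs if a.obs happened to hold a single number such as 5
  imputed[missing] <- a.obs[sample.int(length(a.obs), n.missing, replace=TRUE)]
  return(imputed)
}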
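random.impute.data.frame <- function(dat, cols) {
  nms <- names(dat)
  for (col in cols) {        # cols holds column positions
    # New column name, for example Income.imputed
    name <- paste(nms[col], ".imputed", sep = "")
    dat[name] <- rand.impute(dat[, col])
  }
  dat                        # copy of dat with the new columns appended
}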
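With these two functions in place, you can use the following to impute random values for both Income and Phone_type (columns 1 and 2 here). The function returns a copy of the data frame with the new .imputed columns appended, so assign the result to a variable if you want to keep it: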
> dat <- read.csv("missing-data.csv", na.strings="")
> random.impute.data.frame(dat, c(1,2))
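If missing-data.csv is not at hand, the approach can be tried on a small made-up data frame. The sketch below restates the two helpers compactly so that it runs on its own; the toy data, the names toy and toy.imp, and the set.seed() call are assumptions added for illustration and are not part of the recipe.

```r
# Compact restatement of the two helpers so this block runs on its own
rand.impute <- function(a) {
  missing <- is.na(a)
  obs <- a[!missing]
  imputed <- a
  # Index into obs so that a single observed number is not treated as 1:n
  imputed[missing] <- obs[sample.int(length(obs), sum(missing), replace = TRUE)]
  imputed
}

random.impute.data.frame <- function(dat, cols) {
  nms <- names(dat)
  for (col in cols) {
    dat[[paste0(nms[col], ".imputed")]] <- rand.impute(dat[[col]])
  }
  dat
}

# Hypothetical stand-in for missing-data.csv
toy <- data.frame(
  Income = c(89800, NA, 47500, 45000, NA),
  Phone_type = c("Android", "iPhone", NA, "Android", NA),
  stringsAsFactors = FALSE
)

set.seed(1)  # only to make the random draws repeatable
toy.imp <- random.impute.data.frame(toy, c(1, 2))

# Count NAs per column: the .imputed columns have none,
# while the original columns are untouched
print(colSums(is.na(toy.imp)))
```

Because the imputed values are drawn at random, repeated runs without set.seed() will generally give different results.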