Removing cases with missing values
Datasets come with varying amounts of missing data. When we have abundant data, we sometimes (not always) want to eliminate the cases that have missing values for one or more variables. This recipe applies when we want to eliminate cases that have any missing values, as well as when we want to selectively eliminate cases that have missing values for a specific variable alone.
Getting ready
Download the missing-data.csv file from the code files for this chapter to your R working directory. Read the data from the missing-data.csv file, while taking care to identify the string used in the input file for missing values. In our file, missing values are shown with empty strings:
> dat <- read.csv("missing-data.csv", na.strings="")
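To see what na.strings does without the chapter's file, the following sketch reads a small inline CSV through the text argument of read.csv(). The column names and values are made up for illustration and are not taken from missing-data.csv.

```r
# Hypothetical three-row CSV; the empty fields stand for missing values
csv_text <- "Income,Phone_type,Car_type
89800,Android,Luxury
,iPhone,Compact
47500,,Compact"

# na.strings = "" tells read.csv() to treat empty strings as NA
toy <- read.csv(text = csv_text, na.strings = "")

# With the default na.strings = "NA", empty text fields are kept as ""
toy_raw <- read.csv(text = csv_text)

print(is.na(toy$Phone_type))       # the empty Phone_type field is now NA
print(toy_raw$Phone_type == "")    # here it is still an empty string
```

Empty fields in numeric columns such as Income are read as NA either way; the na.strings setting matters mainly for text columns.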
How to do it...
To get a data frame that has only the cases with no missing values for any variable, use the na.omit() function:
> dat.cleaned <- na.omit(dat)
Now dat.cleaned contains only those cases from dat that have no missing values in any of the variables.
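As a quick check on a data frame small enough to inspect by eye, the sketch below applies na.omit() to a toy data frame (the values are illustrative, not from missing-data.csv) and reads back which rows were dropped from the "na.action" attribute that na.omit() attaches to its result.

```r
# Toy data frame: rows 2 and 3 each have one missing value
toy <- data.frame(
  Income = c(89800, NA, 47500, 62000),
  Phone  = c("Android", "iPhone", NA, "iPhone")
)

toy_cleaned <- na.omit(toy)

# Number of rows that survived
print(nrow(toy_cleaned))

# na.omit() records the positions of the removed rows in an attribute
dropped <- attr(toy_cleaned, "na.action")
print(as.integer(dropped))
```

Note that the surviving rows keep their original row names, so toy_cleaned has row names 1 and 4 rather than 1 and 2.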
How it works...
The na.omit() function internally uses the is.na() function, which tests whether its argument is NA. When applied to a single value, it returns a single logical (Boolean) value. When applied to a vector, it returns a logical vector of the same length:
> is.na(dat[4,2])
[1] TRUE
> is.na(dat$Income)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[10] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
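Because is.na() also accepts a whole data frame and returns a logical matrix of the same shape, it can be combined with colSums() and rowSums() to count missing values before deciding what to drop. The following is a minimal sketch on a toy data frame with made-up values:

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  Income = c(89800, NA, 47500, NA),
  Phone  = c("Android", "iPhone", NA, "iPhone")
)

# Logical matrix with one cell per value in toy
na_mask <- is.na(toy)

# TRUE counts as 1, so the sums are counts of missing values
na_per_column <- colSums(na_mask)
na_per_row <- rowSums(na_mask)

print(na_per_column)
print(na_per_row)

# anyNA() is a shortcut for "does this contain at least one NA?"
print(anyNA(toy$Income))
```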
There's more...
You will sometimes need to do more than just eliminate the cases with any missing values. We discuss some options in this section.
Eliminating cases with NA for selected variables
We might sometimes want to selectively eliminate cases that have NA only for a specific variable. The example data frame has two missing values for Income. To get a data frame with only these two cases removed, use:
> dat.income.cleaned <- dat[!is.na(dat$Income),]
> nrow(dat.income.cleaned)
[1] 25
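The same idea extends to more than one variable by combining the is.na() tests with the & operator. The sketch below uses a toy data frame with made-up values; which columns count as required is a hypothetical choice made for the example.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  Income = c(89800, NA, 47500, 62000),
  Phone  = c("Android", "iPhone", NA, "iPhone"),
  Car    = c(NA, "Compact", "Luxury", "Compact")
)

# Keep a row only if both Income and Phone are present;
# missing values in Car are tolerated
keep <- !is.na(toy$Income) & !is.na(toy$Phone)
toy_selected <- toy[keep, ]

print(keep)
print(toy_selected)
```

Row 1 is kept even though its Car value is NA, because Car is not among the variables being checked.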
Finding cases that have no missing values
The complete.cases() function takes a data frame or matrix as its argument and returns a logical (Boolean) vector with TRUE for rows that have no missing values, and FALSE otherwise:
> complete.cases(dat)
 [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[10]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
[19]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
Rows 4, 6, 13, and 17 have at least one missing value. Instead of using the na.omit() function, we can do the following as well:
> dat.cleaned <- dat[complete.cases(dat),]
> nrow(dat.cleaned)
[1] 23
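Rather than reading the row numbers off the printed vector, which() can turn the logical result into row positions, which is handy for inspecting the incomplete cases before dropping them. A minimal sketch on a toy data frame with made-up values:

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  Income = c(89800, NA, 47500, 62000),
  Phone  = c("Android", "iPhone", NA, "iPhone")
)

# Positions of rows with at least one NA
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)

# Look at the incomplete cases before removing them
print(toy[incomplete_rows, ])

# Both approaches keep the same rows
by_index <- toy[complete.cases(toy), ]
by_omit <- na.omit(toy)
print(identical(rownames(by_index), rownames(by_omit)))
```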
Converting specific values to NA
Sometimes, we might know that a specific value in a data frame actually means that the data was not available. For example, in the dat data frame, a value of 0 for Income probably means that the data is missing. We can convert these to NA by a simple assignment:
> dat$Income[dat$Income==0] <- NA
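When a column may already contain NA, or when several codes stand for "missing", the %in% operator is a convenient alternative to ==, because it always returns TRUE or FALSE and never NA. The following sketch works on a standalone vector; the -99 code is a hypothetical sentinel chosen for illustration and does not come from the chapter's data.

```r
# Illustrative vector with an existing NA and two sentinel codes
income <- c(89800, 0, NA, 47500, -99, 62000)

# Hypothetical set of values that should be treated as missing
sentinels <- c(0, -99)

# income == 0 would give NA at position 3; %in% gives FALSE there
is_sentinel <- income %in% sentinels

income[is_sentinel] <- NA
print(income)
```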
Excluding NA values from computations
Many R functions return NA when any part of the data they work on is NA. For example, computing the mean or sd of a vector with at least one NA value returns NA as the result. To exclude NA values from the computation, set the na.rm argument to TRUE:
> mean(dat$Income)
[1] NA
> mean(dat$Income, na.rm = TRUE)
[1] 65763.64
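The na.rm argument is accepted by many other summary functions as well, including sum(), sd(), and colMeans(). The sketch below uses small made-up vectors so that the results are easy to verify by hand:

```r
# Illustrative values; the non-missing ones are 50000, 70000 and 60000
income <- c(50000, NA, 70000, 60000)

print(mean(income))                   # NA, because one value is missing

avg <- mean(income, na.rm = TRUE)     # mean of the three remaining values
total <- sum(income, na.rm = TRUE)    # sum of the three remaining values
spread <- sd(income, na.rm = TRUE)    # standard deviation of the same three

# Column-wise means of a data frame, skipping NA within each column
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
col_avgs <- colMeans(toy, na.rm = TRUE)

print(c(avg = avg, total = total, spread = spread))
print(col_avgs)
```

Keep in mind that na.rm = TRUE drops values per computation, so each column mean above may be based on a different set of rows.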