R Data Analysis Cookbook, Second Edition: Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787124479

Length 560 pages

Edition 2nd Edition

Authors (3):

Kuntal Ganguly

Shanthi Viswanathan

Viswa Viswanathan

Removing cases with missing values

Datasets come with varying amounts of missing data. When we have abundant data, we sometimes (not always) want to eliminate the cases that have missing values for one or more variables. This recipe applies when we want to eliminate cases that have any missing values, as well as when we want to selectively eliminate cases that have missing values for a specific variable alone.

Getting ready

Download the missing-data.csv file from the code files for this chapter to your R working directory. Read the data from the missing-data.csv file, while taking care to identify the string used in the input file for missing values. In our file, missing values are shown with empty strings:

> dat <- read.csv("missing-data.csv", na.strings="")
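The effect of na.strings is easiest to see on a tiny input. The following is a minimal sketch, assuming a made-up three-row CSV supplied inline through the text argument of read.csv(); the column names and values are hypothetical and are not taken from missing-data.csv:

```r
# Hypothetical three-row CSV; column names and values are assumptions for illustration.
csv_lines <- c(
  "Income,Phone_type,Car_type",
  "89800,Android,Luxury",
  "47500,,Non-Luxury",
  ",iPhone,"
)
csv_text <- paste(csv_lines, collapse = "\n")

# Default settings: in text columns only the literal string "NA" is treated as missing.
dat_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings = "", empty fields in text columns are read as NA as well.
dat_marked <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

sum(is.na(dat_default$Phone_type))  # 0: the empty field was read as ""
sum(is.na(dat_marked$Phone_type))   # 1: the empty field was read as NA
```

Numeric columns such as Income treat blank fields as NA either way, so the setting matters mostly for text columns.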

How to do it...

To get a data frame that has only the cases with no missing values for any variable, use the na.omit() function:

> dat.cleaned <- na.omit(dat)

Now dat.cleaned contains only those cases from dat that have no missing values in any of the variables.
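As a small self-contained illustration, here is a sketch that applies na.omit() to a hypothetical four-row data frame (the toy data is an assumption, not the recipe's file). It also reads the na.action attribute that na.omit() attaches to its result, which records the rows that were dropped:

```r
# Hypothetical data frame with one NA in each column.
toy <- data.frame(
  Income = c(89800, NA, 47500, 36000),
  Phone_type = c("Android", "iPhone", NA, "Android"),
  stringsAsFactors = FALSE
)

# Keep only rows with no missing value in any column.
toy_cleaned <- na.omit(toy)
nrow(toy_cleaned)  # 2

# The positions of the removed rows are stored in the "na.action" attribute.
dropped_rows <- as.integer(attr(toy_cleaned, "na.action"))
dropped_rows  # 2 3
```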

How it works...

The na.omit() function internally uses the is.na() function, which tests whether its argument is NA. When applied to a single value, it returns a single logical value (TRUE or FALSE). When applied to a vector, it returns a logical vector of the same length:

> is.na(dat[4,2]) 
[1] TRUE 
 
> is.na(dat$Income) 
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 
[10] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE 
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
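The same behavior can be reproduced on a small hypothetical data frame (the toy data is an assumption for illustration). The sketch also shows that is.na() applied to a whole data frame returns a logical matrix, which makes it easy to count missing values per column:

```r
# Hypothetical data frame with one NA in each column.
toy <- data.frame(
  Income = c(89800, NA, 47500),
  Phone_type = c("Android", "iPhone", NA),
  stringsAsFactors = FALSE
)

is.na(toy$Income[2])  # TRUE: a single value gives a single logical
is.na(toy$Income)     # FALSE TRUE FALSE: a vector gives a logical vector

# Applied to a data frame, is.na() returns a logical matrix with the same shape.
na_matrix <- is.na(toy)

# Summing each column of that matrix counts the missing values per variable.
na_counts <- colSums(na_matrix)
na_counts  # Income: 1, Phone_type: 1
```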

There's more...

You will sometimes need to do more than just eliminate the cases with any missing values. We discuss some options in this section.

Eliminating cases with NA for selected variables

We might sometimes want to eliminate only those cases that have NA for a specific variable, and keep cases whose missing values are in other variables. The example data frame has two missing values for Income. To get a data frame with only these two cases removed, use:

> dat.income.cleaned <- dat[!is.na(dat$Income),] 
> nrow(dat.income.cleaned) 
[1] 25
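To make the selective behavior concrete, the following sketch uses a hypothetical four-row data frame (an assumption for illustration). Only the row with a missing Income is dropped; the row with a missing Phone_type survives. The subset() call is shown as an alternative spelling of the same filter:

```r
# Hypothetical data frame: row 2 lacks Income, row 3 lacks Phone_type.
toy <- data.frame(
  Income = c(89800, NA, 47500, 36000),
  Phone_type = c("Android", "iPhone", NA, "Android"),
  stringsAsFactors = FALSE
)

# Drop only the rows where Income is NA.
income_cleaned <- toy[!is.na(toy$Income), ]
nrow(income_cleaned)                  # 3
sum(is.na(income_cleaned$Phone_type)) # 1: the NA in the other column is kept

# The same filter written with subset().
income_cleaned_alt <- subset(toy, !is.na(Income))
rownames(income_cleaned_alt)          # "1" "3" "4"
```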

Finding cases that have no missing values

The complete.cases() function takes a data frame (or a matrix, or one or more vectors) as its argument and returns a logical vector with TRUE for rows that have no missing values, and FALSE otherwise:

> complete.cases(dat)
 [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[10]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
[19]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Rows 4, 6, 13, and 17 have at least one missing value. Instead of using the na.omit() function, we can do the following as well:

> dat.cleaned <- dat[complete.cases(dat),] 
> nrow(dat.cleaned) 
[1] 23
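Rather than reading the row numbers off the printed vector, they can be computed. This is a minimal sketch on a hypothetical four-row data frame (the toy data is an assumption); it also checks that subsetting with complete.cases() keeps the same rows as na.omit():

```r
# Hypothetical data frame: rows 2 and 3 each have one missing value.
toy <- data.frame(
  Income = c(89800, NA, 47500, 36000),
  Phone_type = c("Android", "iPhone", NA, "Android"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the row has no NA in any column.
is_complete <- complete.cases(toy)

# Positions of the rows that have at least one missing value.
incomplete_rows <- which(!is_complete)
incomplete_rows  # 2 3

# Both approaches keep the same rows (na.omit() additionally records what it dropped).
by_complete_cases <- toy[is_complete, ]
by_na_omit <- na.omit(toy)
identical(rownames(by_complete_cases), rownames(by_na_omit))  # TRUE
```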

Converting specific values to NA

Sometimes, we might know that a specific value in a data frame actually means that the data was not available. For example, in the dat data frame, a value of 0 for Income probably means that the data is missing. We can convert these to NA by a simple assignment:

> dat$Income[dat$Income==0] <- NA
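When the column already contains NA values, the comparison with 0 itself yields NA for those entries. The direct assignment still works with a single replacement value, but wrapping the comparison in which() makes the index explicit, because which() returns only the positions where the test is TRUE. The following sketch assumes a hypothetical Income vector in which 0 is the placeholder for missing data:

```r
# Hypothetical data: 0 stands for "not available"; one value is already NA.
toy <- data.frame(Income = c(89800, 0, NA, 36000))

# The raw comparison is NA wherever Income is already NA.
toy$Income == 0  # FALSE TRUE NA FALSE

# which() keeps only the positions of genuine matches (here, position 2).
zero_positions <- which(toy$Income == 0)

# Replace the placeholder zeros with NA.
toy$Income[zero_positions] <- NA
toy$Income  # 89800 NA NA 36000
```

Whether 0 really means "missing" depends on the dataset; a genuine income of zero would be lost by this conversion.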

Excluding NA values from computations

Many R functions return NA when any part of the data they work on is NA. For example, computing the mean or sd of a vector with at least one NA value returns NA as the result. To exclude NA values from the computation, set the na.rm argument to TRUE:

> mean(dat$Income) 
[1] NA 
 
> mean(dat$Income, na.rm = TRUE) 
[1] 65763.64
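The same pattern applies to other summary functions that accept na.rm. A minimal sketch on a hypothetical vector (the values are assumptions chosen so the results are easy to check by hand):

```r
# Hypothetical vector with one missing value.
incomes <- c(10, 20, NA, 30)

mean(incomes)                # NA: the missing value propagates
mean(incomes, na.rm = TRUE)  # 20: computed from 10, 20, 30
sd(incomes, na.rm = TRUE)    # 10
sum(incomes, na.rm = TRUE)   # 60

# Number of values that actually entered the computation.
n_used <- sum(!is.na(incomes))
n_used  # 3
```

Reporting the number of values used alongside the statistic helps readers judge how much data the result rests on.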