Filtering missing data before or during the actual analysis
Let's suppose we want to calculate the mean of the actual length of flights:
> mean(hflights$ActualElapsedTime)
[1] NA
The result is NA, of course, because, as identified previously, this variable contains missing values, and almost every R operation with NA results in NA. So let's overcome this issue as follows:
> mean(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> mean(na.omit(hflights$ActualElapsedTime))
[1] 129.3237
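To see why both calls agree, it may help to look at a small made-up vector instead of the full dataset. The following is only a sketch: the vector x and its values are hypothetical and are not taken from hflights. It shows how NA propagates, and that na.rm = TRUE drops the missing value inside mean, whereas na.omit filters the vector beforehand and records what it removed.

```r
# Hypothetical toy vector standing in for a column with a missing value
x <- c(120, 95, NA, 143)

# NA propagates through arithmetic and through most summary functions
x + 1      # 121  96  NA 144
sum(x)     # NA
mean(x)    # NA

# Two ways of excluding the missing value before averaging
m1 <- mean(x, na.rm = TRUE)   # mean() drops the NA internally
m2 <- mean(na.omit(x))        # the vector is filtered first, then averaged

# na.omit() also records the positions it removed in the "na.action" attribute
cleaned <- na.omit(x)
attr(cleaned, "na.action")
```

Both approaches average the same three remaining values, so m1 and m2 are equal. The difference is that na.omit builds and returns a new, filtered object with extra bookkeeping attached.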
Any performance issues there? Or other means of deciding which method to use?
> library(microbenchmark)
> NA.RM   <- function()
+     mean(hflights$ActualElapsedTime, na.rm = TRUE)
> NA.OMIT <- function()
+     mean(na.omit(hflights$ActualElapsedTime))
> microbenchmark(NA.RM(), NA.OMIT())
Unit: milliseconds
      expr       min        lq    median        uq      max neval
   NA.RM()  7.105485  7.231737  7.500382  8.002941 9.850411   100
 NA.OMIT() 12.268637 12.471294...