By-passing missing values
So it seems that missing data occurs relatively frequently in the time-related variables, but we have no missing values among the flight identifiers and dates. On the other hand, if one value is missing for a flight, the chances are rather high that some other variables are missing as well, out of the overall number of 3,622 cases with at least one missing value:
> mean(cor(apply(hflights, 2, function(x)
+     as.numeric(is.na(x)))), na.rm = TRUE)
[1] 0.9589153
Warning message:
In cor(apply(hflights, 2, function(x) as.numeric(is.na(x)))) :
  the standard deviation is zero
Okay, let's see what we have done here! First, we called the apply function to transform the values of the data.frame to 0 or 1, where 0 stands for an observed value and 1 for a missing one. Then we computed the correlation coefficients of this newly created matrix, which of course returned a lot of missing values due to the fact that some columns had only one unique value without any variability...