Handling missing data
The teaching datasets used for examples in previous chapters rarely had the problem of missing data, where a value that should be present is instead absent. The R language uses the special value NA to indicate these missing values, which cannot be handled natively by most machine learning functions. In Chapter 9, Finding Groups of Data – Clustering with k-means, we were able to replace missing values with a guess of the true value based on other information available in the dataset in a process called imputation. Specifically, the missing age values of high school students were imputed with the average age of students having the same graduation year. This provided a reasonable estimate of the unknown, true age value.
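As a reminder of how this kind of group-wise mean imputation can look in base R, here is a minimal sketch. The toy data frame and its column names (gradyear, age, age_imputed) are made up for illustration and are not the Chapter 9 dataset; the approach shown, ave() combined with ifelse(), is one reasonable way to do it rather than the only one.

```r
# Hypothetical toy data; the values and column names are assumptions
teens <- data.frame(
  gradyear = c(2006, 2006, 2006, 2007, 2007, 2007),
  age      = c(18, NA, 19, 17, 18, NA)
)

# Many summary functions return NA if any input value is NA
mean(teens$age)                 # NA
mean(teens$age, na.rm = TRUE)   # 18

# Mean age within each graduation year, ignoring missing values;
# ave() returns a vector aligned with the original rows
ave_age <- ave(teens$age, teens$gradyear,
               FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed ages and fill only the missing ones with the group mean
teens$age_imputed <- ifelse(is.na(teens$age), ave_age, teens$age)
```

Writing the result to a new column rather than overwriting age keeps a record of which values were observed and which were estimated.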
Missing data is a much greater problem in real-world machine learning projects than its rarity so far would suggest. This is not only because real-world projects are messier and more complex than simple textbook examples...