Handling missing data and split and surrogate variables
Missing data can be a curse for analysis and prediction because it leads to inaccurate inferences from the data. The simplest way to handle missing data is to ignore it, or to remove the affected records from the dataset. This approach is easy to apply, but it is not efficient because it discards information that may still be useful. As a rule of thumb, if missing values make up less than 5 percent of the total dataset, discarding them is unlikely to affect the dataset as a whole.
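The 5 percent rule of thumb can be checked directly before deciding to drop anything. The following is a minimal sketch in base R, not a definitive procedure. The toy data frame, the `threshold` variable, and the choice to measure missingness as the share of incomplete rows are all assumptions made for illustration.

```r
# Toy data frame with a few missing values (illustrative data only)
set.seed(1)
df <- data.frame(x = 1:100, y = 1:100)
df$x[sample(1:100, 3)] <- NA

# Share of rows that contain at least one missing value
incomplete_share <- mean(!complete.cases(df))

# Assumed cut-off taken from the 5 percent rule of thumb
threshold <- 0.05

# Drop incomplete rows only when they fall below the threshold
if (incomplete_share < threshold) {
  df_clean <- na.omit(df)
} else {
  df_clean <- df
}
```

When the share of incomplete rows is at or above the threshold, the data is left untouched so that an imputation method can be applied instead.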
Getting ready
This recipe will familiarize us with using the mice package for filling in missing values.
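As a rough preview of what filling in missing values with mice can look like, here is a minimal sketch. It assumes the mice package is already installed and uses the small `nhanes` example dataset that ships with mice, which is not part of this recipe. The number of imputations, the `pmm` (predictive mean matching) method, and the seed are illustrative choices only.

```r
library(mice)

# nhanes is a small example dataset bundled with mice (assumption: not the recipe's data)
missing_before <- sum(is.na(nhanes))

# Create 5 imputed datasets using predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed dataset
filled <- complete(imp, 1)
missing_after <- sum(is.na(filled))
```

Comparing `missing_before` with `missing_after` shows how many gaps were filled in the first completed dataset.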
How to do it...
Perform the following steps in R:
- Install the required packages, create a sample data frame with missing values, and plot the missing-value pattern:
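> install.packages("mice")
> install.packages("randomForest")
> install.packages("VIM")
> library(VIM)  # aggr() is provided by the VIM package
> t = data.frame(x=c(1:100), y=c(1:100))
> t$x[sample(1:100,10)]=NA  # set 10 random x values to missing
> t$y[sample(1:100,20)]=NA  # set 20 random y values to missing
> aggr(t)  # plot the amount and pattern of missing values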
- Tweaking the aggr function...