Using random forest for imputation
Random forest is an ensemble learning method that uses bootstrap aggregating, also known as bagging, to improve model accuracy. It trains many decision trees, each on a bootstrap sample of the data, and makes predictions by averaging the outputs of those trees (or taking a majority vote for classification), which yields more stable estimates than any single tree. We will use the MissForest algorithm in this recipe, which is an application of the random forest algorithm to missing value imputation.
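To make the averaging concrete, the following minimal sketch fits a small forest with scikit-learn and checks that its prediction matches the mean of the individual tree predictions. The synthetic data and the variable names are assumptions chosen for illustration; they are not part of this recipe's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Each tree in the forest is fit on a bootstrap sample of the rows.
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Prediction from the whole forest for the first five rows.
forest_pred = forest.predict(X[:5])

# Predictions from each tree, averaged by hand.
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
mean_of_trees = tree_preds.mean(axis=0)

print(np.allclose(forest_pred, mean_of_trees))
```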
MissForest starts by filling in missing values with the median (for continuous variables) or the mode (for categorical variables), then uses random forest to predict those values. Using this transformed dataset, with missing values replaced by initial predictions, MissForest generates new predictions, possibly replacing an earlier prediction with a better one. MissForest will typically go through at least 4 iterations of this process.
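The loop described above can be sketched roughly as follows. This is not the MissForest package's implementation: it is a simplified illustration that handles numeric columns only, runs a fixed number of passes instead of checking a stopping criterion, and uses a hypothetical helper name, iterative_forest_impute, along with made-up data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterative_forest_impute(X, n_iter=4, random_state=0):
    """Hypothetical helper: MissForest-style imputation, numeric columns only."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initial guess, the median of each column's observed values.
    medians = np.nanmedian(X, axis=0)
    filled[missing] = np.take(medians, np.where(missing)[1])

    # Step 2: repeatedly re-predict each column's missing cells
    # from the current (already filled-in) values of the other columns.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to train on
            others = np.delete(filled, j, axis=1)
            model = RandomForestRegressor(n_estimators=50, random_state=random_state)
            model.fit(others[~rows], filled[~rows, j])
            filled[rows, j] = model.predict(others[rows])
    return filled

# Made-up data: the second column depends strongly on the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])
data_missing = data.copy()
data_missing[rng.random(200) < 0.2, 1] = np.nan

imputed = iterative_forest_impute(data_missing)
print(np.isnan(imputed).sum())
```

Only the cells that were missing are ever overwritten; observed values pass through unchanged, so each pass refines the imputations using the previous pass's estimates as predictors.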
Getting ready
You will need to install the MissForest and MiceForest modules to run the code in this recipe. You can install both with pip.
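For example, assuming the packages are published on PyPI under the names MissForest and miceforest (check the package index if either name fails to resolve), the installation might look like this:

```shell
pip install MissForest miceforest
```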