Using random forest for imputation
Random forest is an ensemble learning method. It uses bootstrap aggregating, also known as bagging, to improve model accuracy. It makes predictions by repeatedly taking the mean of multiple trees, yielding progressively better estimates. We will use the MissForest
algorithm in this section, which is an application of the random forest algorithm to find missing value imputation.
MissForest
starts by filling in the median or mode (for continuous or categorical features, respectively) for missing values, then uses random forest to predict values. Using this transformed dataset, with missing values replaced with initial predictions, MissForest
generates new predictions, perhaps replacing the initial prediction with a better one. MissForest
will typically go through at least four iterations of this process.
Running MissForest
is even easier than using the KNN imputer, which we used in the previous section. We will impute values for the same wage...