Imputing values with regression
We ended the previous section by assigning a group mean to the missing values rather than the overall sample mean. As we discussed, this is useful when the feature that determines the groups is correlated with the feature that has the missing values. Using regression to impute values is conceptually similar to this, but we typically use it when the imputation will be based on two or more features.
Regression imputation replaces a feature's missing values with values predicted by a regression model of correlated features. This particular kind of imputation is known as deterministic regression imputation since the imputed values all lie on the regression line, and no error or randomness is introduced.
One potential drawback of this approach is that it can substantially reduce the variance of the feature with missing values. We can use stochastic regression imputation to address this drawback. We will explore both approaches in this...