Imputing values with regression
We ended the previous recipe by assigning a group mean, rather than the overall sample mean, to missing values. As we discussed, this is useful when the variable that determines the groups is correlated with the variable that has the missing values. Using regression to impute values is conceptually similar, but we typically use it when the imputation will be based on two or more variables.
Regression imputation replaces a variable’s missing values with values predicted by a regression model fitted on correlated variables. This particular kind of imputation is known as deterministic regression imputation, since the imputed values all lie on the regression line and no error or randomness is introduced.
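To make the idea concrete, here is a minimal sketch of deterministic regression imputation using only NumPy. The function name impute_deterministic and the synthetic data are hypothetical, chosen for illustration, and the sketch assumes the predictor columns themselves have no missing values.

```python
import numpy as np

def impute_deterministic(y, X):
    """Fill NaNs in y with ordinary least squares predictions from X.

    Assumes X (one or more predictor columns) has no missing values.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    observed = ~np.isnan(y)

    # Fit the regression on complete rows only.
    coefs, *_ = np.linalg.lstsq(A[observed], y[observed], rcond=None)

    # Replace each missing value with its prediction; observed values are kept.
    filled = y.copy()
    filled[~observed] = A[~observed] @ coefs
    return filled

# Synthetic data: y depends on two predictors plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Knock out 40 values at random, then impute them.
y_missing = y.copy()
y_missing[rng.choice(200, size=40, replace=False)] = np.nan
y_filled = impute_deterministic(y_missing, X)
```

Every imputed value here is exactly the model's prediction for that row, which is what makes the approach deterministic.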
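One potential drawback of this approach is that it can substantially reduce the variance of the variable with missing values, because the imputed values carry none of the residual variation found in the observed data. We can use stochastic regression imputation, which adds a random error term to each predicted value, to address this drawback. We explore both approaches in this recipe...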
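The variance issue can be illustrated with a short sketch on synthetic data. The helper regression_impute is hypothetical, and drawing the error term from a normal distribution with the residual standard deviation is one common choice that we assume here, not the only way to add randomness.

```python
import numpy as np

def regression_impute(y, X, rng=None):
    """Impute NaNs in y from X by OLS; add normal noise when rng is given.

    With rng=None the imputation is deterministic. Otherwise each prediction
    gets a draw from N(0, sigma^2), where sigma is the residual standard
    deviation of the fitted model (an assumption of this sketch).
    """
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    observed = ~np.isnan(y)
    coefs, *_ = np.linalg.lstsq(A[observed], y[observed], rcond=None)

    predictions = A[~observed] @ coefs
    if rng is not None:
        residuals = y[observed] - A[observed] @ coefs
        dof = observed.sum() - A.shape[1]  # observations minus parameters
        sigma = np.sqrt((residuals ** 2).sum() / dof)
        predictions = predictions + rng.normal(scale=sigma, size=predictions.shape)

    filled = y.copy()
    filled[~observed] = predictions
    return filled

# Synthetic data with a noisy linear relationship.
rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=2.0, size=1000)
missing = np.zeros(1000, dtype=bool)
missing[rng.choice(1000, size=300, replace=False)] = True
y_missing = np.where(missing, np.nan, y)

deterministic = regression_impute(y_missing, x)
stochastic = regression_impute(y_missing, x, rng=rng)

# Compare the spread of the imputed values under each approach.
print("deterministic variance:", np.var(deterministic[missing]))
print("stochastic variance:   ", np.var(stochastic[missing]))
```

The deterministic values reflect only the variation explained by the predictor, while the stochastic values also include residual variation, so their variance should be closer to that of the observed data.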