Summary
In this chapter, we explored the most popular approaches to missing value imputation and discussed the advantages and disadvantages of each. Assigning the overall sample mean is not usually a good approach, particularly when observations with missing values differ from other observations in important ways; mean imputation can also substantially reduce the variance of our data. Forward or backward filling preserves that variance, but it works best when the proximity of observations is meaningful, as with time series or longitudinal data. In most non-trivial cases, we will want to use a multivariate technique, such as regression, KNN, or random forest imputation.
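To make the trade-offs concrete, here is a minimal sketch of the three families of approaches just summarized, using a small hypothetical DataFrame and scikit-learn's `KNNImputer` as the multivariate example (the column names and values are illustrative, not from the chapter):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical time-ordered data with missing values
df = pd.DataFrame({
    "temp": [20.0, np.nan, 22.0, np.nan, 25.0],
    "humidity": [30.0, 35.0, np.nan, 40.0, 45.0],
})

# Mean imputation: fills gaps with the column mean,
# shrinking the variance of the imputed column
mean_imputed = df.fillna(df.mean())

# Forward fill: carries the last observed value forward,
# which preserves variance but assumes observation order
# is meaningful (e.g., time series data)
ffilled = df.ffill()

# KNN imputation: a multivariate approach that estimates
# each missing value from the most similar rows
knn = KNNImputer(n_neighbors=2)
knn_imputed = pd.DataFrame(knn.fit_transform(df), columns=df.columns)
```

Note how `mean_imputed["temp"]` has a smaller variance than the observed values of `temp`, which is exactly the drawback of mean imputation described above.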
So far, we haven't touched on the important issue of data leakage and how to create separate training and testing datasets. To avoid data leakage, we need to work with the training data independently of the testing data from the moment we begin our feature engineering. We will look at feature engineering...
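The leakage-free pattern alluded to here can be sketched as follows: split the data first, then fit the imputer only on the training rows and apply the learned statistics to the test rows. The data and parameter choices below are illustrative assumptions, not from the chapter:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with missing values
X = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "x2": [2.0, 2.5, np.nan, 4.5, 5.0, 5.5],
})
y = pd.Series([0, 1, 0, 1, 0, 1])

# Split first, so the test rows never influence the imputation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Fit the imputer on the training data only...
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)

# ...then apply the training-set means to the test data,
# rather than recomputing them from the test rows
X_test_imp = imputer.transform(X_test)
```

Fitting the imputer on the full dataset before splitting would let information from the test rows leak into the training features, which is precisely the mistake this pattern avoids.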