Dealing with Missing Data: Imputation Strategies
Recall that in Chapter 1, Data Exploration and Cleaning, we encountered a sizable proportion of samples in the dataset (3,021/29,685 = 10.2%) where the value of the PAY_1 feature was missing. This is a problem that needs to be dealt with, because many machine learning algorithms, including the implementations of logistic regression and random forest in scikit-learn, cannot accept input for model training or testing that includes missing values.
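As a quick check of the arithmetic above, and to make the constraint concrete, here is a minimal sketch in plain Python. The toy records, the `ID` values, the use of `None` as the missing-value marker, and the helper `is_missing` are all hypothetical and chosen only for illustration; they are not the encoding or code used for the case study data. The sketch separates records with a missing PAY_1 from complete ones, which is the bookkeeping needed before handing data to an estimator that rejects missing values.

```python
import math

# Hypothetical toy records; None marks a missing PAY_1 value.
# How missing values are encoded in the real dataset is not assumed here.
accounts = [
    {"ID": "a1", "PAY_1": 0},
    {"ID": "a2", "PAY_1": None},
    {"ID": "a3", "PAY_1": 2},
    {"ID": "a4", "PAY_1": -1},
    {"ID": "a5", "PAY_1": None},
]


def is_missing(value):
    """Return True if value is None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Split the records into those with and without a usable PAY_1 value.
missing = [row for row in accounts if is_missing(row["PAY_1"])]
complete = [row for row in accounts if not is_missing(row["PAY_1"])]

missing_fraction = len(missing) / len(accounts)
print(f"Toy data: {len(missing)}/{len(accounts)} missing ({missing_fraction:.1%})")

# The same arithmetic applied to the counts quoted in the text.
chapter_fraction = 3021 / 29685
print(f"Chapter counts: 3,021/29,685 = {chapter_fraction:.1%}")
```

Only the `complete` list could be passed to a model that cannot handle missing values; the `missing` list holds the accounts that would otherwise be left without predictions.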
Our solution to this problem was to simply discard all the samples that had missing values for PAY_1. However, after discussing this issue with our client, we learned that the missing values of PAY_1 were due to a reporting issue that they are working on correcting. In the near term, the client would prefer a method that allows the accounts with missing PAY_1 information to be included in the model prediction process, if one is available. So, we need to consider how we could make predictions for...