Missing Data
As a final note on the use of XGBoost and SHAP, one valuable trait of both packages is their ability to handle missing values. Recall that in Chapter 1, Data Exploration and Cleaning, we found that some samples in the case study data had missing values for the PAY_1 feature. So far, our approach has been to simply remove these samples from the dataset when building models. This is because, unless the missing values are specifically addressed in some way, most of the machine learning models implemented by scikit-learn cannot work with the data. Removing the affected samples is one approach, although it may not be satisfactory, as it involves throwing data away. If the samples make up a very small fraction of the data, this may be fine; in general, however, it's good to know how to deal with missing values.
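Before deciding whether removing samples is acceptable, it helps to check how much data would be discarded. The following is a minimal sketch of that check using pandas. The small DataFrame is a made-up stand-in for the case study data, and it assumes that missing PAY_1 values are represented as NaN, which may not match how they are encoded in the raw file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the case study data; all values are invented for illustration
df = pd.DataFrame({
    'PAY_1': [0, 2, np.nan, -1, np.nan, 0, 1, 0],
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 50000, 500000, 100000, 140000],
})

# Boolean mask that is True where PAY_1 is missing (assumes NaN encoding)
missing_mask = df['PAY_1'].isna()

# The mean of a Boolean mask is the fraction of True values,
# that is, the fraction of samples that would be thrown away
fraction_missing = missing_mask.mean()

# Keep only the samples with a known PAY_1 value
df_complete = df.loc[~missing_mask].copy()

print(f'Fraction of samples with missing PAY_1: {fraction_missing:.2f}')
print(f'Samples kept: {len(df_complete)} of {len(df)}')
```

In this toy example, a quarter of the samples would be lost, which would be hard to justify. With a much smaller fraction, dropping the samples may be a reasonable choice.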
There are several approaches for imputing missing values of features, such as filling them in with the mean or mode of the non-missing values of that feature, or a randomly selected...