Imputing missing values using machine learning models
Beyond replacing missing values using statistical measures such as the mean, median, or percentiles, we can also use machine learning models to impute missing values. This process involves predicting the missing values based on the data available in other fields.
A very popular method is to use the KNN imputation. This involves identifying the k-nearest complete data points (neighbors) that surround the missing values and using the average of the values of these k-nearest data points to replace the missing values:
Figure 9.21: Illustration of KNN using house prices in a neighborhood
The preceding diagram gives a sense of how imputation works, specifically using the KNN algorithm. The price of the house with the question mark can be estimated based on the price of neighboring houses. In this example, we are using two immediate neighboring houses and five neighboring houses (K =2 and K =5, respectively...