Using KNN imputation
KNN is a popular machine learning technique because it is intuitive, easy to run, and yields good results when the number of features and observations is not large. For the same reasons, it is often used to impute missing values. As its name suggests, KNN identifies the k observations whose features are most similar to those of a given observation. When used for imputation, KNN fills in a missing value based on the values of those nearest neighbors.
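As a minimal sketch of the idea (using a tiny made-up array rather than the NLS data), scikit-learn's `KNNImputer` replaces a missing value with the average of that feature across the k rows closest to the incomplete row:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: one missing value in the second column
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [8.0, 8.0]])

# With n_neighbors=2, the missing entry is filled with the mean of the
# second-column values of the two rows nearest to row 1 (rows 0 and 2),
# i.e. (2.0 + 4.0) / 2 = 3.0
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Distances between rows with missing entries are computed on the features both rows share, so the imputer works directly on incomplete data without a separate masking step.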
We can use KNN imputation to perform the same imputation we did with regression in the previous section:
- Let's start by importing `KNNImputer` from scikit-learn and loading the NLS data again:

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

nls97 = pd.read_csv("data/nls97b.csv")
nls97.set_index("personid", inplace=True)
```
- Next, we must prepare the features. We collapse degree attainment into three categories – less than...