Performing mean or median imputation
Mean or median imputation consists of replacing missing values with the variable's mean or median. The mean or median is calculated using the train set, and these values are then used to impute missing data in the train and test sets, as well as in all future data we intend to use with the machine learning model. Out of the box, scikit-learn and feature-engine transformers learn the mean or median from the train set and store these parameters for future use. In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.
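The learn-on-train, apply-everywhere idea described above can be sketched as follows. This is a minimal illustration, not the recipe's own code: the toy DataFrames and the column names age and income are hypothetical, and only pandas and scikit-learn's SimpleImputer are shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are for illustration only.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0],
    "income": [1000.0, np.nan, 3000.0, 8000.0],
})
test = pd.DataFrame({
    "age": [np.nan, 40.0],
    "income": [2000.0, np.nan],
})

# pandas: learn the medians from the train set only, then reuse them
# to fill missing values in both the train and the test set.
medians = train.median()
train_pd = train.fillna(medians)
test_pd = test.fillna(medians)

# scikit-learn: fit() learns the medians from the train set and stores
# them in the statistics_ attribute; transform() applies them to any data.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)
test_sk = pd.DataFrame(imputer.transform(test), columns=test.columns)
```

Both approaches fill the test set with values learned from the train set, so no information from the test set leaks into the imputation.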
Tip
Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the distribution of the original variables if there is a high percentage of missing data.
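To make the tip concrete, here is a small sketch on a made-up, heavily skewed series with a high share of missing values. The numbers are hypothetical and chosen only to show how a single extreme value pulls the mean but not the median, and how imputation shrinks the spread of the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed variable: one extreme value, 3 of 8 entries missing.
s = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0, np.nan, np.nan, np.nan])

mean_val = s.mean()      # pulled upwards by the extreme value
median_val = s.median()  # 2.0, unaffected by the extreme value

# Impute with the median and compare the spread before and after.
filled = s.fillna(median_val)
std_before = s.std()       # computed on observed values only
std_after = filled.std()   # smaller: imputed values cluster at the median
```

With this many missing values, the imputed series concentrates around a single value, which is the distortion the tip warns about.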
How to do it...
Let’s begin this recipe:
- First, we’ll import pandas and the required functions and classes from scikit-learn and feature-engine:
  import...