Finding extreme values for imputation
Replacing missing values with a value at the end of the variable distribution (an extreme value) is like replacing them with an arbitrary value, except that the value is selected automatically from the tail of the distribution instead of being set manually.
We can replace missing data with a value that is greater or smaller than most values in the variable. To select a value that is greater, we can use the mean plus a factor of the standard deviation. Alternatively, we can set it to the 75th quantile + IQR × 1.5. IQR stands for inter-quartile range and is the difference between the 75th and 25th quantile. To replace missing data with values that are smaller than the remaining values, we can use the mean minus a factor of the standard deviation, or the 25th quantile – IQR × 1.5.
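As a minimal sketch of these four formulas, the snippet below computes the candidate imputation values for a small toy series. The series and the variable names are made up for illustration, and the factor of 3 applied to the standard deviation is just one common choice.

```python
import pandas as pd

# Toy variable with one missing value (hypothetical data, not the recipe's dataset).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])

# pandas skips missing values when computing quantiles, mean and std.
q25, q75 = s.quantile(0.25), s.quantile(0.75)
iqr = q75 - q25

# IQR proximity rule: right and left tails.
right_iqr = q75 + 1.5 * iqr
left_iqr = q25 - 1.5 * iqr

# Mean plus or minus a factor of the standard deviation (factor of 3 assumed here).
right_gauss = s.mean() + 3 * s.std()
left_gauss = s.mean() - 3 * s.std()

print(right_iqr, left_iqr)
print(right_gauss, left_gauss)
```

Either right-tail value is larger than every observed value in the toy series, and either left-tail value is smaller, which is the property the imputation relies on.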
Note
End-of-tail imputation may distort the distribution of the original variables, so it may not be suitable for linear models.
In this recipe, we will implement end-of-tail or extreme value imputation using pandas and feature-engine.
How to do it...
To begin this recipe, let’s import the necessary tools and load the data:
- Let’s import pandas and the required function and class:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import EndTailImputer
- Let’s load the dataset we described in the Technical requirements section:
data = pd.read_csv("credit_approval_uci.csv")
- Let’s capture the numerical variables in a list, excluding the target:
numeric_vars = [
    var for var in data.select_dtypes(exclude="O").columns.to_list()
    if var != "target"
]
- Let’s split the data into train and test sets, keeping only the numerical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[numeric_vars],
    data["target"],
    test_size=0.3,
    random_state=0,
)
- We’ll now determine the IQR:
IQR = X_train.quantile(0.75) - X_train.quantile(0.25)
We can visualize the IQR values by executing IQR or print(IQR):
A2 16.4200
A3 6.5825
A8 2.8350
A11 3.0000
A14 212.0000
A15 450.0000
dtype: float64
- Let’s create a dictionary with the variable names and the imputation values:
imputation_dict = (X_train.quantile(0.75) + 1.5 * IQR).to_dict()
Note
If we use the inter-quartile range proximity rule, we determine the imputation values by adding 1.5 times the IQR to the 75th quantile. If variables are normally distributed, we can calculate the imputation values as the mean plus a factor of the standard deviation: imputation_dict = (X_train.mean() + 3 * X_train.std()).to_dict().
- Finally, let’s replace the missing data:
X_train_t = X_train.fillna(value=imputation_dict)
X_test_t = X_test.fillna(value=imputation_dict)
Note
We can also replace missing data with values at the left tail of the distribution using value = X_train[var].quantile(0.25) - 1.5 * IQR[var] or value = X_train[var].mean() - 3 * X_train[var].std().
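The left-tail variant can be sketched for all variables at once, following the same pattern as the right-tail steps. The two toy DataFrames and their column names below are hypothetical, and the values are learned from the train set only.

```python
import numpy as np
import pandas as pd

# Toy train and test sets (hypothetical column names and values).
train = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "x2": [10.0, 20.0, np.nan, 30.0, 40.0, 50.0],
})
test = pd.DataFrame({"x1": [np.nan, 2.5], "x2": [15.0, np.nan]})

# Learn the left-tail values from the train set only.
iqr = train.quantile(0.75) - train.quantile(0.25)
left_dict = (train.quantile(0.25) - 1.5 * iqr).to_dict()

# Apply the same learned values to both sets.
train_t = train.fillna(value=left_dict)
test_t = test.fillna(value=left_dict)

print(left_dict)
print(test_t)
```

Reusing the dictionary learned from the train set on the test set keeps the two sets consistent and avoids leaking test information into the imputation values.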
To finish, let’s impute missing values using feature-engine.
- Let’s set up imputer to estimate a value at the right of the distribution using the IQR proximity rule:
imputer = EndTailImputer(
    imputation_method="iqr",
    tail="right",
    fold=3,
    variables=None,
)
Note
To use the mean and standard deviation to calculate the replacement values, set imputation_method="gaussian". Use left or right in the tail argument to specify the side of the distribution to consider when finding values for the imputation.
- Let’s fit EndTailImputer() to the train set so that it learns the values for the imputation:
imputer.fit(X_train)
- Let’s inspect the learned values:
imputer.imputer_dict_
The previous command returns a dictionary with the values to use to impute each variable:
{'A2': 88.18, 'A3': 27.31, 'A8': 11.504999999999999, 'A11': 12.0, 'A14': 908.0, 'A15': 1800.0}
- Finally, let’s replace the missing values:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Remember that you can corroborate that the missing values were replaced by using X_train[['A2', 'A3', 'A8', 'A11', 'A14', 'A15']].isnull().mean().
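The values learned by the imputer were obtained with fold=3, whereas the pandas steps used a factor of 1.5, so the two approaches do not produce the same imputation values. The short sketch below makes the difference concrete using only the numbers printed in this recipe. It assumes the imputer computes the 75th quantile plus fold times the IQR, and the printed IQR values may be rounded for display.

```python
# IQR values printed earlier in the recipe.
iqr = {"A2": 16.42, "A3": 6.5825, "A8": 2.835,
       "A11": 3.0, "A14": 212.0, "A15": 450.0}

# Values shown in imputer_dict_ (learned with fold=3).
learned = {"A2": 88.18, "A3": 27.31, "A8": 11.504999999999999,
           "A11": 12.0, "A14": 908.0, "A15": 1800.0}

fold = 3

# Assumption: learned value = 75th quantile + fold * IQR,
# so the 75th quantile can be backed out from the two dictionaries.
implied_q75 = {var: learned[var] - fold * iqr[var] for var in iqr}

# Value the same rule gives with the 1.5 factor used in the pandas steps.
with_fold_1_5 = {var: implied_q75[var] + 1.5 * iqr[var] for var in iqr}

for var in iqr:
    print(var, round(implied_q75[var], 4), round(with_fold_1_5[var], 4))
```

To reproduce the pandas result with feature-engine under this assumption, fold would be set to 1.5 instead of 3.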
How it works...
In this recipe, we replaced missing values in numerical variables with a number at the end of the distribution using pandas and feature-engine.
We determined the imputation values according to the formulas described in the introduction to this recipe. We used pandas quantile() to find specific quantile values, or pandas mean() and std() for the mean and standard deviation. With pandas fillna(), we replaced the missing values.
To replace missing values with EndTailImputer() from feature-engine, we set imputation_method to iqr to calculate the values based on the IQR proximity rule. With tail set to right, the transformer found the imputation values from the right of the distribution. With fit(), the imputer learned and stored the values for the imputation in a dictionary in the imputer_dict_ attribute. With transform(), we replaced the missing values, returning DataFrames.