Python Feature Engineering Cookbook, Third Edition
Soledad Galli

Imputing Missing Data

Missing data—meaning the absence of values for certain observations—is an unavoidable problem in most data sources. Some machine learning model implementations can handle missing data out of the box. To train other models, we must remove observations with missing data or transform them into permitted values.

The act of replacing missing data with their statistical estimates is called imputation. The goal of any imputation technique is to produce a complete dataset. There are multiple imputation methods, and which one we use depends on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several imputation methods.

This chapter will cover the following recipes:

  • Removing observations with missing data
  • Performing mean or median imputation
  • Imputing categorical variables
  • Replacing missing values with an arbitrary number
  • Finding extreme values for imputation
  • Marking imputed values
  • Implementing forward and backward fill
  • Carrying out interpolation
  • Performing multivariate imputation by chained equations
  • Estimating missing data with nearest neighbors

Technical requirements

In this chapter, we will use the Python libraries Matplotlib, pandas, NumPy, scikit-learn, and Feature-engine. If you need to install Python, the free Anaconda Python distribution (https://www.anaconda.com/) includes most numerical computing libraries.

feature-engine can be installed with pip as follows:

pip install feature-engine

If you use Anaconda, you can install feature-engine with conda:

conda install -c conda-forge feature_engine

Note

The recipes from this chapter were created using the latest versions of the Python libraries at the time of publishing. You can check the versions in the requirements.txt file in the accompanying GitHub repository, at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/requirements.txt.

We will use the Credit Approval dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/), licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/legalcode. You’ll find the dataset at this link: http://archive.ics.uci.edu/dataset/27/credit+approval.

I downloaded and modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/credit-approval-dataset.ipynb

We will also use the air passenger dataset located in Facebook’s Prophet GitHub repository (https://github.com/facebook/prophet/blob/main/examples/example_air_passengers.csv), licensed under the MIT license: https://github.com/facebook/prophet/blob/main/LICENSE

I modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/air-passengers-dataset.ipynb

You’ll find a copy of the modified data sets in the accompanying GitHub repository: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/

Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. With CCA, we preserve the distribution of the variables, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.

Note

Use CCA only when a small number of observations are missing and you have good reasons to believe that they are not important to your model.
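Before committing to CCA, it helps to check how much data you would lose. The following is a minimal sketch (assuming the credit_approval_uci.csv file described in the Technical requirements section) that prints the fraction of rows containing at least one missing value:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")

    # Fraction of rows that CCA would discard (rows with any missing value).
    print(data.isnull().any(axis=1).mean())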

How to do it...

Let’s begin by making some imports and loading the dataset:

  1. Let’s import pandas, matplotlib, and the train/test split function from scikit-learn:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load and display the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    data.head()

    In the following image, we see the first 5 rows of data:

Figure 1.1 – First 5 rows of the dataset

  3. Let’s proceed as we normally would if we were preparing the data to train machine learning models, by splitting the data into a training and a test set:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.30,
        random_state=42,
    )
  4. Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
    fig, axes = plt.subplots(
        2, 1, figsize=(15, 10), squeeze=False)
    X_train.isnull().mean().plot(
        kind='bar', color='grey', ax=axes[0, 0], title="train")
    X_test.isnull().mean().plot(
        kind='bar', color='black', ax=axes[1, 0], title="test")
    axes[0, 0].set_ylabel('Fraction of NAN')
    axes[1, 0].set_ylabel('Fraction of NAN')
    plt.show()

    The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test sets (bottom):

Figure 1.2 – Proportion of missing data per variable

  5. Now, we’ll remove observations if they have missing values in any variable:
    train_cca = X_train.dropna()
    test_cca = X_test.dropna()

Note

pandas’ dropna() drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: data.dropna(subset=["A3", "A4"]).

  6. Let’s print and compare the size of the original and complete case datasets:
    print(f"Total observations: {len(X_train)}")
    print(f"Observations without NAN: {len(train_cca)}")

    We removed more than 200 observations with missing data from the training set, as shown in the following output:

    Total observations: 483
    Observations without NAN: 264
  7. After removing observations from the training and test sets, we need to align the target variables:
    y_train_cca = y_train.loc[train_cca.index]
    y_test_cca = y_test.loc[test_cca.index]

    Now, the datasets and target variables contain the rows without missing data.

  8. To drop observations with missing data utilizing feature-engine, let’s import the required transformer:
    from feature_engine.imputation import DropMissingData
  9. Let’s set up the imputer to automatically find the variables with missing data:
    cca = DropMissingData(variables=None, missing_only=True)
  10. Let’s fit the transformer so that it finds the variables with missing data:
    cca.fit(X_train)
  11. Let’s inspect the variables with NAN that the transformer found:
    cca.variables_

    The previous command returns the names of the variables with missing data:

    ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
  12. Let’s remove the rows with missing data in the training and test sets:
    train_cca = cca.transform(X_train)
    test_cca = cca.transform(X_test)

    Use train_cca.isnull().sum() to corroborate the absence of missing data in the complete case dataset.

  13. DropMissingData can automatically adjust the target after removing missing data from the training set:
    train_c, y_train_c = cca.transform_x_y( X_train, y_train)
    test_c, y_test_c = cca.transform_x_y(X_test, y_test)

The previous code removed rows with nan from the training and test sets and then re-aligned the target variables.

Note

To remove observations with missing data in a subset of variables, use DropMissingData(variables=['A3', 'A4']). To remove rows with nan in at least 5% of the variables, use DropMissingData(threshold=0.95).

How it works...

In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.

We used pandas isnull() and mean() methods to determine the proportion of missing observations in each variable. The isnull() method created a Boolean vector per variable with True and False values indicating whether a value was missing. The mean() method took the average of these values and returned the proportion of missing data.

We used pandas plot.bar() to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of nan per variable in the training and test sets.

To remove observations with missing values in any variable, we used pandas’ dropna(), thereby obtaining a complete case dataset.

Finally, we removed missing data using Feature-engine’s DropMissingData(). This imputer automatically identified and stored the variables with missing data from the train set when we called the fit() method. With the transform() method, the imputer removed observations with nan in those variables. With transform_x_y(), the imputer removed rows with nan from the data sets and then realigned the target variable.

See also

If you want to use DropMissingData() within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s Pipeline: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.
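As a rough sketch of what such a pipeline could look like (assuming feature-engine version 1.8 or later, where Pipeline is importable from feature_engine.pipeline, and reusing X_train and y_train from this recipe), DropMissingData drops the incomplete rows and realigns the target before the final estimator is trained:

    from feature_engine.imputation import DropMissingData
    from feature_engine.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([
        ("drop_na", DropMissingData(missing_only=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # During fit, rows with nan are removed and y_train is realigned before
    # the classifier is trained. We keep numerical columns only, since
    # LogisticRegression cannot handle string values.
    pipe.fit(X_train[["A2", "A3", "A8", "A11"]], y_train)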

Performing mean or median imputation

Mean or median imputation consists of replacing missing data with the variable’s mean or median value. To avoid data leakage, we determine the mean or median using the train set, and then use these values to impute the train and test sets, and all future data.

Scikit-learn and Feature-engine learn the mean or median from the train set and store these parameters for future use out of the box.

In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.

Note

Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the variable distribution if there is a high percentage of missing data.
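One quick way to decide between the mean and the median is to look at the skewness of the numerical variables; strongly skewed variables are usually better imputed with the median. A small sketch, assuming the dataset from the Technical requirements section:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")
    numeric_vars = data.select_dtypes(exclude="O").columns

    # Values far from 0 indicate skewed distributions; for those variables
    # the median is usually a safer estimate than the mean.
    print(data[numeric_vars].skew())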

How to do it...

Let’s begin this recipe:

  1. First, we’ll import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import MeanMedianImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets with their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s make a list with the numerical variables by excluding variables of type object:
    numeric_vars = X_train.select_dtypes(
        exclude="O").columns.to_list()

    If you execute numeric_vars, you will see the names of the numerical variables: ['A2', 'A3', 'A8', 'A11', 'A14', 'A15'].

  5. Let’s capture the variables’ median values in a dictionary:
    median_values = X_train[
        numeric_vars].median().to_dict()

Tip

Note how we calculate the median using the train set. We will use these values to replace missing data in the train and test sets. To calculate the mean, use pandas mean() instead of median().

If you execute median_values, you will see a dictionary with the median value per variable: {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}.

  6. Let’s replace missing data with the median:
    X_train_t = X_train.fillna(value=median_values)
    X_test_t = X_test.fillna(value=median_values)

    If you execute X_train_t[numeric_vars].isnull().sum() after the imputation, the number of missing values in the numerical variables should be 0.

Note

pandas fillna() returns a new dataset with imputed values by default. To replace missing data in the original DataFrame, set the inplace parameter to True: X_train.fillna(value=median_values, inplace=True).

Now, let’s impute missing values with the median using scikit-learn.

  7. Let’s set up the imputer to replace missing data with the median:
    imputer = SimpleImputer(strategy="median")

Note

To perform mean imputation, set SimpleImputer() as follows: imputer = SimpleImputer(strategy = "mean").

  8. We restrict the imputation to the numerical variables by using ColumnTransformer():
    ct = ColumnTransformer(
        [("imputer", imputer, numeric_vars)],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ).set_output(transform="pandas")

Note

Scikit-learn can return numpy arrays, pandas DataFrames, or polars DataFrames, depending on how we set the transform output. By default, it returns numpy arrays.

  9. Let’s fit the imputer to the train set so that it learns the median values:
    ct.fit(X_train)
  10. Let’s check out the learned median values:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the median values per variable:

    array([ 28.835,   2.75,   1.,   0., 160.,   6.])
  11. Let’s replace missing values with the median:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)
  12. Let’s display the resulting training set:
    print(X_train_t.head())

    We see the resulting DataFrame in the following image:

Figure 1.3 – Training set after the imputation. The imputed variables are marked by the imputer prefix; the untransformed variables show the prefix remainder

Finally, let’s perform median imputation using feature-engine.

  13. Let’s set up the imputer to replace missing data in numerical variables with the median:
    imputer = MeanMedianImputer(
        imputation_method="median",
        variables=numeric_vars,
    )

Note

To perform mean imputation, change imputation_method to "mean". By default MeanMedianImputer() will impute all numerical variables in the DataFrame, ignoring categorical variables. Use the variables argument to restrict the imputation to a subset of numerical variables.

  14. Fit the imputer so that it learns the median values:
    imputer.fit(X_train)
  15. Inspect the learned medians:
    imputer.imputer_dict_

    The previous command returns the median values in a dictionary:

    {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}
  16. Finally, let’s replace the missing values with the median:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Feature-engine’s MeanMedianImputer() returns a DataFrame. You can check that the imputed variables do not contain missing values using X_train[numeric_vars].isnull().mean().

How it works...

In this recipe, we replaced missing data with the variable’s median values using pandas, scikit-learn, and feature-engine.

We divided the dataset into train and test sets using scikit-learn’s train_test_split() function. The function takes the predictor variables, the target, the fraction of observations to retain in the test set, and a random_state value for reproducibility, as arguments. It returned a train set with 70% of the original observations and a test set with 30% of the original observations. The 70:30 split was done at random.

To impute missing data with pandas, in step 5, we created a dictionary with the numerical variable names as keys and their medians as values. The median values were learned from the training set to avoid data leakage. To replace missing data, we applied pandas’ fillna() to the train and test sets, passing the dictionary with the median values per variable as a parameter.

To replace the missing values with the median using scikit-learn, we used SimpleImputer() with the strategy set to "median". To restrict the imputation to numerical variables, we used ColumnTransformer(). With the remainder argument set to passthrough, we made ColumnTransformer() return all the variables seen in the training set in the transformed output; the imputed ones followed by those that were not transformed.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.
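If you prefer to keep the original column names, ColumnTransformer() accepts the verbose_feature_names_out parameter (available from scikit-learn version 1.0). A sketch, reusing imputer and numeric_vars from this recipe:

    ct = ColumnTransformer(
        [("imputer", imputer, numeric_vars)],
        remainder="passthrough",
        # Do not prepend the transformer name to the output column names.
        verbose_feature_names_out=False,
    ).set_output(transform="pandas")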

In step 8, we set the output of the column transformer to pandas to obtain a DataFrame as a result. By default, ColumnTransformer() returns numpy arrays.

Note

From version 1.4.0, scikit-learn transformers can return numpy arrays, pandas DataFrames, or polars DataFrames as a result of the transform() method.

With fit(), SimpleImputer() learned the median of each numerical variable in the train set and stored them in its statistics_ attribute. With transform(), it replaced the missing values with the medians.

To replace missing values with the median using Feature-engine, we used the MeanMedianImputer() with the imputation_method set to median. To restrict the imputation to a subset of variables, we passed the variable names in a list to the variables parameter. With fit(), the transformer learned and stored the median values per variable in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values, returning a pandas DataFrame.

Imputing categorical variables

We typically impute categorical variables with the most frequent category, or with a specific string. To avoid data leakage, we find the frequent categories from the train set. Then, we use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the frequent categories for the imputation, out of the box.

In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.

How to do it...

To begin, let’s make a few imports and prepare the data:

  1. Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import CategoricalImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets and their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s capture the categorical variables in a list:
    categorical_vars = X_train.select_dtypes(
        include="O").columns.to_list()
  5. Let’s store the variables’ most frequent categories in a dictionary:
    frequent_values = X_train[
        categorical_vars].mode().iloc[0].to_dict()
  6. Let’s replace missing values with the frequent categories:
    X_train_t = X_train.fillna(value=frequent_values)
    X_test_t = X_test.fillna(value=frequent_values)

Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).

  7. To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:
    imputation_dict = {var:
         "no_data" for var in categorical_vars}

    Now, we can use this dictionary and the code in step 6 to replace missing data.

Note

With pandas value_counts() we can see the string added by the imputation. Try executing, for example, X_train_t["A1"].value_counts().

Now, let’s impute missing values with the most frequent category using scikit-learn.

  8. Let’s set up the imputer to find the most frequent category per variable:
    imputer = SimpleImputer(strategy='most_frequent')

Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.

  9. Let’s restrict the imputation to the categorical variables:
    ct = ColumnTransformer(
        [("imputer",imputer, categorical_vars)],
        remainder="passthrough"
        ).set_output(transform="pandas")

Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").

  10. Fit the imputer to the train set so that it learns the most frequent values:
    ct.fit(X_train)
  11. Let’s take a look at the most frequent values learned by the imputer:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the most frequent values per variable:

    array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
  12. Finally, let’s replace missing values with the frequent categories:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)

    Make sure to inspect the resulting DataFrames by executing X_train_t.head().

Note

The ColumnTransformer() changes the names of the variables. The imputed variables show the prefix imputer and the untransformed variables the prefix remainder.

Finally, let’s impute missing values using feature-engine.

  13. Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
    imputer = CategoricalImputer(
        imputation_method="frequent",
        variables=categorical_vars,
    )

Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.

  14. Fit the imputer to the train set so that it learns the most frequent categories:
    imputer.fit(X_train)

Note

To impute categorical variables with a specific string, set imputation_method to missing and fill_value to the desired string.

  15. Let’s check out the learned categories:
    imputer.imputer_dict_

    We can see the dictionary with the most frequent values in the following output:

    {'A1': 'b',
     'A4': 'u',
     'A5': 'g',
     'A6': 'c',
     'A7': 'v',
     'A9': 't',
     'A10': 'f',
     'A12': 'f',
     'A13': 'g'}
  16. Finally, let’s replace the missing values with frequent categories:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.

How it works...

In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.

In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas mode(), and to return a dictionary, we used pandas to_dict(). To replace the missing data, we used pandas fillna(), passing the dictionary with the variables and their frequent categories as parameters. A variable can have more than one mode, which is why we made sure to capture only one of those values by using .iloc[0].

To replace the missing values using scikit-learn, we used SimpleImputer() with the strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the training set as a result of the transform() method.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.

SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() method.

To replace missing values with feature-engine, we used the CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.

Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless we specifically tell it to also impute numerical variables by setting the ignore_format parameter to True. In addition, with feature-engine transformers we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.

Replacing missing values with an arbitrary number

We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.

When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.

Note

We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. This imputation technique distorts the original variable distribution.

In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by importing the necessary tools and loading the data:

  1. Import pandas and the required functions and classes:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from feature_engine.imputation import ArbitraryNumberImputer
  2. Let’s load the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

    We will select arbitrary values greater than the maximum value of the distribution.

  4. Let’s find the maximum value of four numerical variables:
    X_train[['A2','A3', 'A8', 'A11']].max()

    The previous command returns the following output:

    A2     76.750
    A3     26.335
    A8     28.500
    A11    67.000
    dtype: float64

    We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Now, we replace the missing values with 99:
    X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)
    X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)

Note

To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.

Now, we’ll impute missing values with an arbitrary number using scikit-learn.

  7. Let’s set up imputer to replace missing values with 99:
    imputer = SimpleImputer(strategy='constant', fill_value=99)

Note

If your dataset contains categorical variables, SimpleImputer() will add 99 to those variables as well if any values are missing.

  8. Let’s fit imputer to a slice of the train set containing the variables to impute:
    vars = ["A2", "A3", "A8", "A11"]
    imputer.fit(X_train[vars])
  9. Replace the missing values with 99 in the desired variables:
    X_train_t[vars] = imputer.transform(X_train[vars])
    X_test_t[vars] = imputer.transform(X_test[vars])

    Go ahead and check the lack of missing values by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().

    To finish, let’s impute missing values using feature-engine.

  10. Let’s set up the imputer to replace missing values with 99 in 4 specific variables:
    imputer = ArbitraryNumberImputer(
        arbitrary_number=99,
        variables=["A2", "A3", "A8", "A11"],
    )

Note

ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.

  11. Finally, let’s replace the missing values with 99:
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

Note

To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).

We have now replaced missing data with arbitrary numbers using three different open-source libraries.

How it works...

In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 6, we used pandas fillna() to replace the missing data.

To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().

To replace missing values with feature-engine we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.

Finding extreme values for imputation

Replacing missing values with a value at the end of the variable distribution (extreme values) is like replacing them with an arbitrary value, but instead of setting the arbitrary values manually, the values are automatically selected from the end of the variable distribution.

We can replace missing data with a value that is greater or smaller than most values in the variable. To select a value that is greater, we can use the mean plus a factor of the standard deviation. Alternatively, we can set it to the 75th quantile + IQR × 1.5. IQR stands for inter-quartile range and is the difference between the 75th and 25th quantile. To replace missing data with values that are smaller than the remaining values, we can use the mean minus a factor of the standard deviation, or the 25th quantile – IQR × 1.5.

Note

End-of-tail imputation may distort the distribution of the original variables, so it may not be suitable for linear models.

In this recipe, we will implement end-of-tail or extreme value imputation using pandas and feature-engine.

How to do it...

To begin this recipe, let’s import the necessary tools and load the data:

  1. Let’s import pandas and the required function and class:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.imputation import EndTailImputer
  2. Let’s load the dataset we described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s capture the numerical variables in a list, excluding the target:
    numeric_vars = [
        var for var in data.select_dtypes(
            exclude="O").columns.to_list()
        if var !="target"
    ]
  4. Let’s split the data into train and test sets, keeping only the numerical variables:
    X_train, X_test, y_train, y_test = train_test_split(
        data[numeric_vars],
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  5. We’ll now determine the IQR:
    IQR = X_train.quantile(0.75) - X_train.quantile(0.25)

    We can visualize the IQR values by executing IQR or print(IQR):

    A2      16.4200
    A3       6.5825
    A8       2.8350
    A11      3.0000
    A14    212.0000
    A15    450.0000
    dtype: float64
  6. Let’s create a dictionary with the variable names and the imputation values:
    imputation_dict = (
        X_train.quantile(0.75) + 1.5 * IQR).to_dict()

Note

If we use the inter-quartile range proximity rule, we determine the imputation values by adding 1.5 times the IQR to the 75th quantile. If variables are normally distributed, we can calculate the imputation values as the mean plus a factor of the standard deviation, imputation_dict = (X_train.mean() + 3 * X_train.std()).to_dict().

  7. Finally, let’s replace the missing data:
    X_train_t = X_train.fillna(value=imputation_dict)
    X_test_t = X_test.fillna(value=imputation_dict)

Note

We can also replace missing data with values at the left tail of the distribution using value = X_train[var].quantile(0.25) - 1.5 * IQR or value = X_train[var].mean() - 3 * X_train[var].std().

To finish, let’s impute missing values using feature-engine.

  8. Let’s set up imputer to estimate a value at the right of the distribution using the IQR proximity rule:
    imputer = EndTailImputer(
        imputation_method="iqr",
        tail="right",
        fold=3,
        variables=None,
    )

Note

To use the mean and standard deviation to calculate the replacement values, set imputation_method="gaussian". Use left or right in the tail argument to specify the side of the distribution to consider when finding values for the imputation.

  9. Let’s fit EndTailImputer() to the train set so that it learns the values for the imputation:
    imputer.fit(X_train)
  10. Let’s inspect the learned values:
    imputer.imputer_dict_

    The previous command returns a dictionary with the values to use to impute each variable:

    {'A2': 88.18,
     'A3': 27.31,
     'A8': 11.504999999999999,
     'A11': 12.0,
     'A14': 908.0,
     'A15': 1800.0}
  11. Finally, let’s replace the missing values:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Remember that you can corroborate that the missing values were replaced by using X_train[['A2','A3', 'A8', 'A11', 'A14', 'A15']].isnull().mean().

How it works...

In this recipe, we replaced missing values in numerical variables with a number at the end of the distribution using pandas and feature-engine.

We determined the imputation values according to the formulas described in the introduction to this recipe. We used pandas quantile() to find specific quantile values, or pandas mean() and std() for the mean and standard deviation. With pandas fillna() we replaced the missing values.

To replace missing values with EndTailImputer() from feature-engine, we set imputation_method to iqr to calculate the values based on the IQR proximity rule. With tail set to right, the transformer found the imputation values at the right of the distribution. With fit(), the imputer learned and stored the values for the imputation in a dictionary in the imputer_dict_ attribute. With transform(), we replaced the missing values, returning DataFrames.
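For comparison, the following sketch sets up the Gaussian variant of the imputer, replacing missing data with values at the left tail of the distribution (the mean minus 3 times the standard deviation). It assumes the same train set used in this recipe and uses a new variable name, imputer_gauss, so as not to overwrite the imputer above:

    imputer_gauss = EndTailImputer(
        imputation_method="gaussian",
        tail="left",
        fold=3,
    )

    # Learns mean - 3 * std per numerical variable and uses it for the imputation.
    imputer_gauss.fit(X_train)
    X_train_g = imputer_gauss.transform(X_train)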

Marking imputed values

In the previous recipes, we focused on replacing missing data with estimates of their values. In addition, we can add missing indicators to mark observations where values were missing.

A missing indicator is a binary variable that takes the value 1 or True to indicate whether a value was missing, and 0 or False otherwise. It is common practice to replace missing observations with the mean, median, or most frequent category while simultaneously marking those missing observations with missing indicators. In this recipe, we will learn how to add missing indicators using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by making some imports and loading the data:

  1. Let’s import the required libraries, functions, and classes:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from feature_engine.imputation import (
        AddMissingIndicator,
        CategoricalImputer,
        MeanMedianImputer
    )
  2. Let’s load and split the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the variable names in a list:
    varnames = ["A1", "A3", "A4", "A5", "A6", "A7", "A8"]
  4. Let’s create names for the missing indicators and store them in a list:
    indicators = [f"{var}_na" for var in varnames]

    If we execute indicators, we will see the names we will use for the new variables: ['A1_na', 'A3_na', 'A4_na', 'A5_na', 'A6_na', 'A7_na', 'A8_na'].

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Let’s add the missing indicators:
    X_train_t[indicators] = X_train[
        varnames].isna().astype(int)
    X_test_t[indicators] = X_test[
        varnames].isna().astype(int)

Note

If you want the indicators to have True and False as values instead of 0 and 1, remove astype(int) in step 6.

  7. Let’s inspect the resulting DataFrame:
    X_train_t.head()

    We can see the newly added variables at the right of the DataFrame in the following image:

Figure 1.4 – DataFrame with the missing indicators

Now, let’s add missing indicators using Feature-engine instead.

  8. Set up the imputer to add binary indicators to every variable with missing data:
    imputer = AddMissingIndicator(
        variables=None, missing_only=True
        )
  9. Fit the imputer to the train set so that it finds the variables with missing data:
    imputer.fit(X_train)

Note

If we execute imputer.variables_, we will find the variables for which missing indicators will be added.

  10. Finally, let’s add the missing indicators:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    So far, we just added missing indicators. But we still have the missing data in our variables. We need to replace them with numbers. In the rest of this recipe, we will combine the use of missing indicators with mean and mode imputation.

  11. Let’s create a pipeline to add missing indicators to categorical and numerical variables, then impute categorical variables with the most frequent category, and numerical variables with the mean:
    pipe = Pipeline([
        ("indicators",
            AddMissingIndicator(missing_only=True)),
        ("categorical", CategoricalImputer(
            imputation_method="frequent")),
        ("numerical", MeanMedianImputer()),
    ])

Note

feature-engine imputers automatically identify numerical or categorical variables. So there is no need to slice the data or pass the variable names as arguments to the transformers in this case.

  12. Let’s add the indicators and impute missing values:
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)

Note

Use X_train_t.isnull().sum() to corroborate that there is no data missing. Execute X_train_t.head() to get a view of the resulting DataFrame.

Finally, let’s add missing indicators and simultaneously impute numerical and categorical variables with the mean and most frequent categories respectively, utilizing scikit-learn.

  13. Let’s make a list with the names of the numerical and categorical variables:
    numvars = X_train.select_dtypes(
        exclude="O").columns.to_list()
    catvars = X_train.select_dtypes(
        include="O").columns.to_list()
  14. Let’s set up a pipeline to perform mean and frequent category imputation while marking the missing data:
    pipe = ColumnTransformer([
        ("num_imputer", SimpleImputer(
            strategy="mean",
            add_indicator=True),
        numvars),
        ("cat_imputer", SimpleImputer(
            strategy="most_frequent",
            add_indicator=True),
        catvars),
    ]).set_output(transform="pandas")
  15. Now, let’s carry out the imputation:
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)

Make sure to explore X_train_t.head() to get familiar with the pipeline’s output.

How it works...

To add missing indicators using pandas, we used isna(), which created a Boolean vector per variable, with True where a value was missing and False otherwise. We used astype(int) to convert the Boolean vectors into binary vectors with values 1 and 0.

To add a missing indicator with feature-engine, we used AddMissingIndicator(). With fit() the transformer found the variables with missing data. With transform() it appended the missing indicators to the right of the train and test sets.

To sequentially add missing indicators and then replace the nan values with the most frequent category or the mean, we lined up Feature-engine’s AddMissingIndicator(), CategoricalImputer(), and MeanMedianImputer() within a pipeline. The fit() method from the pipeline made the transformers find the variables with nan and calculate the mean of the numerical variables and the mode of the categorical variables. The transform() method from the pipeline made the transformers add the missing indicators and then replace the missing values with their estimates.

Note

Feature-engine transformations return DataFrames respecting the original names and order of the variables. Scikit-learn’s ColumnTransformer(), on the other hand, changes the variables’ names and order in the resulting data.

Finally, we added missing indicators and replaced missing data with the mean and most frequent category using scikit-learn. We lined up two instances of SimpleImputer(), the first to impute data with the mean and the second to impute data with the most frequent category. In both cases, we set the add_indicator parameter to True to add the missing indicators. We wrapped SimpleImputer() with ColumnTransformer() to specifically modify numerical or categorical variables. Then we used the fit() and transform() methods from the pipeline to train the transformers and then add the indicators and replace the missing data.

When returning DataFrames, ColumnTransformer() changes the names of the variables and their order. Take a look at the result from step 15 by executing X_train_t.head(). You’ll see that the name given to each step of the pipeline is added as a prefix to the variables to flag which variable was modified with each transformer. For example, num_imputer__A2 was returned by the first step of the pipeline, while cat_imputer__A12 was returned by the second step.

There’s more…

Scikit-learn has the MissingIndicator() transformer that just adds missing indicators. Check it out in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html and find an example in the accompanying GitHub repository at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/Recipe-06-Marking-imputed-values.ipynb.

Implementing forward and backward fill

Time series data can also contain missing values. To impute missing data in time series, we use specific methods. Forward fill imputation involves filling missing values in a dataset with the most recent non-missing value that precedes them in the data sequence. In other words, we carry the last seen value forward to the next valid value. Backward fill imputation involves filling missing values with the next non-missing value that follows them in the data sequence. In other words, we carry the next valid value backward to the missing observations that precede it.

In this recipe, we will replace missing data in a time series with forward and backward fills.

How to do it...

Let’s begin by importing the required libraries and time series dataset:

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the air passengers dataset that we described in the Technical requirements section and display the first five rows of the time series:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )
    print(df.head())

    We see the time series in the following output:

                    y
    ds
    1949-01-01  112.0
    1949-02-01  118.0
    1949-03-01  132.0
    1949-04-01  129.0
    1949-05-01  121.0

Note

You can determine the percentage of missing data by executing df.isnull().mean().

  3. Let’s plot the time series to spot any obvious data gaps:
    ax = df.plot(marker=".", figsize=[10, 5], legend=None)
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see intervals of time where data is missing:

Figure 1.5 – Time series data showing missing values

  4. Let’s impute missing data by carrying the last observed value in any interval to the next valid value:
    df_imputed = df.ffill()

    You can verify the absence of missing data by executing df_imputed.isnull().sum().

  5. Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see the values used to replace nan as dotted lines overlaid in between the continuous time series lines:

Figure 1.6 – Time series data where missing values were replaced by the last seen observations (dotted line)

  6. Alternatively, we can impute missing data using backward fill:
    df_imputed = df.bfill()

    If we plot the imputed dataset and overlay the imputation values as we did in step 5, we’ll see the following plot:

Figure 1.7 – Time series data where missing values were replaced by backward fill (dotted line)

Note

The heights of the values used in the imputation are different in Figures 1.6 and 1.7. In Figure 1.6, we carry the last value forward, hence the height is lower. In Figure 1.7, we carry the next value backward, hence the height is higher.

We’ve now obtained complete datasets that we can use for time series analysis and modeling.

How it works...

pandas ffill() takes the last seen value in any temporal gap in a time series and propagates it forward to the next observed value. Hence, in Figure 1.6 we see the dotted overlay corresponding to the imputation values at the height of the last seen observation.

pandas bfill() takes the next valid value in any temporal gap in a time series and propagates it backward to the previously observed value. Hence, in Figure 1.7 we see the dotted overlay corresponding to the imputation values at the height of the next observation in the gap.

By default, ffill() and bfill() will impute all values between valid observations. We can restrict the imputation to a maximum number of data points in any interval by setting a limit, using the limit parameter in both methods. For example, ffill(limit=10) will only replace the first 10 data points in any gap.
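The following toy example (not based on the air passengers dataset) illustrates the effect of the limit parameter:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

    # Only the first two nan after a valid value are filled forward;
    # the third one remains missing.
    print(s.ffill(limit=2))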

Carrying out interpolation

We can impute missing data in time series by using interpolation between two non-missing data points. Interpolation is the estimation of one or more values in a range by means of a function. In linear interpolation, we fit a linear function between the last observed value and the next valid point. In spline interpolation, we fit a low-degree polynomial between the last and next observed values. The idea of using interpolation is to obtain better estimates of the missing data.

In this recipe, we’ll carry out linear and spline interpolation in a time series.

How to do it...

Let’s begin by importing the required libraries and time series dataset.

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the time series data described in the Technical requirements section:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )

Note

You can plot the time series to find data gaps as we did in step 3 of the Implementing forward and backward fill recipe.

  3. Let’s impute the missing data by linear interpolation:
    df_imputed = df.interpolate(method="linear")

Note

If the time intervals between rows are not uniform, the method should be set to time to achieve a linear fit.

You can verify the absence of missing data by executing df_imputed.isnull().sum().

  4. Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see the values used to replace nan as dotted lines in between the continuous line of the time series:

Figure 1.8 – Time series data where missing values were replaced by linear interpolation between the last and next valid data points (dotted line)

  5. Alternatively, we can impute missing data by doing spline interpolation. We’ll use a polynomial of the second degree:
    df_imputed = df.interpolate(method="spline", order=2)

    If we plot the imputed dataset and overlay the imputation values as we did in step 4, we’ll see the following plot:

Figure 1.9 – Time series data where missing values were replaced by fitting a second-degree polynomial between the last and next valid data points (dotted line)

Note

Change the degree of the polynomial used in the interpolation to see how the replacement values vary.

We’ve now obtained complete datasets that we can use for analysis and modeling.

How it works...

pandas interpolate() fills missing values in a range by using an interpolation method. When we set the method to linear, interpolate() treats all data points as equidistant and fits a line between the last and next valid points in an interval with missing data.

Note

If you want to perform linear interpolation, but your data points are not equally distanced, set method to time.

We then performed spline interpolation with a second-degree polynomial by setting method to spline and order to 2.

pandas interpolate() uses scipy.interpolate.interp1d and scipy.interpolate.UnivariateSpline under the hood, and can therefore implement other interpolation methods. Check out pandas documentation for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html.
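As a small illustration of two of these alternative methods (a toy series, not the air passengers data; both methods require SciPy to be installed):

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan, 10.0, np.nan, 4.0])

    # Nearest interpolation copies the value of the closest valid data point.
    print(s.interpolate(method="nearest"))

    # Polynomial interpolation fits a second-degree polynomial through the data.
    print(s.interpolate(method="polynomial", order=2))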

See also

While interpolation aims to get better estimates of the missing data compared to forward and backward fill, these estimates may still not be accurate if the time series shows strong trend and seasonality. To obtain better estimates of the missing data in these types of time series, check out time series decomposition followed by interpolation in the Feature Engineering for Time Series Course at https://www.trainindata.com/p/feature-engineering-for-forecasting.

Performing multivariate imputation by chained equations

Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables in the dataset. The output of that function is used to replace missing data.

MICE involves the following steps:

  1. First, it performs a simple univariate imputation to every variable with missing data. For example, median imputation.
  2. Next, it selects one specific variable, say, var_1, and sets the missing values back to missing.
  3. It trains a model to predict var_1 using the other variables as input features.
  4. Finally, it replaces the missing values of var_1 with the output of the model.

MICE repeats steps 2 to 4 for each of the remaining variables.

An imputation cycle concludes once all the variables have been modeled. MICE carries out multiple imputation cycles, typically 10. That is, we repeat steps 2 to 4 for each variable 10 times. The idea is that by the end of the cycles, we should have found the best possible estimates of the missing data for each variable.

Note

Multivariate imputation can be a useful alternative to univariate imputation in situations where we don’t want to distort the variable distributions. Multivariate imputation is also useful when we are interested in having good estimates of the missing data.

In this recipe, we will implement MICE using scikit-learn.

How to do it...

To begin the recipe, let’s import the required libraries and load the data:

  1. Let’s import the required Python libraries, classes, and functions:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import BayesianRidge
    from sklearn.experimental import (
        enable_iterative_imputer
    )
    from sklearn.impute import (
        IterativeImputer,
        SimpleImputer
    )
  2. Let’s load some numerical variables from the dataset described in the Technical requirements section:
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables)
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s create a MICE imputer using Bayes regression, specifying the number of iteration cycles and setting random_state for reproducibility:
    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        random_state=0,
    ).set_output(transform="pandas")

Note

IterativeImputer() contains other useful arguments. For example, we can specify the first imputation strategy using the initial_strategy parameter. We can choose from the mean, median, mode, or arbitrary imputation. We can also specify how we want to cycle over the variables, either randomly or from the one with the fewest missing values to the one with the most.
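
For instance, the following snippet is a sketch of how these arguments could be set; the particular values are only illustrative and are not the settings used in this recipe:

    # Assumes the imports from step 1 of this recipe
    imputer_alt = IterativeImputer(
        estimator=BayesianRidge(),
        initial_strategy="median",       # first univariate pass uses the median
        imputation_order="ascending",    # impute from fewest to most missing values
        max_iter=10,
        random_state=0,
    ).set_output(transform="pandas")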

  1. Let’s fit IterativeImputer() so that it trains the estimators to predict the missing values in each variable:
    imputer.fit(X_train)

Note

We can use any regression model to estimate the missing data with IterativeImputer().

  1. Finally, let’s fill in the missing values in both the train and test sets:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

Note

To corroborate the lack of missing data, we can execute X_train_t.isnull().sum().

To wrap up the recipe, let’s impute the variables with a simple univariate imputation method and compare the effect on the variables’ distribution.

  1. Let’s set up scikit-learn’s SimpleImputer() to perform mean imputation, and then transform the datasets:
    imputer_simple = SimpleImputer(
        strategy="mean").set_output(transform="pandas")
    X_train_s = imputer_simple.fit_transform(X_train)
    X_test_s = imputer_simple.transform(X_test)
  2. Let’s now make a histogram of the A3 variable after MICE imputation, followed by a histogram of the same variable after mean imputation:
    fig, axes = plt.subplots(
        2, 1, figsize=(10, 10), squeeze=False)
    X_test_t["A3"].hist(
        bins=50, ax=axes[0, 0], color="blue")
    X_test_s["A3"].hist(
        bins=50, ax=axes[1, 0], color="green")
    axes[0, 0].set_ylabel('Number of observations')
    axes[1, 0].set_ylabel('Number of observations')
    axes[0, 0].set_xlabel('A3')
    axes[1, 0].set_xlabel('A3')
    axes[0, 0].set_title('MICE')
    axes[1, 0].set_title('Mean imputation')
    plt.show()

    In the following plot, we see that mean imputation distorts the variable distribution, with more observations toward the mean value:

Figure 1.10 – Histogram of variable A3 after MICE imputation (top) or mean imputation (bottom), showing the distortion in the variable distribution caused by the latter

How it works...

In this recipe, we performed multivariate imputation using IterativeImputer() from scikit-learn. When we fit the model, IterativeImputer() carried out the steps that we described in the introduction of the recipe. That is, it imputed all variables with the mean. Then it selected one variable and set its missing values back to missing. And finally, it fitted a Bayes regressor to estimate that variable based on the others. It repeated this procedure for each variable. That was one cycle of imputation. We set it to repeat this process 10 times. By the end of this procedure, IterativeImputer() had one Bayes regressor trained to predict the values of each variable based on the other variables in the dataset. With transform(), it uses the predictions of these Bayes models to impute the missing data.

IterativeImputer() can only impute missing data in numerical variables, and only based on numerical variables. If you want to use categorical variables as input, you need to encode them first. However, keep in mind that it only carries out regression, so it is not suitable for estimating missing data in discrete or categorical variables.
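
If you do want to use categorical variables as predictors, a common workaround, shown in the following sketch with a small hypothetical DataFrame, is to one-hot encode them first and pass the encoded data to the imputer:

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
    from sklearn.impute import IterativeImputer

    # Hypothetical data: a numerical column with missing values and a
    # categorical column used only as a predictor
    df = pd.DataFrame({
        "income": [25.0, np.nan, 40.0, 38.0, np.nan, 52.0],
        "city": ["a", "b", "a", "b", "a", "b"],
    })
    X = pd.get_dummies(df, columns=["city"], dtype=float)
    X_t = IterativeImputer(random_state=0).set_output(
        transform="pandas").fit_transform(X)
    print(X_t)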

See also

To learn more about MICE, take a look at the following resources:

Estimating missing data with nearest neighbors

Imputation with K-Nearest Neighbors (KNN) estimates missing values from the values shown by each observation’s nearest neighbors, where similarity between observations is determined by a distance metric, such as the Euclidean distance. The missing value is replaced with the average of the neighbors’ values, weighted by their distance.

Consider the following dataset containing 4 variables (columns) and 11 observations (rows). We want to impute the dark value in the fifth row of the second variable. First, we find the row’s k nearest neighbors, where k=3 in our example; they are highlighted by the rectangular boxes (middle panel). Next, we take the average of the values shown by the closest neighbors for variable 2.

Figure 1.11 – Diagram showing a value to impute (dark box), the three closest rows to the value to impute (square boxes), and the values considered to take the average for the imputation

The value for the imputation is given by (value1 × w1 + value2 × w2 + value3 × w3) / (w1 + w2 + w3), where the weights w1, w2, and w3 are inversely proportional to the distances between each neighbor and the observation to impute, so closer neighbors contribute more to the estimate.
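
To make the arithmetic concrete, here is a minimal sketch with made-up neighbor values and distances (not taken from the dataset) that reproduces the distance-weighted average:

    import numpy as np

    # Hypothetical values of variable 2 shown by the three nearest neighbors
    values = np.array([2.0, 4.0, 10.0])
    # Hypothetical distances from those neighbors to the row we are imputing
    distances = np.array([1.0, 2.0, 4.0])

    weights = 1 / distances  # closer neighbors carry more weight
    imputed = (values * weights).sum() / weights.sum()
    print(imputed)  # (2*1 + 4*0.5 + 10*0.25) / (1 + 0.5 + 0.25) ≈ 3.71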

In this recipe, we will perform KNN imputation using scikit-learn.

How to do it...

To proceed with the recipe, let’s import the required libraries and prepare the data:

  1. Let’s import the required libraries, classes, and functions:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import KNNImputer
  2. Let’s load the dataset described in the Technical requirements section (only some numerical variables):
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables,
    )
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s set up the imputer to replace missing data with the weighted mean of its closest five neighbors:
    imputer = KNNImputer(
        n_neighbors=5, weights="distance",
    ).set_output(transform="pandas")

Note

The replacement values can be calculated as the uniform mean of the k nearest neighbors by setting weights to uniform, or as the weighted average, as we do in this recipe. The weight is based on the distance of the neighbor to the observation to impute; the nearest neighbors carry more weight.

  5. Find the nearest neighbors:
    imputer.fit(X_train)
  6. Replace the missing values with the weighted mean of the values shown by the neighbors:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

The result is a pandas DataFrame with the missing data replaced.

How it works...

In this recipe, we replaced missing data with the average value shown by each observation’s k-nearest neighbors. We set up KNNImputer() to find each observation’s five closest neighbors based on the Euclidean distance. The replacement values were estimated as the weighted average of the values shown by the five closest neighbors for the variable to impute. With transform(), the imputer calculated the replacement value and replaced the missing data.


Key benefits

  • Craft powerful features from tabular, transactional, and time-series data
  • Develop efficient and reproducible real-world feature engineering pipelines
  • Optimize data transformation and save valuable time
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Streamline data preprocessing and feature engineering in your machine learning project with this third edition of the Python Feature Engineering Cookbook to make your data preparation more efficient. This guide addresses common challenges, such as imputing missing values and encoding categorical variables using practical solutions and open source Python libraries. You’ll learn advanced techniques for transforming numerical variables, discretizing variables, and dealing with outliers. Each chapter offers step-by-step instructions and real-world examples, helping you understand when and how to apply various transformations for well-prepared data. The book explores feature extraction from complex data types such as dates, times, and text. You’ll see how to create new features through mathematical operations and decision trees and use advanced tools like Featuretools and tsfresh to extract features from relational data and time series. By the end, you’ll be ready to build reproducible feature engineering pipelines that can be easily deployed into production, optimizing data preprocessing workflows and enhancing machine learning model performance.

Who is this book for?

If you're a machine learning or data science enthusiast who wants to learn more about feature engineering, data preprocessing, and how to optimize these tasks, this book is for you. If you already know the basics of feature engineering and are looking to learn more advanced methods to craft powerful features, this book will help you. You should have basic knowledge of Python programming and machine learning to get started.

What you will learn

  • Discover multiple methods to impute missing data effectively
  • Encode categorical variables while tackling high cardinality
  • Find out how to properly transform, discretize, and scale your variables
  • Automate feature extraction from date and time data
  • Combine variables strategically to create new and powerful features
  • Extract features from transactional data and time series
  • Learn methods to extract meaningful features from text data

Product Details

Publication date: Aug 30, 2024
Length: 396 pages
Edition: 3rd
Language: English
ISBN-13: 9781835883587

Table of Contents

Chapter 1: Imputing Missing Data
Chapter 2: Encoding Categorical Variables
Chapter 3: Transforming Numerical Variables
Chapter 4: Performing Variable Discretization
Chapter 5: Working with Outliers
Chapter 6: Extracting Features from Date and Time Variables
Chapter 7: Performing Feature Scaling
Chapter 8: Creating New Features
Chapter 9: Extracting Features from Relational Data with Featuretools
Chapter 10: Creating Features from a Time Series with tsfresh
Chapter 11: Extracting Features from Text Variables
Index
Other Books You May Enjoy