You're reading from Python Feature Engineering Cookbook A complete guide to crafting powerful features for your machine learning models

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781835883587

Length 396 pages

Edition 3rd Edition

Languages

Python

Tools

Combine

Concepts

Data Science

Author (1):

Soledad Galli

Table of Contents (14) Chapters

Preface

1. Chapter 1: Imputing Missing Data

2. Chapter 2: Encoding Categorical Variables

3. Chapter 3: Transforming Numerical Variables

4. Chapter 4: Performing Variable Discretization

5. Chapter 5: Working with Outliers

6. Chapter 6: Extracting Features from Date and Time Variables

7. Chapter 7: Performing Feature Scaling

8. Chapter 8: Creating New Features

9. Chapter 9: Extracting Features from Relational Data with Featuretools

10. Chapter 10: Creating Features from a Time Series with tsfresh

11. Chapter 11: Extracting Features from Text Variables

12. Index

13. Other Books You May Enjoy

Performing multivariate imputation by chained equations

Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables in the dataset. The output of that function is used to replace missing data.

MICE involves the following steps:

1. First, it performs a simple univariate imputation of every variable with missing data, for example, median imputation.
2. Next, it selects one specific variable, say, var_1, and sets its originally missing values back to missing.
3. It trains a model to predict var_1 using the other variables as input features.
4. Finally, it replaces the missing values of var_1 with the output of the model.

MICE repeats steps 2 to 4 for each of the remaining variables.

An imputation cycle concludes once all the variables have been modeled. MICE carries out multiple imputation cycles, typically 10. That is, we repeat steps 2 to 4 for each variable 10 times. The idea is that by the end of the cycles, we should have found the best possible estimates of the missing data for each variable.
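The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the `mice_sketch` function, the choice of `LinearRegression`, and the synthetic two-variable dataset are all hypothetical and only serve to show the chained-equations loop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def mice_sketch(X, n_cycles=10):
    """Illustrative chained-equations imputation for a 2D float array with NaNs."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)  # remember where the values were originally missing
    # Step 1: simple univariate imputation (median of the observed values).
    X_imp = np.where(missing, np.nanmedian(X, axis=0), X)
    cols_with_na = np.flatnonzero(missing.any(axis=0))
    for _ in range(n_cycles):  # one pass over all variables is one cycle
        for j in cols_with_na:
            rows_na = missing[:, j]  # Step 2: treat these entries as missing again
            others = np.delete(X_imp, j, axis=1)
            # Step 3: train a model on the rows where variable j was observed.
            model = LinearRegression().fit(others[~rows_na], X_imp[~rows_na, j])
            # Step 4: replace the missing entries with the model's predictions.
            X_imp[rows_na, j] = model.predict(others[rows_na])
    return X_imp


# Synthetic data (an assumption for illustration): x2 depends strongly on x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])
X[rng.choice(200, size=20, replace=False), 1] = np.nan  # hide some values of x2

X_filled = mice_sketch(X)
print(np.isnan(X_filled).sum())  # number of missing values left
```

Note that only the originally missing entries are overwritten; the observed values are left untouched throughout the cycles.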

Note

Multivariate imputation can be a useful alternative to univariate imputation in situations where we don’t want to distort the variable distributions. Multivariate imputation is also useful when we are interested in having good estimates of the missing data.

In this recipe, we will implement MICE using scikit-learn.

How to do it...

To begin the recipe, let’s import the required libraries and load the data:

Let’s import the required Python libraries, classes, and functions:

```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.experimental import (
    enable_iterative_imputer
)
from sklearn.impute import (
    IterativeImputer,
    SimpleImputer
)
```

Let’s load some numerical variables from the dataset described in the Technical requirements section:

```
variables = [
    "A2", "A3", "A8", "A11", "A14", "A15", "target"]
data = pd.read_csv(
    "credit_approval_uci.csv",
    usecols=variables)
```

Let’s divide the data into train and test sets:

```
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```

Let’s create a MICE imputer using Bayesian ridge regression, specifying the maximum number of imputation cycles and setting random_state for reproducibility:

```
imputer = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=10,
    random_state=0,
).set_output(transform="pandas")
```

Note

IterativeImputer() contains other useful arguments. For example, we can specify the first imputation strategy using the initial_strategy parameter, choosing from mean, median, mode (most_frequent), or arbitrary (constant) imputation. We can also specify how we want to cycle over the variables using the imputation_order parameter, for example, randomly or from the one with the fewest missing values to the one with the most.

Let’s fit IterativeImputer() so that it trains the estimators to predict the missing values in each variable:
```
imputer.fit(X_train)
```

Note

We can use any regression model to estimate the missing data with IterativeImputer().

Finally, let’s fill in the missing values in both the train and test sets:

```
X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)
```

Note

To corroborate the lack of missing data, we can execute X_train_t.isnull().sum().

To wrap up the recipe, let’s impute the variables with a simple univariate imputation method and compare the effect on the variables’ distribution.

Let’s set up scikit-learn’s SimpleImputer() to perform mean imputation, and then transform the datasets:

```
imputer_simple = SimpleImputer(
    strategy="mean").set_output(transform="pandas")
X_train_s = imputer_simple.fit_transform(X_train)
X_test_s = imputer_simple.transform(X_test)
```

Let’s now make a histogram of the A3 variable after MICE imputation, followed by a histogram of the same variable after mean imputation:

```
fig, axes = plt.subplots(
    2, 1, figsize=(10, 10), squeeze=False)
X_test_t["A3"].hist(
    bins=50, ax=axes[0, 0], color="blue")
X_test_s["A3"].hist(
    bins=50, ax=axes[1, 0], color="green")
axes[0, 0].set_ylabel('Number of observations')
axes[1, 0].set_ylabel('Number of observations')
axes[0, 0].set_xlabel('A3')
axes[1, 0].set_xlabel('A3')
axes[0, 0].set_title('MICE')
axes[1, 0].set_title('Mean imputation')
plt.show()
```

In the following plot, we see that mean imputation distorts the variable distribution, with more observations toward the mean value:

Figure 1.10 – Histogram of variable A3 after MICE imputation (top) and mean imputation (bottom), showing the distortion in the variable distribution caused by the latter

How it works...

In this recipe, we performed multivariate imputation using IterativeImputer() from scikit-learn. When we fit the imputer, IterativeImputer() carried out the steps that we described in the introduction of the recipe. That is, it first imputed all variables with the mean, which is the default initial strategy. Then, it selected one variable and set its missing values back to missing. Finally, it fitted a Bayesian ridge regressor to estimate that variable based on the others. It repeated this procedure for each variable. That was one cycle of imputation. We set it to repeat this process up to 10 times; with the default settings, it can stop earlier if the imputed values change by less than the tolerance set by the tol parameter. By the end of this procedure, IterativeImputer() had stored the Bayesian ridge regressors that it trained for each variable in each cycle. With transform(), it applies these regressors in the same order to impute the missing data.
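We can peek at what the fitted imputer keeps after training. The sketch below relies on the n_iter_ and imputation_sequence_ attributes that scikit-learn documents for IterativeImputer(); the `X_toy` array is made-up data used only for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Hypothetical toy array with roughly 10% of the values missing at random.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 3))
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

fitted = IterativeImputer(
    estimator=BayesianRidge(), max_iter=10, random_state=0,
).fit(X_toy)

# Number of imputation cycles that were actually carried out.
print("cycles run:", fitted.n_iter_)
# One fitted regressor is stored per modeled variable per cycle.
print("stored regressors:", len(fitted.imputation_sequence_))
for triplet in fitted.imputation_sequence_[:3]:
    # feat_idx is the index of the variable that this regressor predicts.
    print(triplet.feat_idx, type(triplet.estimator).__name__)
```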

IterativeImputer() can only impute missing data in numerical variables based on numerical variables. If you want to use categorical variables as input, you need to encode them first. However, keep in mind that it will only carry out regression. Hence, it is not suitable for estimating missing data in discrete or categorical variables.
