Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: A complete guide to crafting powerful features for your machine learning models

Product type: Paperback
Published in: Aug 2024
Publisher: Packt
ISBN-13: 9781835883587
Length: 396 pages
Edition: 3rd Edition
Author: Soledad Galli


Performing multivariate imputation by chained equations

Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables in the dataset. The output of that function is used to replace missing data.

MICE involves the following steps:

  1. First, it performs a simple univariate imputation, such as median imputation, on every variable with missing data.
  2. Next, it selects one specific variable, say, var_1, and sets its imputed values back to missing.
  3. It trains a model to predict var_1 using the other variables as input features.
  4. Finally, it replaces the missing values of var_1 with the output of the model.

MICE repeats steps 2 to 4 for each of the remaining variables.

An imputation cycle concludes once all the variables have been modeled. MICE carries out multiple imputation cycles, typically 10. That is, we repeat steps 2 to 4 for each variable 10 times. The idea is that by the end of the cycles, we should have found the best possible estimates of the missing data for each variable.
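The steps above can be sketched by hand to make the loop concrete. The following is a minimal illustration on synthetic data, not the scikit-learn implementation: the function name mice_sketch, the use of LinearRegression, and the toy array are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def mice_sketch(X, n_cycles=10):
    """Toy MICE loop for a 2D float array with NaNs (illustration only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: simple univariate imputation (here, the column median).
    filled = X.copy()
    medians = np.nanmedian(X, axis=0)
    for j in range(X.shape[1]):
        filled[missing[:, j], j] = medians[j]

    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            # Step 2: treat the originally missing entries of column j as unknown.
            observed = ~missing[:, j]
            others = np.delete(filled, j, axis=1)
            # Step 3: model column j from the other (currently filled) columns.
            model = LinearRegression().fit(
                others[observed], filled[observed, j])
            # Step 4: replace the missing entries with the model's predictions.
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


# Synthetic data (hypothetical): x2 is roughly 2 * x1, with 20 values removed.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])
X[rng.choice(200, size=20, replace=False), 1] = np.nan

X_imputed = mice_sketch(X)
```

Because x2 depends strongly on x1 in this toy dataset, the imputed values should land close to 2 * x1, whereas the initial median fill would place them all at a single value.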

Note

Multivariate imputation can be a useful alternative to univariate imputation in situations where we don’t want to distort the variable distributions. Multivariate imputation is also useful when we are interested in having good estimates of the missing data.

In this recipe, we will implement MICE using scikit-learn.

How to do it...

To begin the recipe, let’s import the required libraries and load the data:

  1. Let’s import the required Python libraries, classes, and functions:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import BayesianRidge
    from sklearn.experimental import (
        enable_iterative_imputer
    )
    from sklearn.impute import (
        IterativeImputer,
        SimpleImputer
    )
  2. Let’s load some numerical variables from the dataset described in the Technical requirements section:
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables)
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s create a MICE imputer using Bayesian ridge regression, specifying the maximum number of imputation cycles and setting random_state for reproducibility:
    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        random_state=0,
    ).set_output(transform="pandas")

Note

IterativeImputer() contains other useful arguments. For example, we can specify the first imputation strategy using the initial_strategy parameter, choosing from mean ("mean"), median ("median"), mode ("most_frequent"), or arbitrary-value ("constant") imputation. We can also specify how to cycle over the variables with the imputation_order parameter, for example, randomly or from the one with the fewest missing values to the one with the most.

  5. Let’s fit IterativeImputer() so that it trains the estimators to predict the missing values in each variable:
    imputer.fit(X_train)

Note

We can use any regression model to estimate the missing data with IterativeImputer().

  6. Finally, let’s fill in the missing values in both the train and test sets:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

Note

To corroborate the lack of missing data, we can execute X_train_t.isnull().sum().

To wrap up the recipe, let’s impute the variables with a simple univariate imputation method and compare the effect on the variables’ distribution.

  7. Let’s set up scikit-learn’s SimpleImputer() to perform mean imputation, and then transform the datasets:
    imputer_simple = SimpleImputer(
        strategy="mean").set_output(transform="pandas")
    X_train_s = imputer_simple.fit_transform(X_train)
    X_test_s = imputer_simple.transform(X_test)
  8. Let’s now make a histogram of the A3 variable after MICE imputation, followed by a histogram of the same variable after mean imputation:
    fig, axes = plt.subplots(
        2, 1, figsize=(10, 10), squeeze=False)
    X_test_t["A3"].hist(
        bins=50, ax=axes[0, 0], color="blue")
    X_test_s["A3"].hist(
        bins=50, ax=axes[1, 0], color="green")
    axes[0, 0].set_ylabel('Number of observations')
    axes[1, 0].set_ylabel('Number of observations')
    axes[0, 0].set_xlabel('A3')
    axes[1, 0].set_xlabel('A3')
    axes[0, 0].set_title('MICE')
    axes[1, 0].set_title('Mean imputation')
    plt.show()

    In the following plot, we see that mean imputation distorts the variable distribution, concentrating more observations at the mean value:

Figure 1.10 – Histogram of variable A3 after MICE imputation (top) or mean imputation (bottom), showing the distortion in the variable distribution caused by the latter

How it works...

In this recipe, we performed multivariate imputation using IterativeImputer() from scikit-learn. When we fitted the imputer, IterativeImputer() carried out the steps that we described in the introduction of the recipe. That is, it first imputed all variables with the mean, which is the default initial strategy. Then, it selected one variable and set its missing values back to missing. Finally, it fitted a Bayesian ridge regressor to estimate that variable based on the others. It repeated this procedure for each variable. That was one cycle of imputation. We set max_iter to 10, so it repeated this process up to 10 times; scikit-learn can stop earlier if the imputed values change by less than the tolerance set with the tol parameter. By the end of this procedure, IterativeImputer() had stored the Bayesian ridge regressors trained to predict the values of each variable based on the other variables in the dataset. With transform(), it used the predictions of these models to impute the missing data.
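To see what the fitted imputer holds, we can inspect a few of its attributes. The sketch below uses a synthetic DataFrame (an assumption for illustration) and reads n_iter_, n_features_with_missing_, and imputation_sequence_, which scikit-learn documents as fitted attributes of IterativeImputer().

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data (hypothetical): three columns, each with 8 NaNs.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(80, 3)), columns=["a", "b", "c"])
for col in toy.columns:
    toy.loc[rng.choice(80, 8, replace=False), col] = np.nan

imputer_fitted = IterativeImputer(
    estimator=BayesianRidge(), max_iter=10, random_state=0)
imputer_fitted.fit(toy)

# Number of imputation cycles actually run (at most max_iter).
n_cycles_run = imputer_fitted.n_iter_
# One entry per variable per cycle: (feat_idx, neighbor_feat_idx, estimator).
n_fitted = len(imputer_fitted.imputation_sequence_)
first = imputer_fitted.imputation_sequence_[0]
print(n_cycles_run, n_fitted, first.feat_idx,
      type(first.estimator).__name__)
```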

IterativeImputer() can only impute missing data in numerical variables based on numerical variables. If you want to use categorical variables as input, you need to encode them first. However, keep in mind that it will only carry out regression. Hence, it is not suitable for estimating missing data in discrete or categorical variables.

See also

To learn more about MICE, take a look at the following resources:
