Python Feature Engineering Cookbook: A complete guide to crafting powerful features for your machine learning models

Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781835883587
Length 396 pages
Edition 3rd Edition
Author: Soledad Galli
Table of Contents: Preface; Chapter 1: Imputing Missing Data; Chapter 2: Encoding Categorical Variables; Chapter 3: Transforming Numerical Variables; Chapter 4: Performing Variable Discretization; Chapter 5: Working with Outliers; Chapter 6: Extracting Features from Date and Time Variables; Chapter 7: Performing Feature Scaling; Chapter 8: Creating New Features; Chapter 9: Extracting Features from Relational Data with Featuretools; Chapter 10: Creating Features from a Time Series with tsfresh; Chapter 11: Extracting Features from Text Variables; Index; Other Books You May Enjoy

Finding extreme values for imputation

Replacing missing values with a value at the end of the variable distribution (an extreme value) is similar to replacing them with an arbitrary value. The difference is that the value is not set manually but is selected automatically from the tail of the variable's distribution.

We can replace missing data with a value that is greater or smaller than most values in the variable. To select a value that is greater, we can use the mean plus a factor of the standard deviation. Alternatively, we can set it to the 75th quantile + IQR × 1.5. IQR stands for inter-quartile range and is the difference between the 75th and 25th quantiles. To replace missing data with a value that is smaller than most values, we can use the mean minus a factor of the standard deviation, or the 25th quantile – IQR × 1.5.
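As a quick illustration of these formulas, the sketch below computes the four candidate values for a toy variable. The series s and its values are made up for this example, and the factor of 3 for the standard deviation is only one common choice.

```python
import pandas as pd

# Toy variable with one missing entry (values are illustrative only).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])

# pandas skips missing values when computing quantiles, mean and std.
q25, q75 = s.quantile(0.25), s.quantile(0.75)
iqr = q75 - q25

# Candidate values at the right tail of the distribution.
right_iqr = q75 + 1.5 * iqr
right_gauss = s.mean() + 3 * s.std()

# Candidate values at the left tail of the distribution.
left_iqr = q25 - 1.5 * iqr
left_gauss = s.mean() - 3 * s.std()

print(right_iqr, right_gauss, left_iqr, left_gauss)
```

Here the quartiles are 2.0 and 4.0, so the IQR rule yields 7.0 for the right tail and -1.0 for the left tail.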

Note

End-of-tail imputation may distort the distribution of the original variables, so it may not be suitable for linear models.
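To get a feel for this distortion, the following minimal sketch compares the mean and standard deviation of a toy variable before and after end-of-tail imputation. The data is invented for illustration; the size of the shift on real data depends on the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Toy variable with two missing entries (illustrative values only).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan])

# Right-tail value from the IQR proximity rule.
q25, q75 = s.quantile(0.25), s.quantile(0.75)
value = q75 + 1.5 * (q75 - q25)

imputed = s.fillna(value)

# Compare summary statistics before and after imputation.
print("mean before:", s.mean(), "after:", imputed.mean())
print("std before:", s.std(), "after:", imputed.std())
```

Because every missing entry receives the same extreme value, both the mean and the spread increase in this example.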

In this recipe, we will implement end-of-tail or extreme value imputation using pandas and feature-engine.

How to do it...

To begin this recipe, let’s import the necessary tools and load the data:

  1. Let’s import pandas and the required function and class:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.imputation import EndTailImputer
  2. Let’s load the dataset we described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s capture the numerical variables in a list, excluding the target:
    numeric_vars = [
        var for var in data.select_dtypes(
            exclude="O").columns.to_list()
        if var != "target"
    ]
  4. Let’s split the data into train and test sets, keeping only the numerical variables:
    X_train, X_test, y_train, y_test = train_test_split(
        data[numeric_vars],
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  5. We’ll now determine the IQR:
    IQR = X_train.quantile(0.75) - X_train.quantile(0.25)

    We can display the IQR values by executing IQR or print(IQR):

    A2      16.4200
    A3       6.5825
    A8       2.8350
    A11      3.0000
    A14    212.0000
    A15    450.0000
    dtype: float64
  6. Let’s create a dictionary with the variable names and the imputation values:
    imputation_dict = (
        X_train.quantile(0.75) + 1.5 * IQR).to_dict()

Note

If we use the inter-quartile range proximity rule, we determine the imputation values by adding 1.5 times the IQR to the 75th quantile. If the variables are normally distributed, we can instead calculate the imputation values as the mean plus a factor of the standard deviation: imputation_dict = (X_train.mean() + 3 * X_train.std()).to_dict().

  7. Finally, let’s replace the missing data:
    X_train_t = X_train.fillna(value=imputation_dict)
    X_test_t = X_test.fillna(value=imputation_dict)

Note

We can also replace missing data with values at the left tail of the distribution using value = X_train[var].quantile(0.25) - 1.5 * IQR[var] or value = X_train[var].mean() - 3 * X_train[var].std().
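The left-tail variant can also be computed for all variables at once. The sketch below uses a small made-up DataFrame as a stand-in for X_train; the column names a and b and the name left_tail_dict are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train with one missing value per column.
X_train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0, 30.0],
})

# IQR per variable, computed on the observed values only.
IQR = X_train.quantile(0.75) - X_train.quantile(0.25)

# Left-tail imputation values: 25th quantile minus 1.5 times the IQR.
left_tail_dict = (X_train.quantile(0.25) - 1.5 * IQR).to_dict()

# fillna() maps each dictionary key to the column of the same name.
X_train_left = X_train.fillna(value=left_tail_dict)

print(left_tail_dict)
```

As with the right tail, the dictionary should be learned from the train set and then reused to fill the test set.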

To finish, let’s impute missing values using feature-engine.

  8. Let’s set up the imputer to estimate a value at the right tail of the distribution using the IQR proximity rule:
    imputer = EndTailImputer(
        imputation_method="iqr",
        tail="right",
        fold=3,
        variables=None,
    )

Note

To use the mean and standard deviation to calculate the replacement values, set imputation_method="gaussian". Use "left" or "right" in the tail argument to specify the side of the distribution to consider when finding values for the imputation.

  9. Let’s fit EndTailImputer() to the train set so that it learns the values for the imputation:
    imputer.fit(X_train)
  10. Let’s inspect the learned values:
    imputer.imputer_dict_

    The previous command returns a dictionary with the values to use to impute each variable:

    {'A2': 88.18,
     'A3': 27.31,
     'A8': 11.504999999999999,
     'A11': 12.0,
     'A14': 908.0,
     'A15': 1800.0}
  11. Finally, let’s replace the missing values:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Remember that you can confirm that the missing values were replaced by executing X_train[['A2','A3', 'A8', 'A11', 'A14', 'A15']].isnull().mean().

How it works...

In this recipe, we replaced missing values in numerical variables with a number at the end of the distribution using pandas and feature-engine.

We determined the imputation values according to the formulas described in the introduction to this recipe. We used pandas quantile() to find specific quantile values, or pandas mean() and std() for the mean and standard deviation. With pandas fillna() we replaced the missing values.

To replace missing values with EndTailImputer() from feature-engine, we set imputation_method to iqr to calculate the values based on the IQR proximity rule. With tail set to right, the transformer found the imputation values from the right of the distribution, and with fold set to 3, it added 3 times the IQR to the 75th quantile. With fit(), the imputer learned and stored the values for the imputation in a dictionary in the imputer_dict_ attribute. With transform(), we replaced the missing values, returning DataFrames.
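The learn-then-apply pattern described above can be sketched with pandas alone. The helper learn_end_tail_values below is hypothetical and written for illustration; it is not feature-engine's implementation, and the toy train and test frames are invented.

```python
import numpy as np
import pandas as pd


def learn_end_tail_values(X, method="iqr", tail="right", fold=3):
    """Return a {column: value} dict (hypothetical helper, not feature-engine code)."""
    if tail not in ("left", "right"):
        raise ValueError("tail must be 'left' or 'right'")
    sign = 1 if tail == "right" else -1
    if method == "iqr":
        q25, q75 = X.quantile(0.25), X.quantile(0.75)
        start = q75 if tail == "right" else q25
        values = start + sign * fold * (q75 - q25)
    elif method == "gaussian":
        values = X.mean() + sign * fold * X.std()
    else:
        raise ValueError("method must be 'iqr' or 'gaussian'")
    return values.to_dict()


# Toy train and test sets (illustrative values only).
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan]})
test = pd.DataFrame({"x": [np.nan, 100.0]})

# "fit": learn the values from the train set only.
values = learn_end_tail_values(train, method="iqr", tail="right", fold=3)

# "transform": apply the same learned values to both sets.
train_t = train.fillna(value=values)
test_t = test.fillna(value=values)

print(values)
```

Learning the values from the train set and reusing them on the test set keeps information from the test data out of the imputation step.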
