Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Python Feature Engineering Cookbook

You're reading from   Python Feature Engineering Cookbook A complete guide to crafting powerful features for your machine learning models

Arrow left icon
Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781835883587
Length 396 pages
Edition 3rd Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Soledad Galli Soledad Galli
Author Profile Icon Soledad Galli
Soledad Galli
Arrow right icon
View More author details
Toc

Table of Contents (14) Chapters Close

Preface 1. Chapter 1: Imputing Missing Data 2. Chapter 2: Encoding Categorical Variables FREE CHAPTER 3. Chapter 3: Transforming Numerical Variables 4. Chapter 4: Performing Variable Discretization 5. Chapter 5: Working with Outliers 6. Chapter 6: Extracting Features from Date and Time Variables 7. Chapter 7: Performing Feature Scaling 8. Chapter 8: Creating New Features 9. Chapter 9: Extracting Features from Relational Data with Featuretools 10. Chapter 10: Creating Features from a Time Series with tsfresh 11. Chapter 11: Extracting Features from Text Variables 12. Index 13. Other Books You May Enjoy

Imputing categorical variables

We typically impute categorical variables with the most frequent category, or with a specific string. To avoid data leakage, we find the frequent categories from the train set. Then, we use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the frequent categories for the imputation, out of the box.

In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.

How to do it...

To begin, let’s make a few imports and prepare the data:

  1. Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import CategoricalImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets and their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s capture the categorical variables in a list:
    categorical_vars = X_train.select_dtypes(
        include="O").columns.to_list()
  5. Let’s store the variables’ most frequent categories in a dictionary:
    frequent_values = X_train[
        categorical_vars].mode().iloc[0].to_dict()
  6. Let’s replace missing values with the frequent categories:
    X_train_t = X_train.fillna(value=frequent_values)
    X_test_t = X_test.fillna(value=frequent_values)

Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).

  1. To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:
    imputation_dict = {var:
         "no_data" for var in categorical_vars}

    Now, we can use this dictionary and the code in step 6 to replace missing data.

Note

With pandas value_counts() we can see the string added by the imputation. Try executing, for example, X_train["A1"].value_counts().

Now, let’s impute missing values with the most frequent category using scikit-learn.

  1. Let’s set up the imputer to find the most frequent category per variable:
    imputer = SimpleImputer(strategy='most_frequent')

Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.

  1. Let’s restrict the imputation to the categorical variables:
    ct = ColumnTransformer(
        [("imputer",imputer, categorical_vars)],
        remainder="passthrough"
        ).set_output(transform="pandas")

Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").

  1. Fit the imputer to the train set so that it learns the most frequent values:
    ct.fit(X_train)
  2. Let’s take a look at the most frequent values learned by the imputer:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the most frequent values per variable:

    array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
  3. Finally, let’s replace missing values with the frequent categories:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)

    Make sure to inspect the resulting DataFrames by executing X_train_t.head().

Note

The ColumnTransformer() changes the names of the variables. The imputed variables show the prefix imputer and the untransformed variables the prefix remainder.

Finally, let’s impute missing values using feature-engine.

  1. Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
    imputer = CategoricalImputer(
        imputation_method="frequent",
        variables=categorical_vars,
    )

Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.

  1. Fit the imputer to the train set so that it learns the most frequent categories:
    imputer.fit(X_train)

Note

To impute categorical variables with a specific string, set imputation_method to missing and fill_value to the desired string.

  1. Let’s check out the learned categories:
    imputer.imputer_dict_

    We can see the dictionary with the most frequent values in the following output:

    {'A1': 'b',
     'A4': 'u',
     'A5': 'g',
     'A6': 'c',
     'A7': 'v',
     'A9': 't',
     'A10': 'f',
     'A12': 'f',
     'A13': 'g'}
  2. Finally, let’s replace the missing values with frequent categories:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.

How it works...

In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.

In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas mode(), and to return a dictionary, we used pandas to_dict(). To replace the missing data, we used pandas fillna(), passing the dictionary with the variables and their frequent categories as parameters. There can be more than one mode in a variable, that’s why we made sure to capture only one of those values by using .iloc[0].

To replace the missing values using scikit-learn, we used SimpleImputer() with the strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the training set as a result of the transform() method .

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.

SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() parameter.

To replace missing values with feature-engine, we used the CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.

Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless specifically told not to do so by setting the ignore_format parameter to True. In addition, with feature-engine transformers we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image