You're reading from Python Feature Engineering Cookbook A complete guide to crafting powerful features for your machine learning models

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781835883587

Length 396 pages

Edition 3rd Edition

Languages

Python

Tools

Combine

Concepts

Data Science

Author (1):

Soledad Galli

Table of Contents (14) Chapters

Preface

1. Chapter 1: Imputing Missing Data

2. Chapter 2: Encoding Categorical Variables

3. Chapter 3: Transforming Numerical Variables

4. Chapter 4: Performing Variable Discretization

5. Chapter 5: Working with Outliers

6. Chapter 6: Extracting Features from Date and Time Variables

7. Chapter 7: Performing Feature Scaling

8. Chapter 8: Creating New Features

9. Chapter 9: Extracting Features from Relational Data with Featuretools

10. Chapter 10: Creating Features from a Time Series with tsfresh

11. Chapter 11: Extracting Features from Text Variables

12. Index

13. Other Books You May Enjoy

Imputing categorical variables

We typically impute categorical variables with the most frequent category, or with a specific string. To avoid data leakage, we learn the most frequent categories from the train set only. Then, we use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the most frequent categories for the imputation out of the box.

In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.
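
As a minimal sketch of why the categories are learned from the train set only, the toy example below imputes a test set with the train mode, even though the test set has a different mode of its own. The data and the names toy_train, toy_test, and train_mode are made up for illustration and are not part of the recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the credit approval dataset.
toy_train = pd.DataFrame({"A1": ["b", "b", "a", np.nan]}, dtype=object)
toy_test = pd.DataFrame({"A1": ["a", "a", np.nan]}, dtype=object)

# Learn the most frequent category from the train set only.
train_mode = toy_train["A1"].mode().iloc[0]

# Reuse the learned value to impute both the train and the test set.
toy_train_t = toy_train.fillna(value={"A1": train_mode})
toy_test_t = toy_test.fillna(value={"A1": train_mode})

print(train_mode)
print(toy_test_t["A1"].tolist())
```

In this sketch, the missing value in the test set is filled with b, the mode of the train set, rather than with a, the mode of the test set itself.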

How to do it...

To begin, let’s make a few imports and prepare the data:

Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:

```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from feature_engine.imputation import CategoricalImputer
```

Let’s load the dataset that we prepared in the Technical requirements section:
```
data = pd.read_csv("credit_approval_uci.csv")
```

Let’s split the data into train and test sets and their respective targets:

```
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```

Let’s capture the categorical variables in a list:

```
categorical_vars = X_train.select_dtypes(
    include="O").columns.to_list()
```

Let’s store the variables’ most frequent categories in a dictionary:

```
frequent_values = X_train[
    categorical_vars].mode().iloc[0].to_dict()
```

Let’s replace missing values with the frequent categories:

```
X_train_t = X_train.fillna(value=frequent_values)
X_test_t = X_test.fillna(value=frequent_values)
```

Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).

To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:
```
imputation_dict = {var: "no_data" for var in categorical_vars}
```
Now, we can use this dictionary and the code in step 6 to replace missing data.

Note

With pandas value_counts() we can see the string added by the imputation. Try executing, for example, X_train_t["A1"].value_counts() on the imputed DataFrame.
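
The string imputation and the value_counts() check can be sketched on a toy DataFrame as follows. The data and the names X_toy and X_toy_t are hypothetical and only stand in for the train set of the recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in A1.
X_toy = pd.DataFrame({"A1": ["a", "b", np.nan, "b", np.nan]}, dtype=object)

# Same pattern as in the recipe: one arbitrary string per variable.
imputation_dict = {var: "no_data" for var in ["A1"]}

# fillna() returns a new DataFrame; X_toy itself is left unchanged.
X_toy_t = X_toy.fillna(value=imputation_dict)

# The arbitrary string now appears as a category of its own.
counts = X_toy_t["A1"].value_counts()
print(counts)
```

Because the result is stored in a new DataFrame, value_counts() has to be called on the imputed copy to see the no_data category.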

Now, let’s impute missing values with the most frequent category using scikit-learn.

Let’s set up the imputer to find the most frequent category per variable:
```
imputer = SimpleImputer(strategy='most_frequent')
```

Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.
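
To illustrate the previous note, here is a small sketch, assuming a made-up DataFrame with one categorical and one numerical column, in which SimpleImputer() learns a mode for both. The names toy and mode_imputer are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical and one numerical variable.
toy = pd.DataFrame({
    "city": pd.Series(["Lima", "Rome", "Lima", np.nan], dtype=object),
    "rooms": [2.0, 2.0, 3.0, np.nan],
})

# Without a ColumnTransformer, the mode is learned for every column.
mode_imputer = SimpleImputer(strategy="most_frequent")
mode_imputer.fit(toy)

# Pair each column name with the value learned for it.
learned = dict(zip(toy.columns, mode_imputer.statistics_))
print(learned)
```

This is the reason the recipe wraps the imputer in ColumnTransformer(): it keeps mode imputation away from the numerical variables.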

Let’s restrict the imputation to the categorical variables:

```
ct = ColumnTransformer(
    [("imputer", imputer, categorical_vars)],
    remainder="passthrough",
).set_output(transform="pandas")
```

Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").

Fit the imputer to the train set so that it learns the most frequent values:
```
ct.fit(X_train)
```
Let’s take a look at the most frequent values learned by the imputer:
```
ct.named_transformers_.imputer.statistics_
```
The previous command returns the most frequent values per variable:
```
array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
```
Finally, let’s replace missing values with the frequent categories:
```
X_train_t = ct.transform(X_train)
X_test_t = ct.transform(X_test)
```
Make sure to inspect the resulting DataFrames by executing X_train_t.head().

Note

ColumnTransformer() changes the names of the variables. The imputed variables get the prefix imputer__ (the transformer name followed by a double underscore) and the untransformed variables get the prefix remainder__.

Finally, let’s impute missing values using feature-engine.

Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
```
imputer = CategoricalImputer(
    imputation_method="frequent",
    variables=categorical_vars,
)
```

Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.

Fit the imputer to the train set so that it learns the most frequent categories:
```
imputer.fit(X_train)
```

Note

To impute categorical variables with a specific string, set imputation_method to "missing" and fill_value to the desired string.

Let’s check out the learned categories:

```
imputer.imputer_dict_
```

We can see the dictionary with the most frequent values in the following output:

```
{'A1': 'b',
 'A4': 'u',
 'A5': 'g',
 'A6': 'c',
 'A7': 'v',
 'A9': 't',
 'A10': 'f',
 'A12': 'f',
 'A13': 'g'}
```

Finally, let’s replace the missing values with frequent categories:
```
X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)
```
If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.

How it works...

In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.

In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas mode(), and to return a dictionary, we used pandas to_dict(). To replace the missing data, we used pandas fillna(), passing the dictionary with the variables and their frequent categories as the value argument. A variable can have more than one mode, in which case mode() returns one row per mode; that’s why we made sure to capture only the first of those values by using .iloc[0].
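
The effect of tied modes can be sketched with a toy DataFrame in which one variable has two equally frequent categories. The data and the name toy_train are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "color" has two modes, "size" has one.
toy_train = pd.DataFrame(
    {
        "color": ["red", "blue", "red", "blue", np.nan],
        "size": ["S", "M", "M", np.nan, "M"],
    },
    dtype=object,
)

# mode() returns one row per tied mode and pads the other columns with NaN.
modes = toy_train.mode()
print(modes)

# Keeping only the first row yields exactly one category per variable.
frequent_values = modes.iloc[0].to_dict()
toy_train_t = toy_train.fillna(value=frequent_values)
print(frequent_values)
```

Since mode() sorts the tied values, .iloc[0] selects the first one in sorted order, which is a convention rather than a statistically motivated choice.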

To replace the missing values using scikit-learn, we used SimpleImputer() with the strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the training set as a result of the transform() method.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer__ and the unchanged variables show the prefix remainder__.

With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.

SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() method, for example, by calling set_output(transform="pandas").

To replace missing values with feature-engine, we used the CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.

Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless specifically told not to do so by setting the ignore_format parameter to True. In addition, with feature-engine transformers we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.
