Performing ordinal encoding based on the target value
In the previous recipe, we replaced categories with integers that were assigned arbitrarily. We can also assign the integers based on the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign digits to the ordered categories, starting with 0 for the first category and ending with k-1 for the last category, where k is the number of distinct categories.
This encoding method creates a monotonic relationship between the categorical variable and the response, which makes the variable better suited for use in linear models.
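The procedure can be illustrated with a small sketch on toy data (the values below are hypothetical and are not taken from the credit approval dataset used in this recipe):

```python
import pandas as pd

# Toy categorical variable and binary target (hypothetical values).
X = pd.Series(["a", "b", "a", "c", "b", "c", "c"], name="var")
y = pd.Series([0, 1, 0, 1, 0, 1, 1], name="target")

# Mean target per category, ordered from lowest to highest.
ordered = y.groupby(X).mean().sort_values().index

# Assign 0 to k-1 following that order.
mapping = {category: i for i, category in enumerate(ordered)}
print(mapping)  # {'a': 0, 'b': 1, 'c': 2}
```

Here, category a has a mean target of 0, b of 0.5, and c of 1, so the assigned integers increase with the target mean.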
In this recipe, we will encode categories according to the target mean value using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the required Python libraries, functions, and classes:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s determine the mean target value per category in A7, then sort the categories from the one with the lowest to the one with the highest target mean value:
y_train.groupby(X_train["A7"]).mean().sort_values()
The following is the output of the preceding command:
A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
- Now, let’s repeat the computation in step 3, but this time, let’s retain the ordered category names:
ordered_labels = y_train.groupby(
    X_train["A7"]).mean().sort_values().index
To display the output of the preceding command, we can execute print(ordered_labels):
Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')
- Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in step 4:
ordinal_mapping = {
    k: i for i, k in enumerate(
        ordered_labels, 0)
}
We can visualize the result of the preceding code by executing print(ordinal_mapping):
{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
- Let’s use the dictionary we created in step 5 to replace the categories in A7 in the train and test sets, returning the encoded features as new columns:
X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)
Tip
Note that if the test set contains a category not present in the train set, the preceding code will introduce np.nan.
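One way to guard against such unseen categories is to fill the resulting NaN values with a sentinel value after mapping. The following is a minimal sketch on toy data (the mapping and values are hypothetical):

```python
import pandas as pd

# Mapping learned from a hypothetical train set.
ordinal_mapping = {"a": 0, "b": 1, "c": 2}

# Test column with a category, "d", that was unseen during training.
test_col = pd.Series(["a", "d", "c"])

encoded = test_col.map(ordinal_mapping)
print(encoded.isna().sum())  # 1 -> the unseen category became NaN

# One option: replace NaN with a sentinel value such as -1.
encoded = encoded.fillna(-1).astype(int)
print(encoded.tolist())  # [0, -1, 2]
```

Whether a sentinel, the most frequent code, or an error is the right choice depends on the downstream model.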
To better understand the monotonic relationship concept, let’s plot the relationship between the categories of the A7 variable and the target, before and after the encoding.
- Let’s plot the mean target response per category of the A7 variable:
y_train.groupby(X_train["A7"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
We can see the non-monotonic relationship between the categories of A7 and the target in the following plot:
Figure 2.7 – Relationship between the categories of A7 and the target
- Let’s plot the mean target value per category in the encoded variable:
y_train.groupby(X_train["A7_enc"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
The encoded variable shows a monotonic relationship with the target: the higher the digit assigned to the category, the higher the mean target value:
Figure 2.8 – Relationship between A7 and the target after the encoding
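Instead of inspecting the plot visually, the monotonic relationship can also be checked numerically with pandas’ is_monotonic_increasing property. The following sketch uses toy encoded values (hypothetical, not the recipe’s data):

```python
import pandas as pd

# Toy encoded variable and binary target (hypothetical values).
X_enc = pd.Series([0, 1, 0, 2, 1, 2, 2], name="var_enc")
y = pd.Series([0, 1, 0, 1, 0, 1, 1], name="target")

# Mean target per encoded value, sorted by the integer code.
means = y.groupby(X_enc).mean().sort_index()
print(means.is_monotonic_increasing)  # True for target-ordered codes
```

When the codes were assigned from the target mean, the mean target per code is nondecreasing by construction on the train set.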
Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
- Let’s import the encoder:
from feature_engine.encoding import OrdinalEncoder
- Next, let’s set up the encoder so that it assigns integers according to the target value to all categorical variables in the dataset:
ordinal_enc = OrdinalEncoder(
    encoding_method="ordered",
    variables=None)
Tip
OrdinalEncoder() will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the variables argument.
- Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category and integer mappings:
ordinal_enc.fit(X_train, y_train)
Tip
When fitting the encoder, we need to pass the train set and the target, like with many scikit-learn predictor classes.
- Finally, let’s replace the categories with numbers in the train and test sets:
X_train_enc = ordinal_enc.transform(X_train)
X_test_enc = ordinal_enc.transform(X_test)
Tip
A list of the categorical variables is stored in the variables_ attribute of OrdinalEncoder(), and the dictionaries with the category-to-integer mappings are stored in the encoder_dict_ attribute.
Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in step 7 and changing the variable name in the groupby() method.
How it works...
In this recipe, we replaced the categories with integers according to the target mean.
In the first part of this recipe, we worked with the A7 categorical variable. With pandas groupby(), we grouped the data based on the categories of A7, and with pandas mean(), we determined the mean value of the target for each category of A7. Next, we ordered the categories with pandas sort_values(), from the one with the lowest to the one with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target means as values. With pandas index, we captured the ordered categories in an array; then, with a Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the categories with integers in the train and test sets using pandas map().
Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of A7 using pandas groupby(), followed by pandas mean(), as described in the preceding paragraph. We followed up with pandas plot() to create a plot of category versus target mean value. We added a title and y label with Matplotlib’s title() and ylabel() methods.
To perform the encoding with Feature-engine, we used OrdinalEncoder() and indicated "ordered" in the encoding_method argument. We left the variables argument set to None so that the encoder automatically detects all categorical variables in the dataset. With the fit() method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and the dictionaries with category-to-digit pairs were stored in the variables_ and encoder_dict_ attributes, respectively. Finally, using the transform() method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.
See also
For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.