Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type: Paperback
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781804611302
Length: 386 pages
Edition: 2nd Edition
Author: Soledad Galli

Table of Contents (14 chapters)

Preface
Chapter 1: Imputing Missing Data
Chapter 2: Encoding Categorical Variables
Chapter 3: Transforming Numerical Variables
Chapter 4: Performing Variable Discretization
Chapter 5: Working with Outliers
Chapter 6: Extracting Features from Date and Time Variables
Chapter 7: Performing Feature Scaling
Chapter 8: Creating New Features
Chapter 9: Extracting Features from Relational Data with Featuretools
Chapter 10: Creating Features from a Time Series with tsfresh
Chapter 11: Extracting Features from Text Variables
Index
Other Books You May Enjoy

Performing ordinal encoding based on the target value

In the previous recipe, we replaced categories with integers that were assigned arbitrarily. We can also assign integers to the categories based on the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest mean target value. Finally, we must assign digits to the ordered categories, starting with 0 for the first category and ending with k-1 for the last, where k is the number of distinct categories.
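The three steps above can be sketched on a toy dataset (the column names and values here are made up for illustration, not taken from the book's dataset):

```python
import pandas as pd

# Hypothetical data: one categorical variable and a binary target.
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "c", "c"],
    "target": [0, 1, 1, 1, 0, 0],
})

# Step 1: mean target per category; Step 2: sort ascending by that mean.
ordered = df.groupby("city")["target"].mean().sort_values().index

# Step 3: assign 0 .. k-1 following the order.
mapping = {cat: i for i, cat in enumerate(ordered)}
# "c" has the lowest mean (0.0), then "a" (0.5), then "b" (1.0)
```

Here `mapping` comes out as `{'c': 0, 'a': 1, 'b': 2}`, so the category with the lowest mean target receives the lowest integer.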

This encoding method creates a monotonic relationship between the categorical variable and the response, and therefore makes the variables better suited for use in linear models.

In this recipe, we will encode categories according to the target mean value using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python libraries, functions, and classes:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s determine the mean target value per category in A7, then sort the categories from the one with the lowest to the one with the highest mean target value:
    y_train.groupby(X_train["A7"]).mean().sort_values()

The following is the output of the preceding command:

A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
  4. Now, let’s repeat the computation in step 3, but this time, let’s retain the ordered category names:
    ordered_labels = y_train.groupby(
        X_train["A7"]).mean().sort_values().index

To display the output of the preceding command, we can execute print(ordered_labels):

Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')
  5. Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in step 4:
    ordinal_mapping = {
        k: i for i, k in enumerate(
            ordered_labels, 0)
    }

We can visualize the result of the preceding code by executing print(ordinal_mapping):

{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
  6. Let’s use the dictionary we created in step 5 to replace the categories in A7 in the train and test sets, returning the encoded features as new columns:
    X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
    X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)

Tip

Note that if the test set contains a category not present in the train set, the preceding code will introduce np.nan.
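This behavior is easy to reproduce in isolation; the mapping and values below are made up to show what happens to an unseen category:

```python
import pandas as pd

# Hypothetical mapping learned on a train set
ordinal_mapping = {"a": 0, "b": 1}

# The test set holds a category, "z", never seen during training
test_col = pd.Series(["a", "b", "z"])

encoded = test_col.map(ordinal_mapping)
# "z" has no entry in the mapping, so map() produces NaN for it
assert encoded.isna().sum() == 1

# One simple workaround: replace unseen categories with a sentinel value
encoded = encoded.fillna(-1).astype(int)
```

Whether a sentinel value, the most frequent category's code, or an error is the right choice depends on the model and the data.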

To better understand the monotonic relationship concept, let’s plot the relationship of the categories of the A7 variable with the target before and after the encoding.

  7. Let’s plot the mean target response per category of the A7 variable:
    y_train.groupby(X_train["A7"]).mean().plot()
    plt.title("Relationship between A7 and the target")
    plt.ylabel("Mean of target")
    plt.show()

We can see the non-monotonic relationship between categories of A7 and the target in the following plot:

Figure 2.7 – Relationship between the categories of A7 and the target

  8. Let’s plot the mean target value per category of the encoded variable:
    y_train.groupby(X_train["A7_enc"]).mean().plot()
    plt.title("Relationship between A7 and the target")
    plt.ylabel("Mean of target")
    plt.show()

The encoded variable shows a monotonic relationship with the target – the higher the mean target value, the higher the digit assigned to the category:

Figure 2.8 – Relationship between A7 and the target after the encoding
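Beyond eyeballing the plot, monotonicity can be checked numerically. The following sketch uses made-up encoded codes and targets standing in for A7_enc and y_train:

```python
import pandas as pd

# Hypothetical encoded categories and matching binary target values
a7_enc = pd.Series([0, 0, 1, 1, 2, 2])
target = pd.Series([0, 0, 0, 1, 1, 1])

# Mean target per encoded category, ordered by the integer code
means = target.groupby(a7_enc).mean()

# After ordered encoding, the mean target should never decrease
# as the assigned integer grows
assert means.is_monotonic_increasing
```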

Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.

  9. Let’s import the encoder:
    from feature_engine.encoding import OrdinalEncoder
  10. Next, let’s set up the encoder so that it assigns integers, following the target mean value, to all categorical variables in the dataset:
    ordinal_enc = OrdinalEncoder(
        encoding_method="ordered",
        variables=None)

Tip

OrdinalEncoder() will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the variables argument.

  11. Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category-to-integer mappings:
    ordinal_enc.fit(X_train, y_train)

Tip

When fitting the encoder, we need to pass the train set and the target, as we do with many scikit-learn predictor classes.

  12. Finally, let’s replace the categories with numbers in the train and test sets:
    X_train_enc = ordinal_enc.transform(X_train)
    X_test_enc = ordinal_enc.transform(X_test)

Tip

A list of the categorical variables is stored in the variables_ attribute of OrdinalEncoder(), and the dictionaries with the category-to-integer mappings are stored in the encoder_dict_ attribute.

Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in step 7 and changing the variable name in the groupby() method.

How it works...

In this recipe, we replaced the categories with integers according to the target mean.

In the first part of this recipe, we worked with the A7 categorical variable. With pandas groupby(), we grouped the data based on the categories of A7, and with pandas mean(), we determined the mean value of the target for each of the categories of A7. Next, we ordered the categories with pandas sort_values() from the ones with the lowest to the ones with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target mean as values. With pandas index, we captured the ordered categories in an array; then, with Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the category with integers using pandas map() in the train and test sets.

Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of A7 using pandas groupby(), followed by pandas mean(), as described in the preceding paragraph. We followed up with pandas plot() to create a plot of category versus mean target value. We added a title and y-axis label with Matplotlib’s title() and ylabel() functions.

To perform the encoding with Feature-engine, we used OrdinalEncoder() and indicated "ordered" in the encoding_method argument. We left the argument variables set to None so that the encoder automatically detects all categorical variables in the dataset. With the fit() method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and dictionaries with category-to-digit pairs were stored in the variables_ and encoder_dict_ attributes, respectively. Finally, using the transform() method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.
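To make the fit logic concrete, here is a minimal pandas re-implementation of the "ordered" strategy over all object-typed columns. This is a sketch of the idea, not the library's actual code, and the column names and data are made up:

```python
import pandas as pd

def fit_ordered_mappings(X, y):
    # Learn a target-ordered category-to-integer mapping per object column,
    # mimicking what the "ordered" encoding strategy stores per variable.
    mappings = {}
    for col in X.select_dtypes(include="object").columns:
        ordered = y.groupby(X[col]).mean().sort_values().index
        mappings[col] = {cat: i for i, cat in enumerate(ordered)}
    return mappings

# Made-up data with one categorical and one numerical column
X = pd.DataFrame({"A7": ["v", "h", "v", "o"], "num": [1, 2, 3, 4]})
y = pd.Series([1, 1, 0, 0])

mappings = fit_ordered_mappings(X, y)
# Transform step: map() each encoded column with its learned dictionary
X_enc = X.assign(A7=X["A7"].map(mappings["A7"]))
```

With these values, "o" has the lowest mean target (0.0), "v" sits in the middle (0.5), and "h" has the highest (1.0), so the learned mapping is `{'o': 0, 'v': 1, 'h': 2}`.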

See also

For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.
