Replacing categories with ordinal numbers
Ordinal encoding consists of replacing the categories with integers from 1 to k (or 0 to k-1, depending on the implementation), where k is the number of distinct categories of the variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited for non-linear machine learning models, which can navigate through the arbitrarily assigned integers to find patterns that relate them to the target.
In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and prepare the dataset:
- Import pandas and the data split function:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
```
- Let’s load the dataset and divide it into train and test sets:
```python
data = pd.read_csv("credit_approval_uci.csv")

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```
- To encode the `A7` variable, let’s make a dictionary of category-to-integer pairs:

```python
ordinal_mapping = {
    k: i for i, k in enumerate(X_train["A7"].unique())
}
```
If we execute `print(ordinal_mapping)`, we will see the digits that will replace each category:
```
{'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}
```
- Now, let’s replace the categories with numbers in the original variables:
```python
X_train["A7"] = X_train["A7"].map(ordinal_mapping)
X_test["A7"] = X_test["A7"].map(ordinal_mapping)
```
With `print(X_train["A7"].head(10))`, we can see the result of the preceding operation, where the original categories were replaced by numbers:
```
596    0
303    0
204    0
351    1
118    0
247    2
652    0
513    3
230    0
250    4
Name: A7, dtype: int64
```
Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in step 2.
- Let’s import the required classes:
```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
```
Tip
Do not confuse `OrdinalEncoder()` with `LabelEncoder()` from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.
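The distinction shows up directly in the API: a quick sketch, using toy data rather than the recipe’s dataset, contrasting the two:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder works on a 1-D target array
y = ["yes", "no", "yes"]
print(LabelEncoder().fit_transform(y))  # [1 0 1]

# OrdinalEncoder expects a 2-D feature matrix
X = [["red"], ["blue"], ["red"]]
print(OrdinalEncoder().fit_transform(X))  # [[1.], [0.], [1.]]
```

Note that both assign integers alphabetically by default, unlike the arbitrary order-of-appearance mapping we built by hand in step 3.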
- Let’s set up the encoder:
```python
enc = OrdinalEncoder()
```
Note
Scikit-learn’s `OrdinalEncoder()` will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s `ColumnTransformer()`.
- Let’s make a list containing the categorical variables to encode:
```python
vars_categorical = X_train.select_dtypes(
    include="O").columns.to_list()
```
- Let’s make a list containing the remaining variables:
```python
vars_remainder = X_train.select_dtypes(
    exclude="O").columns.to_list()
```
- Now, let’s set up `ColumnTransformer()` to encode the categorical variables. By setting the `remainder` parameter to `"passthrough"`, we make `ColumnTransformer()` concatenate the variables that are not encoded after the encoded features:

```python
ct = ColumnTransformer(
    [("encoder", enc, vars_categorical)],
    remainder="passthrough",
)
```
- Let’s fit the encoder to the train set so that it creates and stores representations of categories to digits:
```python
ct.fit(X_train)
```
By executing `ct.named_transformers_["encoder"].categories_`, you can visualize the unique categories per variable.
- Now, let’s encode the categorical variables in the train and test sets:
```python
X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)
```
Remember that scikit-learn returns a NumPy array.
- Let’s transform the arrays into pandas DataFrames, adding the column names:
```python
X_train_enc = pd.DataFrame(
    X_train_enc, columns=vars_categorical + vars_remainder)
X_test_enc = pd.DataFrame(
    X_test_enc, columns=vars_categorical + vars_remainder)
```
Note
Note that, with `ColumnTransformer()`, the variables that were not encoded are returned to the right of the DataFrame, following the encoded variables. You can visualize the output of step 12 with `X_train_enc.head()`.
Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in step 2.
- Let’s import the encoder:
```python
from feature_engine.encoding import OrdinalEncoder
```
- Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in step 7:
```python
enc = OrdinalEncoder(
    encoding_method="arbitrary",
    variables=vars_categorical,
)
```
Note
Feature-engine’s `OrdinalEncoder()` automatically finds and encodes all categorical variables if the `variables` parameter is left set to `None`. Alternatively, it will encode only the variables indicated in the list. In addition, Feature-engine’s `OrdinalEncoder()` can assign the integers according to the target mean value (see the Performing ordinal encoding based on the target value recipe).
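To give an idea of what target-guided ordering does under the hood, here is a pandas sketch, with made-up toy data, of ranking categories by the target mean (categories with a higher target mean receive a higher integer):

```python
import pandas as pd

# toy data standing in for one categorical variable and a binary target
df = pd.DataFrame({
    "A7": ["v", "h", "v", "h", "bb", "bb"],
    "target": [0, 1, 1, 1, 0, 0],
})

# rank categories by the mean of the target, then map rank -> integer
ranked = df.groupby("A7")["target"].mean().sort_values().index
mapping = {cat: i for i, cat in enumerate(ranked)}
print(mapping)  # {'bb': 0, 'v': 1, 'h': 2}
```

The resulting integers follow a monotonic relationship with the target, which is what makes this variant useful for linear models as well.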
- Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:
```python
enc.fit(X_train)
```
Tip
The category-to-integer mappings are stored in the `encoder_dict_` attribute and can be accessed by executing `enc.encoder_dict_`.
- Finally, let’s encode the categorical variables in the train and test sets:
```python
X_train_enc = enc.transform(X_train)
X_test_enc = enc.transform(X_test)
```
Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.
How it works...
In this recipe, we replaced categories with integers assigned arbitrarily.
With pandas `unique()`, we returned the unique values of the `A7` variable, and using Python’s dictionary comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the `A7` variable’s unique categories, and each value was the digit that would replace it. Finally, we used pandas `map()` to replace the strings in `A7` with the integers.
Next, we carried out ordinal encoding using scikit-learn’s `OrdinalEncoder()` and used `ColumnTransformer()` to select the columns to encode. With the `fit()` method, the transformer created the category-to-integer mappings based on the categories in the train set. With the `transform()` method, the categories were replaced with integers, returning a NumPy array. `ColumnTransformer()` sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables to the right of the encoded features.
To perform ordinal encoding with Feature-engine, we used `OrdinalEncoder()`, indicating that the integers should be assigned arbitrarily through `encoding_method` and passing a list with the variables to encode to the `variables` argument. With the `fit()` method, the encoder assigned integers to each variable’s categories and stored the mappings in the `encoder_dict_` attribute. These mappings were then used by the `transform()` method to replace the categories in the train and test sets, returning DataFrames.
There’s more...
You can also carry out ordinal encoding with `OrdinalEncoder()` from Category Encoders.
The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.
scikit-learn’s transformer, in contrast, encodes all variables in the dataset. To encode just a subset, we need to use an additional class, `ColumnTransformer()`, to slice the data before the transformation.
Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.
Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.
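For completeness, scikit-learn’s `OrdinalEncoder()` can also be told what to do with unseen categories through the `handle_unknown` parameter (available since version 0.24); a sketch with toy data:

```python
from sklearn.preprocessing import OrdinalEncoder

# map categories not seen during fit() to a sentinel value
# instead of raising an error at transform time
enc = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
enc.fit([["v"], ["h"]])
print(enc.transform([["v"], ["zzz"]]))  # [[1.], [-1.]]
```

Replacing unseen categories with a sentinel such as -1 keeps the transformation from failing in production, at the cost of lumping all new categories together.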