Performing ordinal encoding based on the target value
In the previous recipe, we replaced categories with integers that were assigned arbitrarily. We can also assign the integers based on the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign digits to the ordered categories, starting with 0 for the first category and ending with k-1 for the last category, where k is the number of distinct categories.
This encoding method creates a monotonic relationship between the categorical variable and the response, which makes the variable better suited for use in linear models.
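The procedure can be illustrated with a small sketch on toy data (the values below are hypothetical and are not taken from the credit approval dataset used in this recipe):

```python
import pandas as pd

# Toy categorical variable and binary target (hypothetical values).
X = pd.Series(["a", "b", "a", "c", "b", "c", "c"], name="var")
y = pd.Series([0, 1, 0, 1, 0, 1, 1], name="target")

# Mean target per category, ordered from lowest to highest.
ordered = y.groupby(X).mean().sort_values().index

# Assign 0 to k-1 following that order.
mapping = {category: i for i, category in enumerate(ordered)}
print(mapping)  # {'a': 0, 'b': 1, 'c': 2}
```

Here, category a has a mean target of 0, b of 0.5, and c of 1, so the assigned integers increase with the target mean.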
In this recipe, we will encode categories according to the target mean value using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the required Python libraries, functions, and classes:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s determine the mean target value per category in A7, then sort the categories from the one with the lowest to the one with the highest target mean value:
y_train.groupby(X_train["A7"]).mean().sort_values()
The following is the output of the preceding command:
A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
- Now, let’s repeat the computation in step 3, but this time, let’s retain the ordered category names:
ordered_labels = y_train.groupby(
    X_train["A7"]).mean().sort_values().index
To display the output of the preceding command, we can execute print(ordered_labels):
Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')
- Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in step 4:
ordinal_mapping = {
    k: i for i, k in enumerate(
        ordered_labels, 0)
}
We can visualize the result of the preceding code by executing print(ordinal_mapping):
{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
- Let’s use the dictionary we created in step 5 to replace the categories in A7 in the train and test sets, returning the encoded features as new columns:
X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)
Tip
Note that if the test set contains a category not present in the train set, the preceding code will introduce np.nan.
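One way to guard against such unseen categories is to fill the resulting NaN values with a sentinel value after mapping. The following is a minimal sketch on toy data (the mapping and values are hypothetical):

```python
import pandas as pd

# Mapping learned from a hypothetical train set.
ordinal_mapping = {"a": 0, "b": 1, "c": 2}

# Test column with a category, "d", that was unseen during training.
test_col = pd.Series(["a", "d", "c"])

encoded = test_col.map(ordinal_mapping)
print(encoded.isna().sum())  # 1 -> the unseen category became NaN

# One option: replace NaN with a sentinel value such as -1.
encoded = encoded.fillna(-1).astype(int)
print(encoded.tolist())  # [0, -1, 2]
```

Whether a sentinel, the most frequent code, or an error is the right choice depends on the downstream model.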
To better understand the monotonic relationship concept, let’s plot the relationship between the categories of the A7 variable and the target, before and after the encoding.
- Let’s plot the mean target response per category of the A7 variable:
y_train.groupby(X_train["A7"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
We can see the non-monotonic relationship between the categories of A7 and the target in the following plot:
Figure 2.7 – Relationship between the categories of A7 and the target
- Let’s plot the mean target value per category in the encoded variable:
y_train.groupby(X_train["A7_enc"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
The encoded variable shows a monotonic relationship with the target: the higher the digit assigned to the category, the higher the mean target value:
Figure 2.8 – Relationship between A7 and the target after the encoding
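Instead of inspecting the plot visually, the monotonic relationship can also be checked numerically with pandas’ is_monotonic_increasing property. The following sketch uses toy encoded values (hypothetical, not the recipe’s data):

```python
import pandas as pd

# Toy encoded variable and binary target (hypothetical values).
X_enc = pd.Series([0, 1, 0, 2, 1, 2, 2], name="var_enc")
y = pd.Series([0, 1, 0, 1, 0, 1, 1], name="target")

# Mean target per encoded value, sorted by the integer code.
means = y.groupby(X_enc).mean().sort_index()
print(means.is_monotonic_increasing)  # True for target-ordered codes
```

When the codes were assigned from the target mean, the mean target per code is nondecreasing by construction on the train set.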
Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
- Let’s import the encoder:
from feature_engine.encoding import OrdinalEncoder
- Next, let’s set up the encoder so that it assigns integers according to the target value to all categorical variables in the dataset:
ordinal_enc = OrdinalEncoder(
    encoding_method="ordered",
    variables=None)
Tip
OrdinalEncoder() will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the variables argument.
- Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category and integer mappings:
ordinal_enc.fit(X_train, y_train)
Tip
When fitting the encoder, we need to pass the train set and the target, like with many scikit-learn predictor classes.
- Finally, let’s replace the categories with numbers in the train and test sets:
X_train_enc = ordinal_enc.transform(X_train)
X_test_enc = ordinal_enc.transform(X_test)
Tip
A list of the categorical variables is stored in the variables_ attribute of OrdinalEncoder(), and the dictionaries with the category-to-integer mappings are stored in the encoder_dict_ attribute.
Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in step 7 and changing the variable name in the groupby() method.
How it works...
In this recipe, we replaced the categories with integers according to the target mean.
In the first part of this recipe, we worked with the A7 categorical variable. With pandas groupby(), we grouped the data based on the categories of A7, and with pandas mean(), we determined the mean value of the target for each category of A7. Next, we ordered the categories with pandas sort_values(), from the one with the lowest to the one with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target means as values. With pandas index, we captured the ordered categories in an array; then, with a Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the categories with integers in the train and test sets using pandas map().
Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of A7 using pandas groupby(), followed by pandas mean(), as described in the preceding paragraph. We followed up with pandas plot() to create a plot of category versus target mean value. We added a title and y label with Matplotlib’s title() and ylabel() methods.
To perform the encoding with Feature-engine, we used OrdinalEncoder() and indicated "ordered" in the encoding_method argument. We left the variables argument set to None so that the encoder automatically detects all categorical variables in the dataset. With the fit() method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and the dictionaries with category-to-digit pairs were stored in the variables_ and encoder_dict_ attributes, respectively. Finally, using the transform() method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.
See also
For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.