Replacing categories with ordinal numbers
Ordinal encoding consists of replacing the categories with integers from 1 to k (or 0 to k-1, depending on the implementation), where k is the number of distinct categories of the variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited for non-linear machine learning models, which can navigate through the arbitrarily assigned integers to find patterns that relate them to the target.
In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and prepare the dataset:
- Import pandas and the data split function:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
```
- Let’s load the dataset and divide it into train and test sets:
```python
data = pd.read_csv("credit_approval_uci.csv")

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```
- To encode the `A7` variable, let’s make a dictionary of category-to-integer pairs:

```python
ordinal_mapping = {
    k: i for i, k in enumerate(X_train["A7"].unique())
}
```
If we execute `print(ordinal_mapping)`, we will see the digits that will replace each category:
```
{'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}
```
- Now, let’s replace the categories with numbers in the original variables:
```python
X_train["A7"] = X_train["A7"].map(ordinal_mapping)
X_test["A7"] = X_test["A7"].map(ordinal_mapping)
```
With `print(X_train["A7"].head(10))`, we can see the result of the preceding operation, where the original categories were replaced by numbers:
```
596    0
303    0
204    0
351    1
118    0
247    2
652    0
513    3
230    0
250    4
Name: A7, dtype: int64
```
Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in step 2.
- Let’s import the required classes:
```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
```
Tip
Do not confuse `OrdinalEncoder()` with `LabelEncoder()` from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.
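The distinction shows up directly in the API: a quick sketch, using toy data rather than the recipe’s dataset, contrasting the two:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder works on a 1-D target array
y = ["yes", "no", "yes"]
print(LabelEncoder().fit_transform(y))  # [1 0 1]

# OrdinalEncoder expects a 2-D feature matrix
X = [["red"], ["blue"], ["red"]]
print(OrdinalEncoder().fit_transform(X))  # [[1.], [0.], [1.]]
```

Note that both assign integers alphabetically by default, unlike the arbitrary order-of-appearance mapping we built by hand in step 3.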
- Let’s set up the encoder:
```python
enc = OrdinalEncoder()
```
Note
Scikit-learn’s `OrdinalEncoder()` will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s `ColumnTransformer()`.
- Let’s make a list containing the categorical variables to encode:
```python
vars_categorical = X_train.select_dtypes(
    include="O").columns.to_list()
```
- Let’s make a list containing the remaining variables:
```python
vars_remainder = X_train.select_dtypes(
    exclude="O").columns.to_list()
```
- Now, let’s set up `ColumnTransformer()` to encode the categorical variables. By setting the `remainder` parameter to `"passthrough"`, we make `ColumnTransformer()` concatenate the variables that are not encoded after the encoded features:

```python
ct = ColumnTransformer(
    [("encoder", enc, vars_categorical)],
    remainder="passthrough",
)
```
- Let’s fit the encoder to the train set so that it creates and stores representations of categories to digits:
```python
ct.fit(X_train)
```
By executing `ct.named_transformers_["encoder"].categories_`, you can visualize the unique categories per variable.
- Now, let’s encode the categorical variables in the train and test sets:
```python
X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)
```
Remember that scikit-learn returns a NumPy array.
- Let’s transform the arrays into pandas DataFrames, adding the column names:
```python
X_train_enc = pd.DataFrame(
    X_train_enc, columns=vars_categorical + vars_remainder)
X_test_enc = pd.DataFrame(
    X_test_enc, columns=vars_categorical + vars_remainder)
```
Note
Note that, with `ColumnTransformer()`, the variables that were not encoded are returned to the right of the DataFrame, following the encoded variables. You can visualize the output of step 12 with `X_train_enc.head()`.
Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in step 2.
- Let’s import the encoder:
```python
from feature_engine.encoding import OrdinalEncoder
```
- Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in step 7:
```python
enc = OrdinalEncoder(
    encoding_method="arbitrary",
    variables=vars_categorical,
)
```
Note
Feature-engine’s `OrdinalEncoder()` automatically finds and encodes all categorical variables if the `variables` parameter is left set to `None`. Alternatively, it will encode only the variables indicated in the list. In addition, Feature-engine’s `OrdinalEncoder()` can assign the integers according to the target mean value (see the Performing ordinal encoding based on the target value recipe).
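To give an idea of what target-guided ordering does under the hood, here is a pandas sketch, with made-up toy data, of ranking categories by the target mean (categories with a higher target mean receive a higher integer):

```python
import pandas as pd

# toy data standing in for one categorical variable and a binary target
df = pd.DataFrame({
    "A7": ["v", "h", "v", "h", "bb", "bb"],
    "target": [0, 1, 1, 1, 0, 0],
})

# rank categories by the mean of the target, then map rank -> integer
ranked = df.groupby("A7")["target"].mean().sort_values().index
mapping = {cat: i for i, cat in enumerate(ranked)}
print(mapping)  # {'bb': 0, 'v': 1, 'h': 2}
```

The resulting integers follow a monotonic relationship with the target, which is what makes this variant useful for linear models as well.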
- Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:
```python
enc.fit(X_train)
```
Tip
The category-to-integer mappings are stored in the `encoder_dict_` attribute and can be accessed by executing `enc.encoder_dict_`.
- Finally, let’s encode the categorical variables in the train and test sets:
```python
X_train_enc = enc.transform(X_train)
X_test_enc = enc.transform(X_test)
```
Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.
How it works...
In this recipe, we replaced categories with integers assigned arbitrarily.
With pandas `unique()`, we returned the unique values of the `A7` variable, and using Python’s dictionary comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the `A7` variable’s unique categories, and each value was the digit that would replace it. Finally, we used pandas `map()` to replace the strings in `A7` with the integers.
Next, we carried out ordinal encoding using scikit-learn’s `OrdinalEncoder()` and used `ColumnTransformer()` to select the columns to encode. With the `fit()` method, the transformer created the category-to-integer mappings based on the categories in the train set. With the `transform()` method, the categories were replaced with integers, returning a NumPy array. `ColumnTransformer()` sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables to the right of the encoded features.
To perform ordinal encoding with Feature-engine, we used `OrdinalEncoder()`, indicating that the integers should be assigned arbitrarily through `encoding_method` and passing a list with the variables to encode to the `variables` argument. With the `fit()` method, the encoder assigned integers to each variable’s categories and stored the mappings in the `encoder_dict_` attribute. These mappings were then used by the `transform()` method to replace the categories in the train and test sets, returning DataFrames.
There’s more...
You can also carry out ordinal encoding with `OrdinalEncoder()` from Category Encoders.
The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.
scikit-learn’s transformer, in contrast, encodes all variables in the dataset. To encode just a subset, we need to use an additional class, `ColumnTransformer()`, to slice the data before the transformation.
Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.
Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.
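For completeness, scikit-learn’s `OrdinalEncoder()` can also be told what to do with unseen categories through the `handle_unknown` parameter (available since version 0.24); a sketch with toy data:

```python
from sklearn.preprocessing import OrdinalEncoder

# map categories not seen during fit() to a sentinel value
# instead of raising an error at transform time
enc = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
enc.fit([["v"], ["h"]])
print(enc.transform([["v"], ["zzz"]]))  # [[1.], [-1.]]
```

Replacing unseen categories with a sentinel such as -1 keeps the transformation from failing in production, at the cost of lumping all new categories together.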