Imputing categorical variables
We typically impute categorical variables with the most frequent category or with a specific string. To avoid data leakage, we find the frequent categories in the train set, and then use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the frequent categories for the imputation out of the box.
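To make the leak-free workflow concrete, here is a minimal sketch with a hypothetical single-column DataFrame (the column name and values are made up): the mode is learned from the train set only and then applied to both train and test.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"color": ["blue", "blue", "red", np.nan]})
test = pd.DataFrame({"color": [np.nan, "green"]})

# learn the most frequent category from the train set only
frequent = train["color"].mode().iloc[0]  # "blue"

# impute train and test with the train-set mode
train_t = train.fillna({"color": frequent})
test_t = test.fillna({"color": frequent})

print(test_t["color"].tolist())  # ['blue', 'green']
```

Note that the test-set missing value is replaced with the train-set mode, even if the test set's own mode were different.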
In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.
How to do it...
To begin, let’s make a few imports and prepare the data:
- Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from feature_engine.imputation import CategoricalImputer
- Let’s load the dataset that we prepared in the Technical requirements section:
data = pd.read_csv("credit_approval_uci.csv")
- Let’s split the data into train and test sets and their respective targets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s capture the categorical variables in a list:
categorical_vars = X_train.select_dtypes(
    include="O").columns.to_list()
- Let’s store the variables’ most frequent categories in a dictionary:
frequent_values = X_train[
    categorical_vars].mode().iloc[0].to_dict()
- Let’s replace missing values with the frequent categories:
X_train_t = X_train.fillna(value=frequent_values)
X_test_t = X_test.fillna(value=frequent_values)
Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).
- To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:

imputation_dict = {
    var: "no_data" for var in categorical_vars
}
Now, we can use this dictionary and the code in step 6 to replace missing data.
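Applied to a toy DataFrame (the column names below are made up for illustration), the dictionary plus fillna() combination looks like this:

```python
import pandas as pd

X_train = pd.DataFrame({"A1": ["a", None, "b"], "A4": [None, "u", "u"]})
categorical_vars = ["A1", "A4"]

# map every categorical variable to the same arbitrary string
imputation_dict = {var: "no_data" for var in categorical_vars}

# fill missing values with the arbitrary string
X_train_t = X_train.fillna(value=imputation_dict)

print(X_train_t["A1"].tolist())  # ['a', 'no_data', 'b']
```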
Note

With pandas value_counts() we can see the string added by the imputation. Try executing, for example, X_train["A1"].value_counts().
Now, let’s impute missing values with the most frequent category using scikit-learn.
- Let’s set up the imputer to find the most frequent category per variable:
imputer = SimpleImputer(strategy='most_frequent')
Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. In practice, however, mode imputation is done for categorical variables only.
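The following sketch with a hypothetical two-column DataFrame shows this behavior: SimpleImputer() learns a mode for the numerical column just as it does for the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical frame with one numerical and one categorical column
df = pd.DataFrame({
    "num": [1.0, 1.0, 2.0, np.nan],
    "cat": ["a", "a", "b", np.nan],
})

imputer = SimpleImputer(strategy="most_frequent")
out = imputer.fit_transform(df)

# the imputer learned a mode for BOTH columns
print(imputer.statistics_)  # [1.0 'a']
```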
- Let’s restrict the imputation to the categorical variables:
ct = ColumnTransformer(
    [("imputer", imputer, categorical_vars)],
    remainder="passthrough",
).set_output(transform="pandas")
Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").
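As a quick sketch of the constant strategy on a hypothetical single-column DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({"A1": ["a", np.nan, "b"]})

# fill missing values with an arbitrary string instead of the mode
imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_t = imputer.fit_transform(X)

print(X_t[:, 0].tolist())  # ['a', 'missing', 'b']
```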
- Fit the imputer to the train set so that it learns the most frequent values:
ct.fit(X_train)
- Let’s take a look at the most frequent values learned by the imputer:
ct.named_transformers_.imputer.statistics_
The previous command returns the most frequent values per variable:
array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
- Finally, let’s replace missing values with the frequent categories:
X_train_t = ct.transform(X_train)
X_test_t = ct.transform(X_test)
Make sure to inspect the resulting DataFrames by executing X_train_t.head().
Note

ColumnTransformer() changes the names of the variables. The imputed variables show the prefix imputer and the untransformed variables the prefix remainder.
Finally, let’s impute missing values using feature-engine.
- Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
imputer = CategoricalImputer(
    imputation_method="frequent",
    variables=categorical_vars,
)
Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.
- Fit the imputer to the train set so that it learns the most frequent categories:
imputer.fit(X_train)
Note

To impute categorical variables with a specific string, set imputation_method to missing and fill_value to the desired string.
- Let’s check out the learned categories:
imputer.imputer_dict_
We can see the dictionary with the most frequent values in the following output:
{'A1': 'b', 'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v', 'A9': 't', 'A10': 'f', 'A12': 'f', 'A13': 'g'}
- Finally, let’s replace the missing values with frequent categories:

X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)
If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.
How it works...
In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.
In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas mode(), and to return a dictionary, we used pandas to_dict(). To replace the missing data, we used pandas fillna(), passing the dictionary with the variables and their frequent categories as a parameter. A variable can have more than one mode, which is why we used .iloc[0] to capture only the first of those values.
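The multi-mode case is easy to demonstrate with a toy Series in which two categories are equally frequent:

```python
import pandas as pd

# "x" and "y" appear twice each, so mode() returns both values
s = pd.Series(["x", "y", "x", "y", None])
print(s.mode().tolist())  # ['x', 'y']

# .iloc[0] keeps only the first mode for the imputation dictionary
print(s.mode().iloc[0])  # 'x'
```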
To replace the missing values using scikit-learn, we used SimpleImputer() with strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the train set as a result of the transform() method.
With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.
SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() method.
To replace missing values with feature-engine, we used CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.
Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless specifically told otherwise by setting the ignore_format parameter to True. In addition, with feature-engine transformers we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.