In the previous recipes, we have seen that some features are categorical variables (originally represented as either object or category data types). However, most machine learning algorithms work exclusively with numeric data. That is why we need to encode categorical features into a representation compatible with the models.
In this recipe, we cover some popular encoding approaches:
- Label encoding
- One-hot encoding
In label encoding, we replace each categorical value with a numeric value between 0 and the number of classes minus 1; for example, with three distinct classes, we use {0, 1, 2}.
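A minimal sketch of label encoding using only pandas (the toy `education` feature here is hypothetical, chosen to mirror the examples in this recipe; `pd.factorize` assigns codes in order of first appearance, while scikit-learn's `LabelEncoder` would assign them alphabetically):

```python
import pandas as pd

# hypothetical toy feature with three distinct classes
education = pd.Series(["primary", "secondary", "tertiary", "primary"])

# pd.factorize replaces each class with an integer code in [0, n_classes - 1],
# assigned in order of first appearance
codes, uniques = pd.factorize(education)
print(codes.tolist())    # [0, 1, 2, 0]
print(uniques.tolist())  # ['primary', 'secondary', 'tertiary']
```

Because the three classes happen to appear in alphabetical order, the codes match what an alphabetically-ordered encoder would produce here.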
This is already very similar to the outcome of converting to the category data type in pandas. We can access the codes of the categories by running df_cat.education.cat.codes. Additionally, we can recover the mapping by running dict(zip(df_cat.education.cat.codes, df_cat.education)).
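This can be sketched as follows (the `df_cat` DataFrame and its values are hypothetical stand-ins for the dataset used in the previous recipes; for the category dtype, codes follow the sorted order of the categories):

```python
import pandas as pd

# hypothetical stand-in for the df_cat DataFrame from the previous recipes
df_cat = pd.DataFrame({"education": ["primary", "secondary", "tertiary", "primary"]})
df_cat["education"] = df_cat["education"].astype("category")

# the category dtype stores an integer code per row
print(df_cat.education.cat.codes.tolist())  # [0, 1, 2, 0]

# recover the code -> category mapping by zipping codes with the values
mapping = dict(zip(df_cat.education.cat.codes, df_cat.education))
print(mapping)  # {0: 'primary', 1: 'secondary', 2: 'tertiary'}
```

Note that the zip pairs every row's code with its value, so duplicate rows simply overwrite an identical entry; the resulting dictionary contains one entry per distinct class.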