Exploring alternative approaches to encoding categorical features
In the previous chapter, we introduced one-hot encoding as the standard solution for encoding categorical features so that they can be understood by ML algorithms. To recap, one-hot encoding converts a categorical variable into a set of binary columns, one per category, where a value of 1 indicates that the row belongs to that category and a value of 0 indicates otherwise.
The biggest drawback of that approach is the rapid growth in the dimensionality of our dataset. For example, if we had a feature indicating which US state an observation originates from, one-hot encoding it would create 50 new columns (or 49 if we dropped the reference value).
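The following is a minimal sketch (using pandas and a made-up state column) of how the column count grows with one-hot encoding; the data and column names are purely illustrative:

```python
import pandas as pd

# Hypothetical data with a single categorical feature
df = pd.DataFrame({"state": ["CA", "NY", "TX", "CA", "WA"]})

# One-hot encode: one binary column per category observed in the data
one_hot = pd.get_dummies(df["state"], prefix="state", dtype=int)
print(one_hot)
#    state_CA  state_NY  state_TX  state_WA
# 0         1         0         0         0
# 1         0         1         0         0
# 2         0         0         1         0
# 3         1         0         0         0
# 4         0         0         0         1

# Dropping the reference category removes one redundant column
one_hot_reduced = pd.get_dummies(
    df["state"], prefix="state", dtype=int, drop_first=True
)
print(one_hot_reduced.shape)  # (5, 3); with all 50 states it would be (n, 49)
```

With only 4 distinct states we already get 4 (or 3) new columns; a column containing all 50 states would inflate the dataset by 50 (or 49) columns, most of whose values are 0.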
Some other issues with one-hot encoding include:
- Creating that many Boolean features introduces sparsity to the dataset, which decision trees don’t handle well.
- Decision trees’ splitting algorithm treats all the dummy variables as separate, independent features, so it can only split off a single category at a time, which fragments the information contained in the original variable.