Converting categorical features to numerical – one-hot encoding and ordinal encoding
In Chapter 3, Predicting Online Ad Click-Through with Tree-Based Algorithms, I mentioned how one-hot encoding transforms categorical features into numerical features so that they can be used with the tree algorithms in scikit-learn and TensorFlow. Once categorical features are converted into numerical ones with one-hot encoding, our choice of algorithms is no longer limited to the tree-based ones that can work with categorical features directly.
The simplest solution we can think of in terms of transforming a categorical feature with k possible values is to map it to a numerical feature with values from 1 to k. For example, [Tech, Fashion, Fashion, Sports, Tech, Tech, Sports] becomes [1, 2, 2, 3, 1, 1, 3]. However, this will impose an ordinal characteristic, such as Sports being greater than Tech, and a distance property, such as Sports being closer to Fashion than to Tech.
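To make the problem with the 1-to-k mapping concrete, here is a minimal sketch in plain Python. The helper name label_encode and the choice to number categories in order of first appearance are assumptions made for illustration only; they are not taken from scikit-learn or any other library.

```python
# Hypothetical helper for illustration; not a library function.
def label_encode(values):
    """Map each distinct category to an integer from 1 to k,
    numbering categories in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping


categories = ['Tech', 'Fashion', 'Fashion', 'Sports', 'Tech', 'Tech', 'Sports']
encoded, mapping = label_encode(categories)
print(encoded)  # [1, 2, 2, 3, 1, 1, 3]
print(mapping)  # {'Tech': 1, 'Fashion': 2, 'Sports': 3}

# Artifacts introduced by the integer codes, neither of which
# reflects anything real about the categories themselves:
sports_gt_tech = mapping['Sports'] > mapping['Tech']               # an ordering
dist_sports_fashion = abs(mapping['Sports'] - mapping['Fashion'])  # distance of 1
dist_sports_tech = abs(mapping['Sports'] - mapping['Tech'])        # distance of 2
print(sports_gt_tech, dist_sports_fashion, dist_sports_tech)
```

Any algorithm that compares or subtracts feature values would treat these orderings and distances as meaningful, even though they depend only on the arbitrary order in which the categories were numbered.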
Instead, one-hot encoding...