Encoding categorical features: one-hot encoding
There are several reasons why we may need to encode features before using them in most machine learning algorithms. First, these algorithms typically require numeric data. Second, when a categorical feature is already represented with numbers, for example, 1 for female and 2 for male, we need to encode the values so that they are recognized as categorical; otherwise, the algorithm will treat the codes as if they had magnitude and order. Third, the feature might actually be ordinal, with a discrete number of values that represent some meaningful ranking, and our models need to capture that ranking. Finally, a categorical feature might have a large number of values (known as high cardinality), and we might want our encoding to collapse those categories.
We can handle the encoding of features with a limited number of values, say 15 or fewer, with one-hot encoding. We go over one-hot encoding in this recipe and then discuss ordinal encoding in the next recipe. We will look at strategies for handling categorical features with high cardinality...
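To make the idea concrete, here is a minimal sketch of one-hot encoding using pandas. The column name `gender` and its numeric codes (1 and 2) are hypothetical examples standing in for the female/male coding mentioned above; the key step is converting the codes to a categorical type and expanding them into indicator columns.

```python
import pandas as pd

# Hypothetical data: 1 and 2 are category codes, not magnitudes
df = pd.DataFrame({"gender": [1, 2, 2, 1]})

# Mark the column as categorical so the codes are not treated as numbers
df["gender"] = df["gender"].astype("category")

# Expand into one indicator (dummy) column per category value
dummies = pd.get_dummies(df["gender"], prefix="gender")
print(dummies)
```

Each row now has exactly one indicator set per original value, so a model sees membership in a category rather than an arbitrary numeric code. For production pipelines, scikit-learn's `OneHotEncoder` offers the same transformation with the added ability to learn the categories on training data and apply them consistently to new data.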