Hashing high cardinality features
High cardinality features are qualitative features with many possible values. They appear in many applications, such as the country in a customer database, the phone model in advertising, or the vocabulary in NLP applications. High cardinality raises several issues: not only can it lead to a very high-dimensional dataset (for example, when each value is one-hot encoded into its own column), but the set of possible values can also keep growing as new values appear. Indeed, even if the list of countries or a vocabulary is arguably quite stable, there are new phone models every week, if not every day.
Hashing is a very popular and useful way to deal with such problems. In this recipe, we’ll see what it is and how to use it in practice on a dataset to predict whether employees will leave a company.
Getting started
Hashing is a very useful trick in computer science in general, and it is widely used in cryptography and blockchain, for example. It is also useful in machine learning...
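To make the idea concrete, here is a minimal sketch of hashing a categorical feature into a fixed number of columns. It is an illustration under stated assumptions, not necessarily the approach used later in this recipe: the helper names `hash_bucket` and `hash_encode`, the choice of MD5 as the hash function, the 8 buckets, and the phone model values are all hypothetical.

```python
import hashlib


def hash_bucket(value: str, n_buckets: int = 8) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    # hashlib is used instead of the built-in hash(), whose output for
    # strings changes between Python processes unless PYTHONHASHSEED is set.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets: int = 8):
    """Encode each value as a one-hot row of fixed length n_buckets."""
    rows = []
    for value in values:
        row = [0] * n_buckets
        row[hash_bucket(value, n_buckets)] = 1
        rows.append(row)
    return rows


# Hypothetical phone model names, for illustration only.
phones = ["phone_a", "phone_b", "phone_c", "phone_a"]
encoded = hash_encode(phones)

# A value never seen before still lands in one of the existing columns.
unseen_bucket = hash_bucket("phone_released_tomorrow")
```

The number of columns stays at `n_buckets` no matter how many distinct values show up, which is what makes the approach robust to new categories. The trade-off is that two different values can collide in the same bucket, and collisions become more likely as `n_buckets` gets smaller. Libraries such as scikit-learn provide ready-made implementations of this idea (for example, `FeatureHasher`).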