One of the most common data types we might encounter while building a machine learning system are categorical features (also known as discrete features), such as the color of a fruit or the name of a company. The challenge with categorical features is that they don't change in a continuous way, which makes it hard to represent them with numbers. For example, a banana is either green or yellow, but not both. A product belongs either in the clothing department or in the books department, but rarely in both, and so on.
How would you go about representing such features?
For example, let's assume we are trying to encode a dataset consisting of a list of forefathers of machine learning and artificial intelligence:
In [1]: data = [
... {'name': 'Alan Turing', 'born': 1912, 'died': 1954},
... ...