As with categorical features, scikit-learn offers an easy way to encode another common feature type: text features. When working with text features, it is often convenient to encode individual words or phrases as numerical values.
Let's consider a dataset that contains a small corpus of text phrases:
In [1]: sample = [
... 'feature engineering',
... 'feature selection',
... 'feature extraction'
... ]
One of the simplest ways to encode such data is by word count: for each phrase, we count the occurrences of each word within it. In scikit-learn, this is easily done using CountVectorizer, which works much like DictVectorizer:
In [2]: from sklearn.feature_extraction.text import CountVectorizer
... vec = CountVectorizer()
... X = vec.fit_transform(sample)
... X
Out[2]: <3x4...
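To make the word-count encoding concrete, the same kind of matrix can be built by hand. The following is only a minimal pure-Python sketch, not how CountVectorizer is implemented: the helper name count_words is hypothetical, and it assumes that lowercasing and splitting on whitespace is an adequate tokenizer for this small corpus, with the vocabulary sorted alphabetically to fix the column order.

```python
from collections import Counter

sample = [
    'feature engineering',
    'feature selection',
    'feature extraction',
]


def count_words(phrases):
    """Return (vocabulary, count matrix) for a list of phrases.

    Hypothetical helper for illustration only.
    """
    # Simplifying assumption: lowercase and split on whitespace only.
    tokenized = [phrase.lower().split() for phrase in phrases]
    # Sort the vocabulary so that the column order is deterministic.
    vocab = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows


vocab, counts = count_words(sample)
print(vocab)
for row in counts:
    print(row)
```

The result has one row per phrase and one column per distinct word, that is, three rows and four columns for this corpus, which agrees with the 3x4 shape reported above. The word 'feature' appears in every phrase, so its column is 1 in all three rows.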