Bag of words feature extraction
Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.
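To make the "order and count don't matter" point concrete, here is a minimal sketch. The helper name word_presence is hypothetical and used only for this illustration; it assumes nothing beyond plain Python dicts.

```python
def word_presence(words):
    # Hypothetical helper for illustration: map each distinct word to True.
    # Repeated words collapse into a single key.
    return {word: True for word in words}

# Same words, different order, with some repeats.
a = word_presence(['the', 'quick', 'brown', 'fox'])
b = word_presence(['fox', 'the', 'fox', 'brown', 'quick', 'the'])

# Dict equality ignores insertion order, so both instances
# produce the same word presence feature set.
print(a == b)
print(len(b))
```

Both token lists yield an equal feature set with four keys, because neither position nor frequency is recorded.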
How to do it...
The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:
def bag_of_words(words):
    return dict([(word, True) for word in words])
We can use it with a list of words; in this case, the tokenized sentence "the quick brown fox":
>>> from featx import bag_of_words
>>> bag_of_words(['the', 'quick', 'brown', 'fox'...
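Because the feature sets are meant for a classifier, it may help to see how they could be paired with labels. The following is only a sketch: the function is redefined locally so the block runs without featx.py, and the sentences and the labels 'animal' and 'finance' are made up for illustration.

```python
def bag_of_words(words):
    # Same logic as the function shown above, repeated here so that
    # this sketch does not depend on featx.py being importable.
    return dict([(word, True) for word in words])

# Hypothetical tokenized sentences with made-up labels.
labeled_sentences = [
    (['the', 'quick', 'brown', 'fox'], 'animal'),
    (['the', 'stock', 'market', 'fell'], 'finance'),
]

# Build (feature set, label) pairs. A list of such pairs is the shape
# that NLTK classifier train() methods, for example
# nltk.NaiveBayesClassifier.train, accept as training data.
train_feats = [(bag_of_words(words), label)
               for words, label in labeled_sentences]

for feats, label in train_feats:
    print(label, sorted(feats))
```

Each training instance ends up as a dict of word presence features plus its label; the classifier itself is left out so the sketch stays free of external dependencies.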