Bag-of-words and simple tokenizers
In Chapters 3 and 5, we saw the bag-of-words feature extraction technique in use. This technique takes a text and counts the occurrences of each token; in Chapters 3 and 5, the tokens were words. It is simple and computationally efficient, but it has a few problems.
When instantiating the bag-of-words vectorizer, we can set several parameters that strongly impact the results, as we did in the following fragment of code from the previous chapters:
# import the BOW vectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# create the feature extractor, i.e., BOW vectorizer
# please note the argument - max_features
# this argument says that we only want three features
# this will illustrate that we can get problems - e.g., noise
# when using too few features
vectorizer = CountVectorizer(max_features=3)
The max_features parameter is a cut-off value that reduces the number of features, but it can also introduce noise: two (or more) distinct sentences may end up with the same feature vector (we saw an example of such sentences in Chapter...
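To make the collision problem concrete, here is a minimal sketch (the two example sentences are my own illustration, not the example from the earlier chapters). The tokens that distinguish the sentences are also the rarest ones, so a low max_features cuts them off and both sentences receive the same feature vector:

from sklearn.feature_extraction.text import CountVectorizer

# two distinct sentences that differ only in a rare token ("cat" vs. "dog")
docs = [
    "the cat sat on the mat",
    "the dog sat on the mat",
]

# keep only the 3 most frequent tokens across the corpus
vectorizer = CountVectorizer(max_features=3)
X = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())
print(X)

# "cat" and "dog" occur only once each, so they fall below the
# cut-off; the surviving counts are identical for both sentences,
# and the two distinct sentences map to the same feature vector
assert (X[0] == X[1]).all()

Raising max_features (or leaving it unset, so the full vocabulary is kept) removes this particular collision, at the cost of a larger, sparser feature matrix.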