Constructing an N-gram model
Representing a document as a bag of words is useful, but semantics involves more than words in isolation. To capture word combinations, we can use an n-gram model, whose vocabulary consists not just of single words but also of word sequences, or n-grams.
We will build a bigram model in this recipe, where bigrams are sequences of two words.
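As a quick illustration of what bigrams look like, here is a minimal sketch (not part of the recipe's code; the sentence is made up):

```python
# Minimal sketch: listing the bigrams of a short tokenized sentence
tokens = "the quick brown fox".split()
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)  # ['the quick', 'quick brown', 'brown fox']
```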
Getting ready
The CountVectorizer class is very versatile and allows us to construct n-gram models. We will use it in this recipe and test the resulting model with a simple classifier.
Throughout this recipe, I compare the code and its results to those in the Putting documents into a bag of words recipe, since the two are very similar but differ in a few important ways.
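As a preview of the n-gram support we will rely on, the following minimal sketch shows how CountVectorizer's ngram_range parameter controls which n-grams enter the vocabulary (the example sentence is made up and is not from the recipe's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps both unigrams and bigrams in the vocabulary;
# ngram_range=(2, 2) would keep bigrams only
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit_transform(["the cat sat on the mat"])
print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']
```

Note that get_feature_names_out assumes scikit-learn 1.0 or later; older versions expose the same information via get_feature_names.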
How to do it…
- Run the simple classifier notebook and import the CountVectorizer class:

```python
%run -i "../util/util_simple_classifier.ipynb"
from sklearn.feature_extraction.text import CountVectorizer
```
- Create the training and...