Constructing the N-gram model
Representing a document as a bag of words is useful, but semantics is about more than words in isolation. To capture word combinations, we can use an n-gram model, whose vocabulary consists not just of single words but of word sequences, or n-grams. In this recipe we will build a bigram model, where bigrams are sequences of two words.
Getting ready
The CountVectorizer class is very versatile and allows us to construct n-gram models. We will use it again in this recipe. We will also explore how to build character n-gram models using this class.
How to do it…
Follow these steps:
- Import the CountVectorizer class, the helper functions from Chapter 1, Learning NLP Basics (the Dividing text into sentences recipe), and the helper functions from the Putting documents into a bag of words recipe:

from sklearn.feature_extraction.text import CountVectorizer
from Chapter01.dividing_into_sentences import read_text_file, preprocess_text, divide_into_sentences_nltk
from Chapter03.bag_of_words import get_sentences, get_new_sentence_vector
...