The bag-of-words model considers isolated terms called unigrams. This loses the order of the words, which can be important in some cases. A generalization of the technique is called n-grams, where we keep single words as well as word pairs or word triplets, called bigrams and trigrams, respectively. In general, an n-gram representation keeps sequences of up to n consecutive words as features. Naturally, this representation has unfavorable combinatorial characteristics: the feature space grows rapidly with n, so processing a large corpus can require significant computing power.
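To make the idea concrete, here is a minimal sketch of n-gram extraction in Python, applied to the duck sentence used throughout this section. The `ngrams` helper and the naive tokenization are illustrative assumptions for this sketch only; they are not the build_dfm() function referred to below.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("If it looks like a duck, swims like a duck, and quacks like a duck, "
            "then it probably is a duck.")

# Naive tokenization: lowercase and strip punctuation (a real pipeline would
# use a proper tokenizer).
tokens = [w.strip(".,").lower() for w in sentence.split()]

# Keep every n-gram up to n = 3: unigrams, bigrams, and trigrams.
features = Counter()
for n in range(1, 4):
    features.update(ngrams(tokens, n))

# The most frequent features include the unigrams "a" and "duck", the bigram
# "a duck", and the trigram "like a duck".
print(features.most_common(5))
```

Note how the same text now produces many more features than the unigram-only representation; this is the combinatorial growth mentioned above.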
Using the sentence object we created earlier to illustrate the tokenization process (it contains the sentence: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.) and the build_dfm() function...