Creating features with bag-of-words and n-grams
A Bag-of-Words (BoW) is a simplified representation of a piece of text that captures which words are present and how many times each one appears. So, for the text string Dogs like cats, but cats do not like dogs, the derived BoW is as follows:
Figure 11.4 – The BoW derived from the sentence Dogs like cats, but cats do not like dogs
Here, each word becomes a variable, and the value of the variable represents the number of times the word appears in the string. As you can see, the BoW captures multiplicity but does not retain word order or grammar. Despite this loss of information, it is a simple yet useful way of extracting features and capturing some information about the texts we are working with.
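To make the idea concrete, the following is a minimal sketch of how a BoW could be computed with the Python standard library. The bag_of_words helper is a hypothetical name introduced for illustration, and the choices to lowercase the text and to treat runs of letters as words are assumptions; other tokenization rules would give different counts.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a mapping of word -> number of occurrences in text.

    Hypothetical helper for illustration only.
    """
    # Assumption: lowercase first, so that "Dogs" and "dogs"
    # are counted as the same word.
    # Assumption: a word is a run of letters, so punctuation
    # such as the comma is dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


bow = bag_of_words("Dogs like cats, but cats do not like dogs")

# Each distinct word is a variable; its value is the word count.
for word, count in bow.items():
    print(word, count)
```

Because only the counts are kept, two sentences with the same words in a different order, such as Dogs like cats and Cats like dogs, produce the same BoW under these assumptions.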
To capture some syntax, BoW can be used together with n-grams. An n-gram is a contiguous sequence of n items in a given text. Continuing with the sentence Dogs like cats, but cats do not like...