A bag-of-words (BoW) is a simplified representation of a text that records which words are present and how many times each one appears. So, for the text string "Dogs like cats, but cats do not like dogs", the derived BoW is as follows:
dogs | like | cats | but | do | not |
2    | 2    | 2    | 1   | 1  | 1   |
Here, each word becomes a variable, and the value of the variable is the number of times the word appears in the string. Note that the counts treat Dogs and dogs as the same word, which assumes the text was lowercased and stripped of punctuation before counting. As you can see, BoW captures multiplicity but does not retain word order or grammar. Despite that loss, it is a simple, yet useful, way of extracting features and capturing some information about the texts we are working with.
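The counting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the function name bag_of_words is hypothetical, and the tokenization rule (lowercase the text, then keep runs of the letters a to z) is an assumption chosen to reproduce the counts in the table.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each word in text to its number of occurrences."""
    # Assumption: lowercase first so that "Dogs" and "dogs" are one word,
    # and treat any run of the letters a-z as a token (this drops punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


bow = bag_of_words("Dogs like cats, but cats do not like dogs")
print(dict(bow))
# {'dogs': 2, 'like': 2, 'cats': 2, 'but': 1, 'do': 1, 'not': 1}
```

Each key of the resulting dictionary corresponds to one variable (column), and each value to the count stored in that column. A different tokenization rule, for example one that keeps digits or apostrophes, would give different variables for other texts.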
To capture some syntax, BoW can be used together with n-grams. An n-gram is a contiguous sequence of n items in a given text. Continuing with the sentence Dogs like cats, but...