Creating features with bag-of-words and n-grams
A bag-of-words (BoW) is a simplified representation of a piece of text that records which words are present and how many times each word appears. For example, for the text string “Dogs like cats, but cats do not like dogs”, the derived BoW is as follows:
dogs | like | cats | but | do | not |
2 | 2 | 2 | 1 | 1 | 1 |
Figure 11.4 – BoW derived from the sentence “Dogs like cats, but cats do not like dogs”
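The counts in the table can be reproduced with a few lines of Python. The following is a minimal sketch using only the standard library; the regular-expression tokenizer and the names text, tokens, and bow are assumptions made for illustration, not necessarily the tooling used elsewhere in this chapter:

```python
import re
from collections import Counter

text = "Dogs like cats, but cats do not like dogs"

# Lowercase the text and keep runs of letters as tokens
# (an assumed, deliberately simple tokenizer that drops punctuation).
tokens = re.findall(r"[a-z]+", text.lower())

# Count how many times each token occurs; word order is not retained.
bow = Counter(tokens)

print(dict(bow))
# {'dogs': 2, 'like': 2, 'cats': 2, 'but': 1, 'do': 1, 'not': 1}
```

Lowercasing before counting is what makes "Dogs" and "dogs" fall into the same column; without it, they would be counted as two different words.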
Here, each word becomes a variable, and the value of the variable represents the number of times the word appears in the string. As you can see, BoW captures multiplicity but does...