Bag-of-words
A simple yet effective way of classifying text is to treat the text as a bag-of-words. This means that we do not care about the order in which words appear in the text; we only care about which words appear in it.
One way of doing bag-of-words classification is to simply count the occurrences of the different words in a text. This is done with a so-called count vector: each word has an index, and for each text, the value of the count vector at that index is the number of occurrences of the word belonging to that index.
As an example, the count vector for the text "I see cats and dogs and elephants" could look like this:
| i | see | cats | and | dogs | elephants |
|---|-----|------|-----|------|-----------|
| 1 | 1   | 1    | 2   | 1    | 1         |
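To make this concrete, here is a minimal sketch in Python of how such a count vector could be built. The `count_vector` helper and the hard-coded vocabulary are purely illustrative, not part of any particular library; in practice the vocabulary would be built from the whole corpus.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Build a count vector: one slot per vocabulary word,
    holding the number of times that word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Illustrative vocabulary for the example sentence only.
vocabulary = ["i", "see", "cats", "and", "dogs", "elephants"]

print(count_vector("I see cats and dogs and elephants", vocabulary))
# -> [1, 1, 1, 2, 1, 1]
```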
In reality, count vectors are pretty sparse. There are about 23,000 different words in our text corpus, so it makes sense to limit the number of words we want to include in our count vectors. This could mean excluding words that are often just gibberish or typos with...