Thinking about features for text data
From the preceding analysis, we can safely conclude that if we want to figure out whether a document came from the rec.autos newsgroup, the presence or absence of words such as car, doors, and bumper can be very useful features. Whether a word is present is a Boolean variable, and we can also look at the count of certain words. For instance, car occurs multiple times in the document. Perhaps the more often such a word appears in a text, the more likely it is that the document has something to do with cars.
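To make the two kinds of features concrete, here is a minimal sketch that derives both a Boolean presence flag and a count for a few words. The sample sentence, the extra word engine, and the regular-expression tokenizer are assumptions made for illustration. They are not taken from the newsgroup data.

```python
import re
from collections import Counter

# Hypothetical sample text (an assumption, not a real newsgroup post)
doc = "My car has four doors. The car needs a new bumper."

# Naive tokenizer: lowercase the text and keep runs of letters only
tokens = re.findall(r"[a-z]+", doc.lower())
counts = Counter(tokens)

words_of_interest = ["car", "doors", "bumper", "engine"]

# Boolean feature: is the word present at all?
presence = {w: w in counts for w in words_of_interest}

# Count feature: how many times does the word occur?
frequency = {w: counts[w] for w in words_of_interest}

print(presence)
print(frequency)
```

The count carries strictly more information than the Boolean flag, since the flag can always be recovered by checking whether the count is greater than zero.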
Counting the occurrence of each word token
It seems that we are only interested in the occurrence of certain words, their count, or a related measure, and not in the order of the words. We can therefore view a text as an unordered collection of words. This is called the Bag of Words (BoW) model. It is a very basic model, but it works pretty well in practice. We can optionally define a more complex model that takes into account the order of words...
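As a rough illustration of what ignoring word order means, the sketch below builds a bag of words with Python's collections.Counter. The helper name bag_of_words, the whitespace tokenization, and the two example sentences are assumptions for illustration only. A real pipeline would typically use a more careful tokenizer.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for text, discarding word order.

    Hypothetical helper: lowercases and splits on whitespace only.
    """
    return Counter(text.lower().split())


# Two made-up sentences with the same words in a different order
a = bag_of_words("the car hit the bumper")
b = bag_of_words("the bumper hit the car")

# Both sentences map to the same bag, because only the counts are kept
print(a == b)
print(a["the"])
```

Because the two sentences contain the same words the same number of times, they are indistinguishable under this representation, which is exactly the information a bag-of-words model gives up.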