From the preceding analysis, we can safely conclude that if we want to determine whether a document came from the rec.autos newsgroup, the presence or absence of words such as car, doors, and bumper can be very useful features. The presence or absence of a word is a Boolean variable, and we can also look at the count of certain words. For instance, car occurs multiple times in the document. Perhaps the more often such a word appears in a text, the more likely it is that the document has something to do with cars.
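To make the two kinds of features concrete, here is a minimal sketch using only the Python standard library. The sample sentence, the keyword list (including engine, which is added purely as a word that does not occur), and the crude regular-expression tokenizer are all assumptions made for illustration; they are not taken from the newsgroup data.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration (not from the dataset).
doc = "My car has four doors. The car needs a new bumper, and the car is old."

# Hypothetical keywords we suspect are indicative of rec.autos.
keywords = ["car", "doors", "bumper", "engine"]

# Lowercase the text and keep runs of letters; a deliberately crude tokenizer.
tokens = re.findall(r"[a-z]+", doc.lower())
counts = Counter(tokens)

# Boolean feature: is the word present at all?
presence = {word: word in counts for word in keywords}

# Count feature: how many times does the word occur?
occurrences = {word: counts[word] for word in keywords}

print(presence)
print(occurrences)
```

In this toy document, car is both present and frequent, while engine is absent, so the count feature carries strictly more information than the Boolean one.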
Thinking about features for text data
Counting the occurrence of each word token
It seems that we are only interested in the occurrence of certain words, their count, or a related measure and not in the order...