Bag of Words models have a few less-than-ideal properties that are worth noting.
The first problem with the Bag of Words models we've looked at so far is that they ignore context: they capture which words appear in a document, but not the relationships between those words or the order in which they occur.
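A quick way to see the loss of word order is to compare two sentences with opposite meanings. Under a count-based Bag of Words representation (sketched here with `collections.Counter`; the sentences are illustrative), they are indistinguishable:

```python
from collections import Counter

# Bag of Words keeps only word counts, so these two sentences --
# which mean very different things -- produce identical representations.
a = Counter("man bites dog".split())
b = Counter("dog bites man".split())
print(a == b)  # True: the model cannot tell them apart
```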
A second, related concern is that the assignment of words to positions in the vector space is somewhat arbitrary, so any relationship between two words in the corpus vocabulary goes uncaptured. For example, a model that has learned to process the word alligator can leverage very little of that learning when it comes across the word crocodile, even though alligators and crocodiles are somewhat similar creatures that share many characteristics (bring on the herpetologist hate mail).
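The alligator/crocodile problem can be made concrete: in a Bag of Words vector space, every word occupies its own dimension, so the vectors for any two distinct words are orthogonal no matter how related the words are. A minimal sketch (the corpus sentences and the `bow_vector` helper are hypothetical):

```python
from collections import Counter

# A tiny corpus from which to build a vocabulary (illustrative sentences).
docs = [
    "the alligator swam across the swamp",
    "the crocodile swam across the river",
]
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(text, vocab):
    """Count-based Bag of Words vector over a fixed vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

# Each word gets its own dimension, so the dot product between the
# "alligator" and "crocodile" vectors is zero: the representation
# encodes no similarity between them at all.
alligator = bow_vector("alligator", vocab)
crocodile = bow_vector("crocodile", vocab)
dot = sum(a * c for a, c in zip(alligator, crocodile))
print(dot)  # 0
```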
Lastly, because the vocabulary of a corpus can be very large, Bag of Words vectors are high-dimensional, and since any single document uses only a small fraction of that vocabulary, the vectors are also extremely sparse: almost every entry is zero.