Putting documents into a bag of words
A bag of words is the simplest way of representing a text. We treat our text as a collection of documents, where a document can be anything from a sentence to a scientific article, a blog post, or a whole book. Since we usually compare documents to each other or use them in the larger context of other documents, we work with a collection rather than a single document.
The bag of words method uses a “training” text that provides the list of words it should consider, that is, its vocabulary. When encoding a new document, it counts how many times each vocabulary word occurs in that document, and the resulting vector holds one count for each word in the vocabulary. This representation can then be fed into a machine learning algorithm.
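The encoding step just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (tokenize, build_vocabulary, encode), the whitespace tokenizer, and the sample sentences are all hypothetical and not taken from the text, and a real project would more likely use a library vectorizer.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_vocabulary(training_docs):
    # Sorted list of the distinct words seen in the "training" documents.
    return sorted({word for doc in training_docs for word in tokenize(doc)})

def encode(doc, vocabulary):
    # One count per vocabulary word; words outside the vocabulary are ignored.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

# Hypothetical sample data for illustration only.
training_docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = build_vocabulary(training_docs)
vector = encode("the cat saw the bird", vocabulary)

print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 0, 0, 0, 0, 0, 2]
```

Note that "saw" and "bird" do not appear in the training text, so they contribute nothing to the vector. The vector's length always equals the vocabulary size, regardless of how long the new document is.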
The reason this vectorizing method is called a bag of words is that it ignores the relationships between words, such as their order, and only counts the number of occurrences of each word....
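A small sketch can make the "bag" idea concrete: two sentences with the same words in a different order end up with identical counts. The bag helper and the example sentences are hypothetical, and the whitespace tokenizer is an assumption made for brevity.

```python
from collections import Counter

def bag(text):
    # Count word occurrences; word order is discarded by construction.
    return Counter(text.lower().split())

# Hypothetical sentences with opposite meanings but the same words.
a = bag("the dog bit the man")
b = bag("the man bit the dog")

same_bag = (a == b)
print(same_bag)  # True
```

The two sentences mean different things, yet the representation cannot tell them apart, which is exactly the limitation the name points to.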