Putting documents into a bag of words
A bag of words is the simplest way of representing text. We treat our text as a collection of documents, where a document can be anything from a sentence to a book chapter to a whole book. We typically work with a collection rather than a single document, because we usually compare documents with each other or use them in the larger context of other documents.
The bag of words method uses a training text to build a vocabulary, that is, the list of words it should consider. When encoding a new document, it counts how many times each vocabulary word occurs in that document, and the final vector contains one count for each word in the vocabulary. Words that do not appear in the vocabulary are ignored. This representation can then be fed into a machine learning algorithm.
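As a rough illustration of the idea described above, here is a minimal sketch in plain Python. The function names (tokenize, build_vocabulary, encode), the simple regular-expression tokenizer, and the example sentences are all assumptions made for this sketch, not part of any particular library. In practice you would more likely use a ready-made vectorizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(training_docs):
    """Collect every distinct word in the training documents, in sorted order."""
    words = set()
    for doc in training_docs:
        words.update(tokenize(doc))
    return sorted(words)


def encode(doc, vocabulary):
    """Return one count per vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


# Hypothetical training collection: each string is one document.
training_docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = build_vocabulary(training_docs)
# vocabulary is ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']

# Encode a new document: "saw" and "bird" are not in the vocabulary, so they are dropped.
vector = encode("the cat saw the bird", vocabulary)
print(vector)  # [1, 0, 0, 0, 0, 0, 2]
```

Note that the vector always has the same length as the vocabulary, regardless of how long the encoded document is, and that word order is not preserved.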
The decision about what constitutes a document lies with the engineer, and in many cases it will be obvious. For example, if you are working on classifying tweets as belonging to a particular topic, a single tweet will be your document...