In this section, we first introduce how the bag-of-words (BoW) model converts text data into a numeric vector space representation that lets us compare documents based on the distance between their vectors. We then illustrate how to create a document-term matrix using the sklearn library.
From tokens to numbers – the document-term matrix
The BoW model
The BoW model represents a document based on the frequency of the terms or tokens it contains. Each document becomes a vector with one entry per token in the vocabulary, and each entry reflects that token's relevance to the document.
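To make the idea concrete, the following is a minimal sketch in plain Python rather than the sklearn workflow. The two sample documents, the whitespace tokenizer, and the helper names `tokenize`, `bow_vector`, and `cosine_similarity` are all hypothetical and chosen only for illustration. Raw term counts stand in for "relevance", and cosine similarity is just one common way to compare the resulting vectors.

```python
from collections import Counter
from math import sqrt

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return text.lower().split()

# Hypothetical toy corpus
docs = [
    "the market fell",
    "the market rose and the bond market fell",
]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def bow_vector(text, vocabulary):
    # One entry per vocabulary token: how often it occurs in the document.
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

# Stacking the document vectors row by row yields a document-term matrix.
dtm = [bow_vector(doc, vocabulary) for doc in docs]

def cosine_similarity(u, v):
    # Cosine of the angle between two count vectors (1.0 = same direction).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

similarity = cosine_similarity(dtm[0], dtm[1])

print(vocabulary)
for row in dtm:
    print(row)
print(round(similarity, 3))
```

Each row of `dtm` corresponds to a document and each column to a vocabulary token, so documents that use similar tokens with similar frequencies end up close together in this vector space.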
The document-term matrix is straightforward to compute given the vocabulary. However, it is also a crude simplification because it abstracts from word order...