In the last chapter, we converted unstructured text data into a numerical format using the bag-of-words model. This model abstracts from word order and represents documents as word vectors, where each entry represents the relevance of a token to the document.
The resulting document-term matrix (DTM), which you may also encounter in transposed form as the term-document matrix, is useful for comparing documents to each other or to a query vector based on their token content. This makes it possible to quickly find a needle in a haystack or to classify documents accordingly.
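As a reminder of how such a comparison works, the following minimal sketch builds a tiny DTM by hand and ranks documents against a query vector using cosine similarity. The three example documents, the whitespace tokenizer, the helper names (`tokenize`, `to_vector`, `cosine`), and the use of raw counts as the measure of relevance are all assumptions made for illustration; in practice you would more likely use a library vectorizer and a weighting scheme such as TF-IDF.

```python
from collections import Counter
import math

# Hypothetical toy corpus, invented for this illustration
docs = [
    "the central bank raised interest rates",
    "interest rates fell after the bank meeting",
    "the team won the football match",
]

def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace
    return text.lower().split()

# Vocabulary: one DTM column per distinct token, in sorted order
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_vector(text):
    # Bag-of-words vector: raw token counts, word order is discarded
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]

# Document-term matrix: one row per document, one column per token
dtm = [to_vector(doc) for doc in docs]

def cosine(u, v):
    # Cosine similarity between two count vectors (0.0 if either is all zeros)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Compare a query against every document and pick the closest one
query = to_vector("bank interest rates")
scores = [cosine(query, row) for row in dtm]
best = max(range(len(docs)), key=scores.__getitem__)

# Share of zero entries in the DTM
sparsity = sum(x == 0 for row in dtm for x in row) / (len(dtm) * len(vocab))

print(f"best match: {docs[best]!r}")
print(f"scores: {[round(s, 3) for s in scores]}")
print(f"sparsity: {sparsity:.2f}")
```

Even in this three-document corpus, more than half of the matrix entries are zero, and the share of zeros grows quickly as the vocabulary expands.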
However, this document model is both high-dimensional and very sparse. As a result, it does little to summarize a document's content or to bring us closer to understanding what the document is about. In this chapter, we will use unsupervised machine learning in the form of topic modeling to extract hidden themes from documents. These themes can produce detailed insights...