Topic modeling using clustering techniques and label derivation
We’ll start our exploration of topic modeling with some general considerations for grouping semantically similar documents, and then we’ll work through a specific example.
Grouping semantically similar documents
Like most of the machine learning problems we’ve discussed so far, the overall task generally breaks down into two sub-problems: representing the data and performing a task based on those representations. We’ll look at these two sub-problems next.
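To make the two sub-problems concrete, here is a minimal sketch in plain Python. It is an illustration only, not the approach developed later in the chapter: it assumes whitespace tokenization, a simple bag-of-words count for the representation step, and a greedy cosine-similarity grouping for the task step. The function names, the toy documents, and the 0.3 threshold are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Sub-problem 1: represent a document as lowercase token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_documents(docs, threshold=0.3):
    """Sub-problem 2: greedily group documents by similarity.

    Each document joins the first group whose first member is at least
    `threshold` similar to it; otherwise it starts a new group.
    The threshold value is an assumption chosen for this toy data.
    """
    vectors = [bag_of_words(d) for d in docs]
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Toy documents invented for illustration.
docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock prices fell sharply today",
    "stock markets fell again today",
]
groups = group_documents(docs)
print(groups)
```

On this toy input, the two documents about cats end up in one group and the two about stocks in another. Because the representation only counts surface tokens, documents that express the same idea in different words would not be grouped together, which is one reason the choice of representation matters.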
Representing the data
The data representations we’ve looked at so far were reviewed in Chapter 7. They include the simple bag of words (BoW) variants, term frequency-inverse document frequency (TF-IDF), and newer approaches, including Word2Vec. Word2Vec is based on word vectors, which represent each word in isolation, without taking into account the context in which it occurs. A newer...