Determining the optimal number of topics
What defines a topic? A topic should be distinctive enough to represent a single concept and the words associated with that concept. If, on the other hand, a topic is a mixed BoW that is not concrete enough, it is better to split it into two or more topics. The closeness of the words in a topic is therefore an important measure: words in the same topic should be close to each other.
In NLP, the metric that measures the closeness of the words in a topic is called the coherence score. In Chapter 5, Cosine Similarity, we learned about cosine similarity, which measures the similarity between any two words. The coherence score is the average or median of the pairwise word similarities of the top words in a topic, a definition given by Röder, Both, and Hinneburg (2015) [2]. There are three metrics for computing the coherence score, as outlined here:
- Content Vectors (CV): The default metric of gensim
- UMass: A more popular...
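
To make this concrete, here is a minimal sketch of how such a comparison can be run with gensim's CoherenceModel. The toy tokenized documents, the candidate topic counts (2 to 4), and the choice to score the top five words per topic are assumptions made purely for illustration, not part of this chapter's running example:

```python
# Minimal sketch: train LDA models with different numbers of topics and
# compare their c_v coherence scores (the toy corpus below is illustrative).
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized documents (assumed for this sketch)
texts = [
    ["bank", "loan", "credit", "interest", "finance"],
    ["river", "bank", "water", "fish", "boat"],
    ["loan", "credit", "finance", "mortgage", "bank"],
    ["water", "river", "boat", "fishing", "lake"],
    ["credit", "interest", "mortgage", "loan", "finance"],
    ["fish", "lake", "boat", "river", "water"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

scores = {}
for k in range(2, 5):  # candidate numbers of topics
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, random_state=0, passes=10)
    # c_v is gensim's default coherence metric; it needs the tokenized texts
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v", topn=5)
    scores[k] = cm.get_coherence()
    print(f"num_topics={k}: c_v coherence = {scores[k]:.4f}")

best_k = max(scores, key=scores.get)
print(f"Highest coherence at num_topics={best_k}")
```

On a real corpus, a common practice is to plot the coherence score against the number of topics and take the peak, or the point where the curve levels off, as the candidate for the optimal number of topics.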