Choosing an optimal number of topics
To derive the best value from topic modeling, we must choose the optimal number of topics. This can be achieved using a measure of coherence within the topics. Coherence evaluates the quality of the topics by measuring how semantically similar the top words of a topic are. There are various types of coherence measures; however, most of them are based on the calculation of pairwise word co-occurrence statistics. Higher coherence scores typically mean that the topics are more coherent and semantically meaningful.
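To make the idea of pairwise word co-occurrence concrete, here is a minimal sketch in plain Python. It is not the implementation used by any particular library. It assumes document-level co-occurrence counts with add-one smoothing, and the function name `pairwise_coherence` and the toy corpus are hypothetical, introduced only for illustration.

```python
import math


def pairwise_coherence(top_words, documents):
    """Sum log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words.

    D(...) counts the documents that contain all the given words.
    This scoring rule is an assumption made for illustration.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                # Skip words that never occur to avoid division by zero.
                continue
            score += math.log((doc_freq(w_i, w_j) + 1) / d_j)
    return score


# Hypothetical toy corpus: three "pet" documents and two "finance" documents.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "pet", "food"],
    ["dog", "pet", "walk"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
]

coherent = pairwise_coherence(["pet", "cat", "dog"], docs)
mixed = pairwise_coherence(["pet", "market", "walk"], docs)
print(coherent > mixed)
```

Top words that tend to appear in the same documents score higher than a mix of unrelated words. This is the property we rely on when comparing models trained with different numbers of topics.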
In gensim, we will work with two coherence measures: the UMass coherence (C_umass) and the C_v coherence. C_umass calculates the pairwise word co-occurrence statistics between the top words in a topic and returns the sum of these scores. Conversely, C_v compares the top words in a topic to a background corpus of words to estimate coherence. It compares the probability of co-occurrence of the top words in the topic to the probability...