Topic modeling
A final, very useful application of word counting is topic modeling. Given a set of texts, can we find clusters of topics within them? The method for doing this is called Latent Dirichlet Allocation (LDA).
Note
The code and data for this section can be found on Kaggle at https://www.kaggle.com/jannesklaas/topic-modeling-with-lda.
While the name is quite a mouthful, the algorithm is a very useful one, so we will look at it step by step. LDA makes the following assumptions about how texts are written:
First, a topic distribution is chosen, say 70% machine learning and 30% finance.
Second, the distribution of words for each topic is chosen. For example, the topic "machine learning" might be made up of 20% the word "tensor," 10% the word "gradient," and so on. This means that our topic distribution is a distribution over distributions, also called a Dirichlet distribution.
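To build some intuition for the first step, here is a minimal sketch of drawing a topic distribution from a Dirichlet distribution using NumPy. The two topic names and the concentration parameters `alpha` are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical topics: "machine learning" and "finance".
# alpha controls how concentrated the sampled distributions are;
# these values are chosen purely for illustration.
alpha = [7.0, 3.0]

# Each draw from the Dirichlet is itself a probability
# distribution over the topics: non-negative, summing to 1.
topic_dist = rng.dirichlet(alpha)

print(topic_dist)        # e.g. something like [0.7..., 0.2...]
print(topic_dist.sum())  # always 1.0 (up to floating point)
```

With larger `alpha` values, the draws cluster tightly around the mean proportions; with values below 1, the draws tend toward the corners, favoring a single dominant topic per document.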
Once the text gets written, two probabilistic decisions are made for each word: first, a topic is chosen from the...