Latent Dirichlet Allocation
In 2003, David Blei, Andrew Ng, and Michael Jordan published the paper that introduced the topic modeling algorithm known as Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model: it assumes that the observed text was produced by a random process, and fitting the model means starting with the text and working backward through that assumed process to identify the parameters of interest. Here, the parameters of interest are the topics that generated the data. The process discussed here is the most basic form of LDA, which also makes it the easiest form to learn from.
The corpus contains M documents available for topic modeling. Each document can be treated as a sequence of N words, (w1, w2, …, wN).
For each document in the corpus, the assumed generative process is:
- Select N ~ Poisson(λ), where N is the number of words and λ is the parameter controlling the Poisson distribution.
- Select θ ~ Dirichlet(α), where θ is the distribution of topics and α is the parameter controlling the Dirichlet distribution.
- For each...
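The generative process can be made concrete with a short simulation. The following is a minimal sketch, not a reference implementation. It assumes the per-word step of the original Blei, Ng, and Jordan formulation: for each of the N words, draw a topic from the document's topic distribution, then draw a word from that topic's distribution over the vocabulary. The function name generate_document, the arguments alpha, beta, and lam, and the toy sizes used below are illustrative assumptions, not part of the text above.

```python
import numpy as np


def generate_document(alpha, beta, lam, rng):
    """Sample one document from the basic LDA generative process.

    alpha : (K,) array of Dirichlet parameters, one per topic (assumed name)
    beta  : (K, V) array; row k is topic k's distribution over V vocabulary words
    lam   : rate of the Poisson distribution that controls document length
    rng   : a numpy.random.Generator
    """
    n_topics, vocab_size = beta.shape

    # Step 1: draw the document length, N ~ Poisson(lam).
    n_words = rng.poisson(lam)

    # Step 2: draw the document's topic distribution from a Dirichlet.
    theta = rng.dirichlet(alpha)

    # Step 3 (assumed from the original paper): for each word position,
    # draw a topic from theta, then a word id from that topic's row of beta.
    topics = rng.choice(n_topics, size=n_words, p=theta)
    words = np.array(
        [rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words


# Toy usage: M = 4 documents, K = 3 topics, V = 6 vocabulary words.
rng = np.random.default_rng(0)
alpha = np.full(3, 0.5)
beta = rng.dirichlet(np.ones(6), size=3)  # one word distribution per topic
corpus = [generate_document(alpha, beta, lam=10, rng=rng)[2] for _ in range(4)]
```

Each entry of corpus is an array of integer word ids. Topic modeling runs this process in reverse: given only the word ids, it estimates the topic distributions that plausibly produced them.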