Introducing the LDA algorithm
In Chapter 3, Classifying Topics of Newsgroup Posts, we examined how to classify the instances of a newsgroup dataset into predefined topics. A related situation is encountered when we want to assign a topic label to a piece of text without prior knowledge of the available topics. Topic modeling refers to the task of identifying groups of items, in our case words, that best describes a collection of documents or sentences. The topics emerge during the specific process; hence they are called latent.
A popular topic modeling technique to extract the hidden topics from a given corpus is the latent dirichlet allocation (LDA). Strictly speaking, LDA is not a clustering algorithm because it produces a distribution of groupings over the sentences being processed. However, as a document can be a part of multiple topics, LDA resembles a soft clustering algorithm in which each data point belongs to more than one cluster. For this reason, we made it part of this...