Latent Dirichlet allocation (LDA) is a generative model used in natural language processing that lets you extract topics from a set of source documents and explain the similarity between individual parts of those documents. Each document is treated as a collection of words drawn from a mixture of one or more latent topics. Each topic is characterized by a particular distribution over terms.
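The generative view described above can be illustrated with a small simulation. This is only a minimal sketch using NumPy, not part of the recipe itself: the toy vocabulary, the number of topics, the document length, and the Dirichlet concentration value of 0.5 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (hypothetical values, for illustration only)
vocab = ["ball", "team", "goal", "vote", "law", "party"]
n_topics = 2
doc_length = 8

# Each topic is a distribution over terms, drawn from a Dirichlet prior
topic_term = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Each document has its own mixture of topics, also drawn from a Dirichlet prior
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# For every word position: pick a topic, then pick a term from that topic
topic_ids = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_term[z])] for z in topic_ids]

print("Topic mixture of the document:", doc_topic)
print("Generated words:", words)
```

Fitting an LDA model runs this process in reverse: given only the observed words, it estimates the topic-term distributions and the per-document topic mixtures that most plausibly generated them.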
Implementing LDA with scikit-learn
Getting ready
In this recipe, we will use the sklearn.decomposition.LatentDirichletAllocation class to fit a topic model on a feature matrix of token counts, similar to what the CountVectorizer class (just used in the Building a bag-of-words model recipe of Chapter 7, Analyzing...