Topic modeling with Latent Dirichlet Allocation in Spark 2.0
In this recipe, we demonstrate topic model generation, using Latent Dirichlet Allocation (LDA) to infer topics from a collection of documents.
We covered LDA in previous chapters as it applies to clustering and topic modeling. In this chapter, we work through a more elaborate example that applies it to text analytics on larger, more realistic datasets.
We also apply NLP techniques such as stemming and stop-word removal to make the approach to LDA problem-solving more realistic. The goal is to discover a set of latent factors (that is, factors different from the original features) that describe the documents more efficiently in a reduced computational space.
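To make the two preprocessing steps concrete, here is a minimal sketch in plain Python, independent of Spark. The stop-word list, the suffix rules, and the names naive_stem and preprocess are all hypothetical and chosen only for illustration. The suffix stripping is a toy stand-in for a real stemmer such as Porter's, not the method used in this recipe.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real pipelines use far larger ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Naive suffix rules for illustration only; this is NOT the Porter stemmer.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The topics are inferred in clustered documents")
print(tokens)
```

Dropping stop words before stemming keeps very common words from dominating every topic, and stemming merges inflected forms so that, for example, "topics" and "topic" count as the same term.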
The first question that always comes up when using LDA and topic modeling is: what is Dirichlet? Dirichlet is simply a type of probability distribution, specifically a distribution over vectors of proportions that sum to one, and nothing more. Please see the following link from the University of Minnesota for details...
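As a rough illustration of what "a type of distribution" means here, the sketch below draws samples from a Dirichlet distribution using only the Python standard library. It relies on the standard construction of normalizing independent gamma draws. The function name sample_dirichlet, the seed, and the concentration values are assumptions made for this example and are not taken from the recipe.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) values."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)  # fixed seed so repeated runs give the same draws

# Hypothetical concentration parameters for a three-topic mixture.
sparse = sample_dirichlet([0.1, 0.1, 0.1], rng)     # small alpha
smooth = sample_dirichlet([10.0, 10.0, 10.0], rng)  # large alpha

print("small alpha:", sparse)
print("large alpha:", smooth)
```

Each sample is a vector of non-negative proportions that sums to one, which is why the distribution is a natural fit for describing a document's mix of topics. Small concentration values tend to put most of the mass on a few components, while large values tend to give more even mixtures.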