Topic models
Topic models are a powerful method for grouping documents by their main topics. Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics (Grün and Hornik, 2011). In essence, a document is assigned to a topic based on the distribution of the words in that document, and the other documents in that topic will have roughly the same word frequencies.
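To make the idea of a latent topic layer concrete, here is a minimal sketch of one such model, latent Dirichlet allocation (LDA), fitted with a collapsed Gibbs sampler in plain Python. It is an illustration under stated assumptions, not the implementation this text relies on: the function names fit_lda_gibbs and hellinger, the hyperparameter defaults alpha and beta, and the toy corpus are all invented for the example. The Hellinger distance is just one reasonable way to compare the topic proportions of two documents.

```python
import math
import random


def fit_lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA with a collapsed Gibbs sampler (illustrative sketch).

    docs is a list of token lists and k is the number of topics, which
    must be chosen up front. Returns one row of topic proportions per
    document. alpha and beta are assumed Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens in doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # total tokens assigned to topic t
    assignments = []                          # current topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + v * beta)
                    for t in range(k)
                ]
                # Resample the topic and restore the counts.
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
        for d, doc in enumerate(docs)
    ]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


# Hypothetical toy corpus: two documents about fruit, two about cars.
docs = [
    "apple banana fruit apple juice".split(),
    "banana fruit juice fruit apple".split(),
    "engine wheel car engine road".split(),
    "car road wheel road engine".split(),
]
theta = fit_lda_gibbs(docs, k=2)
for d, row in enumerate(theta):
    print(d, [round(p, 3) for p in row])
print("distance between documents 0 and 1:", round(hellinger(theta[0], theta[1]), 3))
```

Each row of theta is a probability distribution over the k topics, so two documents can be compared through their rows rather than through raw word counts. Because the sampler is random, the topic labels themselves are arbitrary and the proportions change with the seed and the number of iterations.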
The algorithm that we will focus on is Latent Dirichlet Allocation (LDA) with Gibbs sampling, which is probably the most commonly used sampling algorithm. In building topic models, the number of topics (k) must be determined before running the algorithm. If there is no a priori reason to choose a particular number of topics, you can build several models and apply judgment and knowledge to the final selection. LDA with...