Summary
When faced with the task of extracting information from a large, previously unseen collection of documents, topic modeling is a useful approach because it provides insight into the underlying structure of the documents. Topic models find word groupings based on the co-occurrence of words within documents, not on their context. In this chapter, we have learned how to apply two of the most common and effective topic modeling algorithms: latent Dirichlet allocation and non-negative matrix factorization. We should now feel comfortable cleaning raw text documents using several different techniques, all of which can be used in many other modeling scenarios. We continued by learning how to convert the cleaned corpus into the appropriate data structure of per-document raw word counts or word weights by applying bag-of-words models. The main focus of the chapter was fitting the two topic models, including optimizing the number of topics, converting the output to easy-to-interpret tables, and visualizing the results. With...