Topic modeling (TM) is a technique widely used for mining text from a large collection of documents. It discovers the topics that run through the collection, and these topics, each described by its topic terms and their relative weights, can then be used to summarize and organize the documents. The dataset used for this project is in plain, unstructured text format.
We will see how effectively the Latent Dirichlet Allocation (LDA) algorithm finds useful patterns in the data. We will also compare LDA with other TM algorithms and examine how well it scales. In addition, we will utilize Natural Language Processing (NLP) libraries, such as Stanford NLP.
In a nutshell, we will learn the following topics throughout this end-to-end project:
- Topic modeling and text clustering
- How does the LDA algorithm work?
- Topic modeling with LDA, Spark MLlib, and Stanford NLP
- Other topic...