LDA topic modeling with gensim
Latent Dirichlet Allocation (LDA) is one of the oldest algorithms for topic modeling. It is a generative statistical model that represents each document as a mixture of topics and each topic as a probability distribution over words. In general, LDA is a good choice for longer texts, because short texts provide few word co-occurrences to learn from.
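To make the generative view concrete, here is a minimal sketch of how LDA assumes a document is produced. It uses only the Python standard library. The toy topics, their word probabilities, and the helper names sample_dirichlet and generate_document are hypothetical and chosen for illustration; in practice the topics are learned from data rather than written by hand.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate a toy document by following LDA's generative story."""
    # 1. Draw the document's topic mixture from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    names = list(topics)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(names, weights=theta, k=1)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words


# Hypothetical hand-written topics (assumption, for illustration only).
toy_topics = {
    "sport": {"match": 0.5, "team": 0.3, "goal": 0.2},
    "tech": {"software": 0.4, "phone": 0.4, "network": 0.2},
}

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", theta)
print("generated words:", words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.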
We will use LDA to create a topic model for the BBC news texts. We know that the BBC news dataset has five categories: tech, politics, business, entertainment, and sport. Thus, we will use five as the expected number of topics.
Getting ready
We will be using the gensim package, which is part of the poetry environment. You can also install the packages listed in the requirements.txt file to get it.
The dataset is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/bbc-text.csv and should be downloaded to the data folder.
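Once the file is in place, a quick check such as the following can confirm that it loads and that five categories are present. This is a sketch under assumptions: it assumes the CSV has a header row with category and text columns, that the script runs from the directory containing the data folder, and it uses the standard csv module rather than any particular loader the notebook may use. The function name load_bbc_texts is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def load_bbc_texts(path):
    """Read (category, text) pairs from the CSV file.

    Assumes a header row with 'category' and 'text' columns.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [(row["category"], row["text"]) for row in reader]


# Assumed location: the data folder relative to the working directory.
data_path = Path("data") / "bbc-text.csv"

if data_path.exists():
    rows = load_bbc_texts(data_path)
    print(len(rows), "documents")
    # Count documents per category; five distinct labels are expected.
    print(Counter(category for category, _ in rows))
else:
    print(f"{data_path} not found; download the dataset first")
```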
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing...