LDA topic modeling with sklearn
In this recipe, we will use the Latent Dirichlet Allocation (LDA) algorithm to discover the topics that appear in the BBC dataset. This algorithm can be thought of as dimensionality reduction: instead of a representation where words are counted (such as how we represent documents using CountVectorizer or TfidfVectorizer; see Chapter 3, Representing Text: Capturing Semantics), we represent documents as sets of topics, each topic with a weight. The number of topics is much smaller than the number of words in the vocabulary. To learn more about how the LDA algorithm works, see https://highdemandskills.com/topic-modeling-intuitive/.
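To make the dimensionality-reduction idea concrete, here is a minimal sketch. It uses a tiny made-up corpus rather than the BBC dataset, and the number of topics (2) and the random seed are assumptions chosen only for illustration. It shows the document matrix shrinking from one column per vocabulary word to one column per topic.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical; stands in for the BBC articles)
docs = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the bank raised interest rates again",
    "markets fell after the bank announced new rates",
]

# Word-count representation: one column per vocabulary word
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Topic representation: one column per topic (n_topics is an assumed value)
n_topics = 2
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(counts)

print("count matrix shape:", counts.shape)      # (documents, vocabulary size)
print("topic matrix shape:", doc_topics.shape)  # (documents, topics)
print(doc_topics.round(2))                      # per-document topic weights
```

Each row of the topic matrix holds the weights of the topics for one document, and the weights in a row sum to 1. On a corpus this small the topics themselves are not meaningful; the point is only the change in shape.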
Getting ready
We will use the sklearn and pandas packages. If you haven't installed them, do so using the following commands:
pip install scikit-learn
pip install pandas
How to do it…
We will use a dataframe to read in the data, then represent the documents using the CountVectorizer object, apply the LDA algorithm, and...