Classifying news highlights with topic modeling
In this recipe, we are going to study one of the most interesting tasks in NLP: topic modeling. In this task, the goal is to discover the topics present in a given set of documents, and how many of them there are. Sometimes, the topics (and the number of topics) are known beforehand, and the supervised learning techniques that we have seen in previous chapters can be applied. However, in a typical scenario, topic modeling datasets do not provide ground truth, so the task must be treated as an unsupervised learning problem.
To achieve this, we will use a pre-trained model from the GluonNLP Model Zoo and feed its word embeddings to a clustering algorithm, which will yield the clustered topics. We will apply this process to a new dataset: 1 Million News Headlines.
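To make the idea concrete before we start, here is a minimal sketch of the pipeline just described: embed each headline, then cluster the embeddings. It is an illustration under loud assumptions, not the recipe's implementation. The two-dimensional vectors in `toy_embeddings` are invented stand-ins for pre-trained embeddings, the four headlines are made up, and the helpers `embed_headline`, `mean_point`, and `kmeans` are hypothetical names. A plain k-means loop is used here only as one possible clustering algorithm, and the code relies solely on the Python standard library.

```python
import math

# ASSUMPTION: invented 2-D vectors standing in for pre-trained word embeddings.
toy_embeddings = {
    "rain": (1.0, 0.1), "storm": (0.9, 0.2), "flood": (0.8, 0.0),
    "match": (0.1, 1.0), "team": (0.0, 0.9), "win": (0.2, 0.8),
}

# ASSUMPTION: made-up headlines, not taken from the real dataset.
headlines = ["storm flood rain", "rain storm", "team win match", "match team"]


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def embed_headline(text, table):
    """Average the vectors of the words in the headline that the table knows.

    Assumes the headline contains at least one known word.
    """
    return mean_point([table[w] for w in text.split() if w in table])


def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from its nearest already-chosen centroid.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean_point(members)
    return labels, centroids


points = [embed_headline(h, toy_embeddings) for h in headlines]
labels, centroids = kmeans(points, k=2)
for headline, label in zip(headlines, labels):
    print(label, headline)
```

With these toy vectors, the weather-related headlines and the sports-related headlines end up in different clusters. Note that the number of clusters `k` has to be chosen by us, which mirrors the fact that in topic modeling the number of topics is usually not known beforehand.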
Getting ready
As in previous chapters, in this recipe we will use a few matrix operations and a little linear algebra, but nothing difficult.
Furthermore, we will be working with text datasets. Therefore...