Extracting keywords
In this recipe, we will extract keywords from a text. We will be working with the BBC news dataset that contains news articles. You can learn more about the dataset in Chapter 4, in the recipe titled Clustering sentences using K-Means: unsupervised text classification.
Extracting keywords gives us a quick idea of what an article is about and can also serve as the basis for a tagging system, for example, on a website.
For the extraction to work correctly, we first need to fit a TF-IDF vectorizer on the corpus. We then use the fitted vectorizer during the extraction phase to score the words in each article.
Getting ready
In this recipe, we will use the sklearn package. It is part of the Poetry environment. You can also install it together with the other packages by installing the requirements.txt file.
The BBC news dataset is available on Hugging Face at https://huggingface.co/datasets/SetFit/bbc-news.
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook...