Clustering sentences using K-Means – unsupervised text classification
In this recipe, we will use the BBC news dataset. The dataset contains news pieces sorted by five topics: politics, tech, business, sport, and entertainment. We will apply the unsupervised K-Means algorithm to sort the data into unlabeled classes.
After reading this recipe, you will be able to create your own unsupervised clustering model that sorts data into several classes. You can then apply it to any text data without first having to label it.
Getting ready
We will use the KMeans algorithm to create our unsupervised model. It is part of the sklearn package and is included in the poetry environment.
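To make the idea concrete before working with the real data, here is a minimal sketch of how KMeans from sklearn can group short texts without labels. The toy sentences, the choice of TF-IDF vectors, and the parameter values are assumptions made for illustration only; they are not the dataset or the vectorization used in this recipe.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences (not taken from the BBC news dataset).
docs = [
    "The team won the football match after scoring a late goal",
    "The striker scored a goal and the team won the match",
    "The central bank raised interest rates to slow inflation",
    "Inflation fell after the bank changed interest rates",
]

# Assumption for this sketch: represent each sentence as a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Ask for two clusters; random_state makes the run repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Print the cluster number assigned to each sentence.
for label, doc in zip(labels, docs):
    print(label, doc)
```

The cluster numbers are arbitrary identifiers, not topic names: the algorithm only groups similar texts together, and it is up to us to inspect each group and decide what it represents.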
The BBC news dataset as we use it here was uploaded by a Hugging Face user, and the link and the dataset might change over time. To avoid any potential issues, you can use the BBC dataset uploaded to the book’s GitHub repository by loading it from the CSV file provided in the data...