K-Means topic modeling with BERT
In this recipe, we will use the K-Means algorithm to perform unsupervised topic classification, using BERT embeddings to encode the data. This recipe has much in common with the Clustering sentences using K-Means – unsupervised text classification recipe in Chapter 4.
The K-Means algorithm groups similar data points into clusters. It works with any kind of data that can be represented as numeric vectors, and it is an easy way to see trends in the data. It is frequently used during preliminary data analysis to quickly check the different types of data that appear in a dataset. To use it with text, we first encode the text as vectors using a sentence transformer model.
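The encode-then-cluster idea can be sketched in a few lines. This is a minimal illustration rather than the recipe's own code: the helper names cluster_embeddings, encode_texts, and demo are hypothetical, the model name all-MiniLM-L6-v2 is an assumption (the recipe may use a different model), and the sample sentences are made up.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_embeddings(embeddings, n_clusters, seed=0):
    """Fit K-Means on an (n_samples, dim) array and return one cluster label per row."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(np.asarray(embeddings))


def encode_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode a list of strings into dense vectors with a sentence transformer.

    The model name is an assumption made for this sketch.
    """
    # Imported lazily so that cluster_embeddings works without this package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)


def demo():
    """Cluster a few made-up sentences (downloads the model on first use)."""
    texts = [
        "The team won the championship game.",
        "The striker scored twice in the final.",
        "The central bank raised interest rates.",
        "Stock markets fell after the announcement.",
    ]
    labels = cluster_embeddings(encode_texts(texts), n_clusters=2)
    for label, text in zip(labels, texts):
        print(label, text)
```

Calling demo() prints a cluster number next to each sentence. The numbers themselves are arbitrary identifiers; K-Means does not name the topics, so interpreting each cluster is left to the reader.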
Getting ready
We will be using the sklearn.cluster.KMeans object to do the unsupervised clustering, along with HuggingFace sentence transformers. Both packages are part of the poetry environment.
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob...