K-means topic modeling with BERT
In this recipe, we will use the K-means algorithm to perform unsupervised topic classification, using BERT embeddings to encode the data. This recipe has much in common with the Clustering sentences using K-means: unsupervised text classification recipe from Chapter 4, Classifying Texts.
Getting ready
We will be using the sklearn.cluster.KMeans object to do the unsupervised clustering, along with Hugging Face sentence transformers. To install sentence transformers, use the following commands:
conda create -n newenv python=3.6.10 anaconda
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
pip install transformers
pip install -U sentence-transformers
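Before walking through the steps, it may help to see the overall shape of the approach: sentences are encoded into dense vectors, and those vectors are then clustered with K-means. The following is a minimal sketch, not the recipe's own code. The helper names encode_sentences and cluster_embeddings are hypothetical, and the model name all-MiniLM-L6-v2 is an assumption chosen for illustration. The demonstration at the end uses synthetic stand-in vectors so that it runs without downloading a model.

```python
import numpy as np
from sklearn.cluster import KMeans


def encode_sentences(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode sentences into dense vectors.

    The model name is an assumption for illustration; any
    sentence-transformers model could be substituted.
    """
    # Imported lazily so the clustering helper works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences)


def cluster_embeddings(embeddings, n_clusters, seed=0):
    """Cluster embedding vectors with K-means.

    Returns the cluster label of each row and the cluster centers.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(np.asarray(embeddings))
    return labels, km.cluster_centers_


# Synthetic stand-in for real embeddings: two well-separated groups
# of five 8-dimensional vectors each.
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 8)),
    rng.normal(loc=10.0, scale=0.1, size=(5, 8)),
])
demo_labels, demo_centers = cluster_embeddings(fake_embeddings, n_clusters=2)
```

With real data, the output of encode_sentences(list_of_texts) would be passed to cluster_embeddings in place of the synthetic array. The number of clusters has to be chosen in advance, which is a property of K-means itself.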
How to do it…
The steps for this recipe are as follows:
- Perform the necessary imports:
import re
import string
import pandas as pd
from sklearn.cluster import KMeans
from nltk.probability import FreqDist
from Chapter01.tokenization import tokenize_nltk...