Text clustering with Sentence-BERT
For clustering algorithms, we need a model that is suitable for textual similarity. Let's use the paraphrase-distilroberta-base-v1 model here for a change. We will start by loading the Amazon Polarity dataset for our clustering experiment. This dataset consists of reviews from Amazon spanning a period of 18 years up to March 2013. The original dataset includes over 35 million reviews, which contain product information, user information, user ratings, and the review text. Let's get started:
- First, randomly select 10K reviews by shuffling, as follows:
    import pandas as pd, numpy as np
    import torch, os, scipy
    from datasets import load_dataset
    dataset = load_dataset("amazon_polarity", split="train")
    corpus = dataset.shuffle(seed=42)[:10000]['content']
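Before moving on, it can be worth a quick sanity check of the sample. The helper below is a minimal sketch and not part of the original walkthrough: `summarize_corpus` is a hypothetical name, and it assumes `corpus` is a plain list of strings, which is what slicing the shuffled dataset and selecting `'content'` should return. A small stand-in list is used so the snippet runs without downloading the dataset.

```python
from collections import Counter

def summarize_corpus(texts):
    """Return basic statistics for a list of review strings (hypothetical helper)."""
    # Word counts per review, using simple whitespace splitting
    lengths = [len(t.split()) for t in texts]
    # Count exact repeats so duplicated reviews can be spotted
    counts = Counter(texts)
    return {
        "n_texts": len(texts),
        "n_empty": sum(1 for t in texts if not t.strip()),
        "n_duplicates": sum(c - 1 for c in counts.values()),
        "mean_words": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_words": max(lengths, default=0),
    }

# Stand-in data (an assumption for illustration); pass `corpus` instead in practice
sample = ["Great book, loved it", "Terrible battery life", "", "Great book, loved it"]
stats = summarize_corpus(sample)
print(stats)
```

Empty or duplicated reviews add no information to the clusters, so a non-zero `n_empty` or `n_duplicates` may be a reason to filter the sample before encoding it.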
- The corpus is now ready for clustering. The following code instantiates a sentence-transformer object using the pre-trained
paraphrase-distilroberta...