Building a BERTopic model
Because BERTopic is a Transformer-based model, there is generally no need to preprocess the texts with steps such as stop word removal or lemmatization; preserving the original structure of the text is important in the Transformer-based approach. Stop words are usually non-informative: if a document contains mostly stop words such as he, she, and they, it is likely to be assigned to the non-informative topic -1, which we will see shortly. That said, real-world texts often contain typos, and nouns can be singular or plural, so a BERTopic model trained on an unlemmatized corpus may produce redundant keywords such as court and courts, or cup and cups. You can still apply stop word removal and lemmatization and compare the outcomes.
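To make the court/courts redundancy concrete, here is a minimal sketch of the kind of comparison you could run. The stop word list and the trailing-s rule are deliberately naive stand-ins (a real comparison would use a library such as NLTK or spaCy); they are illustrative assumptions, not part of the book's pipeline:

```python
# Illustrative sketch only: a tiny stop word list and a naive
# singular/plural merge, standing in for real stop word removal
# and lemmatization.
STOP_WORDS = {"he", "she", "they", "the", "a", "an", "and", "of", "to"}

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Naive "lemmatization": strip a trailing "s" so "courts" -> "court".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]

print(preprocess("They appealed to the courts and the court agreed"))
# -> ['appealed', 'court', 'court', 'agreed']
```

Fitting BERTopic on both the raw and the preprocessed corpus lets you check whether merging such variants changes the topic keywords in your data.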
Loading the data – no text preprocessing
I will load the same AG News data that we have been using in this book:
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', None)
path = "/content...
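Since the snippet above is truncated, the following self-contained sketch shows the shape the loaded corpus takes before it is handed to BERTopic. The column name and sample rows are made up for illustration; BERTopic's fit_transform expects a plain list of document strings:

```python
import pandas as pd

# Hypothetical stand-in for the AG News frame: a DataFrame with one
# text column, from which we extract the list of documents.
df = pd.DataFrame({
    "description": [
        "Stocks rallied as the court ruled in favor of the merger.",
        "The team lifted the cup after a late goal.",
    ]
})

docs = df["description"].tolist()
print(len(docs))  # -> 2
```

In the real workflow, docs would then be passed to a BERTopic model, e.g. topics, probs = topic_model.fit_transform(docs).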