Describing how BERTopic works
BERTopic uses BERT word embedding vectors for topic modeling [5]. A document is first embedded into word vectors. The high-dimensional vectors then go through dimensionality reduction so that they can be clustered into topics. BERTopic consists of a sequence of five modular components, as shown in Figure 14.5. The five modules are designed to be as independent as possible so that data scientists can swap in an alternative technique for any module; for instance, the default clustering method, HDBSCAN, can be replaced with K-means. The techniques shown in the figure are the default components of BERTopic. At the end of the chapter, I will illustrate how to model with alternative techniques.
Figure 14.5 – BERTopic structure
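To make the embed–reduce–cluster flow concrete, here is a minimal sketch of that pipeline using scikit-learn stand-ins. This is an illustration only, not BERTopic's actual implementation: TfidfVectorizer stands in for BERT embeddings, TruncatedSVD for UMAP dimensionality reduction, and K-means (the alternative named above) for HDBSCAN; the example documents are invented.

```python
# Sketch of the modular topic-modeling pipeline with scikit-learn
# stand-ins (NOT BERTopic's real components):
#   embeddings -> dimensionality reduction -> clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stocks fell sharply on wall street",
    "the stock market rallied today",
]

# 1. Embed each document into a high-dimensional vector
#    (stand-in for BERT word embeddings).
X = TfidfVectorizer().fit_transform(docs)

# 2. Reduce dimensionality before clustering (stand-in for UMAP).
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# 3. Cluster the reduced vectors into topics (K-means in place of HDBSCAN).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)  # one topic label per document
```

Because each stage only consumes the previous stage's output, any stage can be replaced independently, which is exactly the modularity BERTopic's five-component design provides.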
Let’s review each technique in the next sections.
BERT – word embeddings
The first block converts a document into a numerical representation. BERTopic uses the BERT-based word embeddings of paraphrase...