LDA and BERTopic
Since the Transformer architecture arrived on the NLP stage in 2017 with the seminal paper Attention Is All You Need [1], many Transformer-based large language models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) [3], ChatGPT, and GPT-4 [4] have dominated technology headlines. The word embeddings produced by these LLMs capture more latent semantic relationships between words and documents than those from pre-LLM techniques such as BoW, TF-IDF, or Word2Vec.
These semantic relationships between words and documents naturally extend to document grouping, which is the aim of topic modeling: clustering documents into homogeneous groups. Can we take advantage of LLM word embeddings for topic modeling? This question motivates research on applying LLMs to topic modeling. An important technique along this line is BERTopic. It adopts BERT word embeddings and combines multiple techniques such as UMAP, HDBSCAN, c-TF-IDF, and MMR...
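To make this pipeline concrete, the following is a minimal sketch assuming the `bertopic`, `umap-learn`, `hdbscan`, and `scikit-learn` packages are installed; it uses the 20 Newsgroups corpus as a stand-in document set, and the parameter values shown are illustrative choices, not prescribed defaults.

```python
# Minimal BERTopic sketch: BERT embeddings -> UMAP -> HDBSCAN -> c-TF-IDF -> MMR.
# Assumes bertopic, umap-learn, hdbscan, and scikit-learn are installed.
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# A stand-in corpus; any list of document strings would do.
docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes"))["data"]

# UMAP reduces the high-dimensional sentence/word embeddings before clustering.
umap_model = UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric="cosine")

# HDBSCAN groups the reduced embeddings into density-based clusters (topics).
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

# MMR re-ranks candidate topic words to balance relevance and diversity.
representation_model = MaximalMarginalRelevance(diversity=0.3)

# BERTopic ties the steps together and applies c-TF-IDF to extract
# representative words for each cluster.
topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())  # one row per discovered topic
```

By default BERTopic computes document embeddings with a sentence-transformer model; swapping in a different embedding model or tuning the UMAP/HDBSCAN parameters above changes the granularity of the discovered topics.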