Text preprocessing
We discussed earlier that good text preprocessing yields a good model outcome. Text preprocessing includes stop word removal and lemmatization; domain-specific words that appear in nearly every document can be treated as too common and removed as well. You are advised to perform text preprocessing for LDA modeling too. LSA/LSI, LDA, and Ensemble LDA all require text preprocessing for a better modeling outcome. In contrast, Word2Vec, Doc2Vec, text summarization, and Bidirectional Encoder Representations from Transformers (BERT) topic modeling do not necessarily need text preprocessing.
This chapter uses the same AG's corpus of news articles (as mentioned in the Preface) so that you can focus on learning the techniques rather than on a new dataset. The text preprocessing task here is very similar to that in Chapter 6, Latent Semantic Indexing with Gensim. Hence, I will just go through the same text preprocessing code without going into much detail.
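As a reminder of what that pipeline looks like, here is a minimal sketch of the kind of preprocessing described above, stop word removal, lemmatization, and filtering of domain-specific common words, using Gensim's tokenizer and stop word list together with NLTK's WordNet lemmatizer. The file name ag_news.csv, the text column name, and the choice of domain-specific words are assumptions for illustration; adjust them to match your copy of the AG's corpus.

```python
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet') once

# Hypothetical path and column name -- adjust to your copy of the AG's corpus.
df = pd.read_csv('ag_news.csv')

lemmatizer = WordNetLemmatizer()

# Illustrative domain-specific words: they occur in almost every news article
# and carry little topical signal, so we drop them along with standard stop words.
domain_stopwords = {'said', 'reuters', 'ap'}

def preprocess(text):
    """Tokenize, lowercase, remove stop words, and lemmatize one document."""
    tokens = simple_preprocess(text, deacc=True)  # tokenize, lowercase, strip punctuation
    tokens = [t for t in tokens
              if t not in STOPWORDS and t not in domain_stopwords and len(t) > 2]
    return [lemmatizer.lemmatize(t) for t in tokens]

# Each document becomes a list of cleaned tokens, ready for dictionary
# and corpus construction in the LDA modeling steps that follow.
docs = df['text'].map(preprocess)
```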