Building an Ensemble LDA model with Gensim
The procedure closely follows the one used to build the LDA model in the previous chapter. Ensemble LDA also requires text preprocessing.
Preprocessing the training data
Let’s load the training data:
import pandas as pd
import numpy as np

# Show full column text; None replaces the deprecated -1 in recent pandas versions
pd.set_option('display.max_colwidth', None)
path = '/content/gdrive/My Drive/data/gensim'
train = pd.read_csv(path + "/ag_news_train.csv")
We will tokenize each document with preprocess_string, which by default also lowercases the text, removes stopwords, and stems the tokens:
from gensim.parsing.preprocessing import preprocess_string

text_tokenized = []
for doc in train['Description']:
    # preprocess_string returns a list of cleaned, stemmed tokens
    k = preprocess_string(doc)
    text_tokenized.append(k)
text_tokenized[0:3]
The output looks like this:
[['reuter', 'short', 'seller', 'wall', 'street', 'dwindl', 'band', 'ultra', '...
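Before moving on, it can help to sanity-check the tokenized documents, for example by counting empty documents and looking at the most frequent tokens. The sketch below is one possible way to do this with the standard library only. The helper name summarize_tokens and the toy list sample_tokenized are illustrative assumptions, not part of Gensim or of the AG News output; in practice you would pass text_tokenized instead.

```python
from collections import Counter

def summarize_tokens(docs, top_n=5):
    """Return basic statistics for a list of token lists (hypothetical helper)."""
    lengths = [len(doc) for doc in docs]
    # Count every token across all documents
    counts = Counter(token for doc in docs for token in doc)
    return {
        "n_docs": len(docs),
        "n_empty": sum(1 for n in lengths if n == 0),
        "avg_len": sum(lengths) / len(docs) if docs else 0.0,
        "top_tokens": counts.most_common(top_n),
    }

# Toy stand-in for text_tokenized (made-up tokens, not the real output)
sample_tokenized = [
    ["stock", "market", "fall"],
    ["oil", "price", "rise", "market"],
    [],
]
summary = summarize_tokens(sample_tokenized, top_n=1)
print(summary)
```

A noticeable share of empty documents would suggest that the default filters are too aggressive for very short descriptions.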