Using contextualized topic models
In this recipe, we will look at another topic modeling algorithm: contextualized topic models. This algorithm combines document embeddings with a bag-of-words document representation to produce a more effective topic model.
We will show you how to use the trained topic model with input in other languages. This feature is especially useful because we can create a topic model in one language, for example, one that has many resources available, and then apply it to another language that has fewer resources. To achieve this, we will use a multilingual embedding model to encode the data.
Getting ready
We will need the contextualized-topic-models package for this recipe. It is part of the poetry environment and the requirements.txt file.
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.5-contextualized-tm.ipynb.
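If you are not using the poetry environment or the requirements.txt file, you can install the package directly from PyPI. A minimal sketch (the embedding model used later is downloaded automatically on first use):

```
pip install contextualized-topic-models
```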
How to do it...
In this recipe, we will load the data, preprocess it, and use the ZeroShotTM model to cluster the documents into topics. If you would like more information about the algorithm, please see the package documentation at https://pypi.org/project/contextualized-topic-models/.
- Do the necessary imports:
```python
import pandas as pd
from nltk.corpus import stopwords
from contextualized_topic_models.utils.preprocessing import (
    WhiteSpacePreprocessingStopwords)
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import (
    TopicModelDataPreparation)
```
- Suppress the warnings:
```python
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```
- Create the stopwords list and read in the data:
```python
stop_words = stopwords.words('english')
stop_words.append("said")
bbc_df = pd.read_csv("../data/bbc-text.csv")
```
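If you have not used the NLTK stopwords corpus before, you may need to download it once before this step works; a minimal sketch:

```python
import nltk

# Downloads the stopwords corpus; it is cached locally afterward
nltk.download("stopwords")
```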
- In this step, we will create the preprocessor object and use it to preprocess the documents. The contextualized-topic-models package provides different preprocessors that prepare the data for the topic model algorithm. This preprocessor tokenizes the documents, removes the stopwords, and joins the remaining tokens back into strings. It returns the list of preprocessed documents, the list of original documents, the dataset vocabulary, and the list of indices that the retained documents have in the original dataframe:

```python
documents = bbc_df["text"]
preprocessor = WhiteSpacePreprocessingStopwords(
    documents, stopwords_list=stop_words)
preprocessed_documents, unpreprocessed_documents, vocab, indices = \
    preprocessor.preprocess()
```
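Optionally, you can sanity-check the preprocessor output before moving on; a minimal sketch (the exact counts will depend on your copy of the dataset):

```python
# The preprocessed and original document lists are aligned by index
print(len(preprocessed_documents), len(unpreprocessed_documents))
# The vocabulary size is what we later pass as bow_size
print(len(vocab))
# Compare a raw document with its preprocessed form
print(unpreprocessed_documents[0][:100])
print(preprocessed_documents[0][:100])
```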
- Here, we will create the TopicModelDataPreparation object, passing the name of the embedding model as the parameter. This is a multilingual model that can encode text in various languages with good results. We will then fit the object on the documents. It uses the embedding model to turn the texts into embeddings and also creates a bag-of-words model. The output is a CTMDataset object that represents the training dataset in the format required by the topic model training algorithm:

```python
tp = TopicModelDataPreparation(
    "distiluse-base-multilingual-cased")
training_dataset = tp.fit(
    text_for_contextual=unpreprocessed_documents,
    text_for_bow=preprocessed_documents)
```
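The contextual_size parameter used in the next step must match the output dimension of the embedding model. If you swap in a different model, you can check its dimension with the sentence-transformers library (a dependency of contextualized-topic-models); a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased")
# distiluse-base-multilingual-cased produces 512-dimensional embeddings,
# which is why contextual_size is set to 512 in the next step
print(model.get_sentence_embedding_dimension())  # 512
```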
- In this step, we will create the topic model using the ZeroShotTM object. The term zero-shot means that the model has no prior information about the documents. We will input the size of the vocabulary for the bag-of-words model, the size of the embedding vector, the number of topics (the n_components parameter), and the number of epochs to train the model for. We will use five topics, since the BBC dataset has that many categories. When you apply this algorithm to your own data, you will need to experiment with different numbers of topics. Finally, we will fit the initialized topic model on the training dataset:

```python
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=512,
                 n_components=5, num_epochs=100)
ctm.fit(training_dataset)
```
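Training takes a while, so you may want to persist the fitted model. The model object exposes save and load methods, although they are marked experimental in some versions of the package, so treat the following as a sketch and verify it against your installed version; the directory name and the epoch number below are assumptions (save generates the directory name from the model's hyperparameters, and with num_epochs=100 the last epoch index is 99):

```python
# Save the trained model; a subdirectory is created under models_dir
ctm.save(models_dir="./")

# Later: re-create a model with the same hyperparameters, then load the
# saved weights (replace the path with the directory that save created)
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=512,
                 n_components=5, num_epochs=100)
ctm.load("./saved_model_directory", epoch=99)
```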
- Here, we will inspect the topics. We can see that they align well with the gold labels: topic 0 is tech, topic 1 is sport, topic 2 is business, topic 3 is entertainment, and topic 4 is politics:
```python
ctm.get_topics()
```
The results will vary; this is the output we get:
Figure 6.6 – The contextualized model output
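Since the topic assignments vary between runs, it is useful to print each topic with its index so you can map the topics to the dataset's categories yourself; a minimal sketch that iterates over the dictionary returned by get_topics:

```python
# get_topics returns a mapping from topic index to its top words
for topic_id, words in ctm.get_topics().items():
    print(topic_id, ", ".join(words))
```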
- Now, we will initialize a new news piece, this time in Spanish, to see how well the topic model trained on English-language documents performs on a news article in a different language. This particular news piece should fall into the tech topic (in English, it reads: "IBM announces the start of the 'quantum utility era' and anticipates a supercomputer in 2033. The company claims to have achieved a computing system that cannot be simulated by classical methods"). We will preprocess it using the TopicModelDataPreparation object. To use the model on the encoded text, we need a dataset object, which is why we wrap the Spanish news piece in a list before passing it to the transform method; the resulting dataset consists of a single element:

```python
spanish_news_piece = """IBM anuncia el comienzo de la "era de la
utilidad cuántica" y anticipa un superordenador en 2033. La compañía
asegura haber alcanzado un sistema de computación que no se puede
simular con procedimientos clásicos."""
testing_dataset = tp.transform([spanish_news_piece])
```
- In this step, we will get the topic distribution for the testing dataset we created in the previous step. The result is an array with one row per document, where each entry in a row is the probability that the document belongs to the topic with that index:
```python
ctm.get_doc_topic_distribution(testing_dataset)
```
In this case, the highest probability is for topic 0, which is indeed tech:
```
array([[0.5902461 , 0.09361929, 0.14041995, 0.07586181, 0.0998529 ]],
      dtype=float32)
```
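To turn the distribution into a single predicted topic, take the index with the highest probability; a minimal sketch (the topic-to-label mapping reflects the run shown above and will differ between runs):

```python
import numpy as np

distribution = ctm.get_doc_topic_distribution(testing_dataset)
# Topic labels as observed in this run; re-derive them for your own run
topic_labels = {0: "tech", 1: "sport", 2: "business",
                3: "entertainment", 4: "politics"}
predicted = int(np.argmax(distribution[0]))
print(predicted, topic_labels[predicted])  # 0 tech
```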
See also
For more information about contextualized topic models, see https://contextualized-topic-models.readthedocs.io/en/latest/index.html.