Using contextualized topic models
In this recipe, we will look at another topic modeling algorithm: contextualized topic models. This algorithm combines document embeddings with a bag-of-words document representation to produce a more effective topic model.
We will show you how to use the trained topic model with input in other languages. This feature is especially useful because we can create a topic model in one language, for example, one that has many resources available, and then apply it to another language that has fewer resources. To achieve this, we will use a multilingual embedding model to encode the data.
Getting ready
We will need the contextualized-topic-models package for this recipe. It is part of the poetry environment and the requirements.txt file.
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.5-contextualized-tm.ipynb.
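If you are not using the poetry environment or the requirements.txt file, you can install the package directly from PyPI. A minimal sketch (the embedding model used later is downloaded automatically on first use):

```
pip install contextualized-topic-models
```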
How to do it...
In this recipe, we will load the data, preprocess it, and use the ZeroShotTM model to cluster the documents into topics. If you would like more information about the algorithm, please see the package documentation at https://pypi.org/project/contextualized-topic-models/.
- Do the necessary imports:
```python
import pandas as pd
from nltk.corpus import stopwords
from contextualized_topic_models.utils.preprocessing import (
    WhiteSpacePreprocessingStopwords)
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import (
    TopicModelDataPreparation)
```
- Suppress the warnings:
```python
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```
- Create the stopwords list and read in the data:
```python
stop_words = stopwords.words('english')
stop_words.append("said")
bbc_df = pd.read_csv("../data/bbc-text.csv")
```
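If you have not used the NLTK stopwords corpus before, you may need to download it once before this step works; a minimal sketch:

```python
import nltk

# Downloads the stopwords corpus; it is cached locally afterward
nltk.download("stopwords")
```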
- In this step, we will create the preprocessor object and use it to preprocess the documents. The contextualized-topic-models package provides different preprocessors that prepare the data for the topic model algorithm. This preprocessor tokenizes the documents, removes the stopwords, and joins the remaining tokens back into strings. It returns the list of preprocessed documents, the list of original documents, the dataset vocabulary, and the list of indices that the retained documents have in the original dataframe:

```python
documents = bbc_df["text"]
preprocessor = WhiteSpacePreprocessingStopwords(
    documents, stopwords_list=stop_words)
preprocessed_documents, unpreprocessed_documents, vocab, indices = \
    preprocessor.preprocess()
```
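Optionally, you can sanity-check the preprocessor output before moving on; a minimal sketch (the exact counts will depend on your copy of the dataset):

```python
# The preprocessed and original document lists are aligned by index
print(len(preprocessed_documents), len(unpreprocessed_documents))
# The vocabulary size is what we later pass as bow_size
print(len(vocab))
# Compare a raw document with its preprocessed form
print(unpreprocessed_documents[0][:100])
print(preprocessed_documents[0][:100])
```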
- Here, we will create the TopicModelDataPreparation object, passing the name of the embedding model as the parameter. This is a multilingual model that can encode text in various languages with good results. We will then fit the object on the documents. It uses the embedding model to turn the texts into embeddings and also creates a bag-of-words model. The output is a CTMDataset object that represents the training dataset in the format required by the topic model training algorithm:

```python
tp = TopicModelDataPreparation(
    "distiluse-base-multilingual-cased")
training_dataset = tp.fit(
    text_for_contextual=unpreprocessed_documents,
    text_for_bow=preprocessed_documents)
```
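The contextual_size parameter used in the next step must match the output dimension of the embedding model. If you swap in a different model, you can check its dimension with the sentence-transformers library (a dependency of contextualized-topic-models); a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased")
# distiluse-base-multilingual-cased produces 512-dimensional embeddings,
# which is why contextual_size is set to 512 in the next step
print(model.get_sentence_embedding_dimension())  # 512
```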
- In this step, we will create the topic model using the ZeroShotTM object. The term zero-shot means that the model has no prior information about the documents. We will input the size of the vocabulary for the bag-of-words model, the size of the embedding vector, the number of topics (the n_components parameter), and the number of epochs to train the model for. We will use five topics, since the BBC dataset has that many categories. When you apply this algorithm to your own data, you will need to experiment with different numbers of topics. Finally, we will fit the initialized topic model on the training dataset:

```python
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=512,
                 n_components=5, num_epochs=100)
ctm.fit(training_dataset)
```
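Training takes a while, so you may want to persist the fitted model. The model object exposes save and load methods, although they are marked experimental in some versions of the package, so treat the following as a sketch and verify it against your installed version; the directory name and the epoch number below are assumptions (save generates the directory name from the model's hyperparameters, and with num_epochs=100 the last epoch index is 99):

```python
# Save the trained model; a subdirectory is created under models_dir
ctm.save(models_dir="./")

# Later: re-create a model with the same hyperparameters, then load the
# saved weights (replace the path with the directory that save created)
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=512,
                 n_components=5, num_epochs=100)
ctm.load("./saved_model_directory", epoch=99)
```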
- Here, we will inspect the topics. We can see that they align well with the gold labels: topic 0 is tech, topic 1 is sport, topic 2 is business, topic 3 is entertainment, and topic 4 is politics:
```python
ctm.get_topics()
```
The results will vary; this is the output we get:
Figure 6.6 – The contextualized model output
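Since the topic assignments vary between runs, it is useful to print each topic with its index so you can map the topics to the dataset's categories yourself; a minimal sketch that iterates over the dictionary returned by get_topics:

```python
# get_topics returns a mapping from topic index to its top words
for topic_id, words in ctm.get_topics().items():
    print(topic_id, ", ".join(words))
```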
- Now, we will initialize a new news piece, this time in Spanish, to see how well the topic model trained on English-language documents performs on a news article in a different language. This particular news piece should fall into the tech topic (in English, it reads: "IBM announces the start of the 'quantum utility era' and anticipates a supercomputer in 2033. The company claims to have achieved a computing system that cannot be simulated by classical methods"). We will preprocess it using the TopicModelDataPreparation object. To use the model on the encoded text, we need a dataset object, which is why we wrap the Spanish news piece in a list before passing it to the transform method; the resulting dataset consists of a single element:

```python
spanish_news_piece = """IBM anuncia el comienzo de la "era de la
utilidad cuántica" y anticipa un superordenador en 2033. La compañía
asegura haber alcanzado un sistema de computación que no se puede
simular con procedimientos clásicos."""
testing_dataset = tp.transform([spanish_news_piece])
```
- In this step, we will get the topic distribution for the testing dataset we created in the previous step. The result is an array with one row per document, where each entry in a row is the probability that the document belongs to the topic with that index:
```python
ctm.get_doc_topic_distribution(testing_dataset)
```
In this case, the highest probability is for topic 0, which is indeed tech:
```
array([[0.5902461 , 0.09361929, 0.14041995, 0.07586181, 0.0998529 ]],
      dtype=float32)
```
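To turn the distribution into a single predicted topic, take the index with the highest probability; a minimal sketch (the topic-to-label mapping reflects the run shown above and will differ between runs):

```python
import numpy as np

distribution = ctm.get_doc_topic_distribution(testing_dataset)
# Topic labels as observed in this run; re-derive them for your own run
topic_labels = {0: "tech", 1: "sport", 2: "business",
                3: "entertainment", 4: "politics"}
predicted = int(np.argmax(distribution[0]))
print(predicted, topic_labels[predicted])  # 0 tech
```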
See also
For more information about contextualized topic models, see https://contextualized-topic-models.readthedocs.io/en/latest/index.html.