Python Natural Language Processing Cookbook

Over 60 recipes for building powerful NLP solutions using Python and LLM libraries

Product type: Paperback
Published: Sep 2024
Publisher: Packt
ISBN-13: 9781803245744
Length: 312 pages
Edition: 2nd Edition
Authors (2): Saurabh Chakravarty, Zhenya Antić
Table of Contents

Preface
1. Chapter 1: Learning NLP Basics
2. Chapter 2: Playing with Grammar
3. Chapter 3: Representing Text – Capturing Semantics
4. Chapter 4: Classifying Texts
5. Chapter 5: Getting Started with Information Extraction
6. Chapter 6: Topic Modeling
7. Chapter 7: Visualizing Text Data
8. Chapter 8: Transformers and Their Applications
9. Chapter 9: Natural Language Understanding
10. Chapter 10: Generative AI and Large Language Models
11. Index
12. Other Books You May Enjoy

Using contextualized topic models

In this recipe, we will look at another topic modeling algorithm: contextualized topic models. This approach combines embeddings with a bag-of-words document representation to produce a more effective topic model.

We will show you how to use the trained topic model with input in other languages. This feature is especially useful because we can create a topic model in one language, for example, one that has many resources available, and then apply it to another language that does not have as many resources. To achieve this, we will use a multilingual embedding model to encode the data.

Getting ready

We will need the contextualized-topic-models package for this recipe. It is part of the poetry environment and the requirements.txt file.
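
If you are not using the book's poetry environment, the package can be installed directly and the NLTK stopword list downloaded once per machine. The following is a minimal sketch for a Jupyter notebook cell, assuming a standard pip setup (the package name matches its PyPI page):

    !pip install contextualized-topic-models

    # Download the NLTK stopword list used later in the recipe (only needed once)
    import nltk
    nltk.download("stopwords")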

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.5-contextualized-tm.ipynb.

How to do it...

In this recipe, we will load the data, preprocess it, and use a contextualized topic model to cluster the documents into topics. If you would like more information about the algorithm, please see the package documentation at https://pypi.org/project/contextualized-topic-models/.

  1. Do the necessary imports:
    import pandas as pd
    from nltk.corpus import stopwords
    from contextualized_topic_models.utils.preprocessing import (
        WhiteSpacePreprocessingStopwords)
    from contextualized_topic_models.models.ctm import ZeroShotTM
    from contextualized_topic_models.utils.data_preparation import (
        TopicModelDataPreparation)
  2. Suppress the warnings:
    import warnings
    warnings.filterwarnings('ignore')
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    import os
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
  3. Create the stopwords list and read in the data:
    stop_words = stopwords.words('english')
    stop_words.append("said")
    bbc_df = pd.read_csv("../data/bbc-text.csv")
  4. In this step, we will create the preprocessor object and use it to preprocess the documents. The contextualized-topic-models package provides different preprocessors that prepare the data for the topic model algorithm. This preprocessor tokenizes the documents, removes the stopwords, and joins the remaining tokens back into strings. It returns the list of preprocessed documents, the list of original documents, the dataset vocabulary, and a list of document indices in the original dataframe:
    documents = bbc_df["text"]
    preprocessor = WhiteSpacePreprocessingStopwords(
        documents, stopwords_list=stop_words)
    preprocessed_documents, unpreprocessed_documents, vocab, indices = \
        preprocessor.preprocess()
  5. Here, we will create the TopicModelDataPreparation object, passing the embedding model name as the parameter. This is a multilingual model that can encode text in various languages with good results. We will then fit the object on the documents. It uses the embedding model to turn the original texts into embeddings and builds a bag-of-words model from the preprocessed texts. The output is a CTMDataset object that represents the training dataset in the format required by the topic model training algorithm:
    tp = TopicModelDataPreparation(
        "distiluse-base-multilingual-cased")
    training_dataset = tp.fit(
        text_for_contextual=unpreprocessed_documents,
        text_for_bow=preprocessed_documents)
  6. In this step, we will create the topic model using the ZeroShotTM object. The term zero shot means that the model has no prior information about the documents. We will input the size of the vocabulary for the bag-of-words model, the size of the embeddings vector, the number of topics (the n_components parameter), and the number of epochs to train the model for. We will use five topics, since the BBC dataset has that many topics. When you apply this algorithm to your data, you will need to experiment with different numbers of topics. Finally, we will fit the initialized topic model on the training dataset:
    ctm = ZeroShotTM(bow_size=len(tp.vocab),
        contextual_size=512, n_components=5,
        num_epochs=100)
    ctm.fit(training_dataset)
  7. Here, we will inspect the topics. We can see that they correspond well to the gold labels: topic 0 is tech, topic 1 is sport, topic 2 is business, topic 3 is entertainment, and topic 4 is politics (a plain-text way to print the topics is sketched after the figure):
    ctm.get_topics()

    The results will vary; this is the output we get:

Figure 6.6 – The contextualized model output
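
If you prefer a plain-text view of the topics over the notebook display, the sketch below prints the top words of each topic. It assumes, per the package documentation, that get_topics returns a mapping from topic index to its most probable words:

    # Print each topic index together with its top words
    for topic_id, words in ctm.get_topics().items():
        print(topic_id, ", ".join(words))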

  8. Now, we will define a new news piece, this time in Spanish, to see how well the topic model trained on English-language documents performs on a news article in a different language. This particular news piece should fall into the tech topic. We will preprocess it using the TopicModelDataPreparation object. To use the model on the encoded text, we need to create a dataset object; that is why we include the Spanish news piece in a list and then pass it on for data preparation. In the next step, we will pass the resulting dataset (which consists of only one element) through the model:
    spanish_news_piece = """IBM anuncia el comienzo de la "era de la utilidad cuántica" y anticipa un superordenador en 2033.
    La compañía asegura haber alcanzado un sistema de computación que no se puede simular con procedimientos clásicos."""
    testing_dataset = tp.transform([spanish_news_piece])
  9. In this step, we will get the topic distribution for the testing dataset we created in the previous step. The result is a list of lists, where each inner list contains, for one input text, the probability of each topic; the position of each probability in the inner list corresponds to the topic number (a sketch for picking the most likely topic follows the output below):
    ctm.get_doc_topic_distribution(testing_dataset)

    In this case, the highest probability is for topic 0, which is indeed tech:

    array([[0.5902461 , 0.09361929, 0.14041995, 0.07586181, 0.0998529 ]],
          dtype=float32)
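
To reduce this distribution to a single predicted topic, you can take the index of the highest probability. This is a minimal sketch using NumPy; the variable names follow the steps above, and the import is shown for completeness even though the notebook may already have NumPy loaded:

    import numpy as np

    # Index of the highest-probability topic for the single Spanish document
    distribution = ctm.get_doc_topic_distribution(testing_dataset)
    predicted_topic = int(np.argmax(distribution[0]))
    print(predicted_topic)  # 0 corresponds to the tech topic in this run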

See also

For more information about contextualized topic models, see https://contextualized-topic-models.readthedocs.io/en/latest/index.html.
