You're reading from The Handbook of NLP with Gensim Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data

Product type Paperback

Published in Oct 2023

Publisher Packt

ISBN-13 9781803244945

Length 310 pages

Edition 1st Edition

Tools

fastText

Concepts

Mobile Application Development

Author (1):

Chris Kuo

View More author details

Table of Contents (24) Chapters

Preface

1. Part 1: NLP Basics

2. Chapter 1: Introduction to NLP FREE CHAPTER

3. Chapter 2: Text Representation

4. Chapter 3: Text Wrangling and Preprocessing

5. Part 2: Latent Semantic Analysis/Latent Semantic Indexing

6. Chapter 4: Latent Semantic Analysis with scikit-learn

7. Chapter 5: Cosine Similarity

8. Chapter 6: Latent Semantic Indexing with Gensim

9. Part 3: Word2Vec and Doc2Vec

10. Chapter 7: Using Word2Vec

11. Chapter 8: Doc2Vec with Gensim

12. Part 4: Topic Modeling with Latent Dirichlet Allocation

13. Chapter 9: Understanding Discrete Distributions

14. Chapter 10: Latent Dirichlet Allocation

15. Chapter 11: LDA Modeling

16. Chapter 12: LDA Visualization

17. Chapter 13: The Ensemble LDA for Model Stability

18. Part 5: Comparison and Applications

19. Chapter 14: LDA and BERTopic

20. Chapter 15: Real-World Use Cases

21. Assessments

22. Index

Why subscribe?

23. Other Books You May Enjoy

NLU + NLG = NLP

NLP is an umbrella term that covers natural language understanding (NLU) and NLG. We’ll go through both in the next sections.

NLU

Many languages, such as English, German, and Chinese, have been developing for hundreds of years and continue to evolve. Humans can use languages artfully in various social contexts. Now, we are asking a computer to understand human language. What’s very rudimentary to us may not be so apparent to a computer. Linguists have contributed much to the development of computers’ understanding in terms of syntax, semantics, phonology, morphology, and pragmatics.

NLU focuses on understanding the meaning of human language. It extracts text or speech input and then analyzes the syntax, semantics, phonology, morphology, and pragmatics in the language. Let’s briefly go over each one:

Syntax: This is about the study of how words are arranged to form phrases and clauses, as well as the use of punctuation, order of words, and sentences.
Semantics: This is about the possible meanings of a sentence based on the interactions between words in the sentence. It is concerned with the interpretation of language, rather than its form or structure. For example, the word “table” as a noun can refer to “a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs” or a data frame in a computer language.

Let’s elaborate more on semantics with two jokes. The first example is as follows:

Patient: “Doctor, doctor! I’ve broken my arm in three places!”

Doctor: “Well, stop going to those places, then.”

The patient uses the word places to mean the spots on the arm; the doctor uses places to mean physical locations.

The second example is as follows:

My coworker: “Do you ever think about working from home?”

Me: “I don’t even think about work at work!”

The first work in my reply means the tasks in my work. The second work in at work means at one’s place of employment.

NLU can understand the two meanings of a word in such jokes through a technique called word embedding. We will learn more about this in Chapter 2, Text Representation.

Phonology: This is about the study of the sound system of a language, including the sounds of speech (phonemes), how they are combined to form words (morphology), and how they are organized into larger units such as syllables and stress patterns. For example, the sounds represented by the letters “p” and “b” in English are distinct phonemes. A phoneme is the smallest unit of sound in a language that can change the meaning of a word. Consider the words “pat” and “bat.” The only difference between these two words is the initial sound, but their meanings are different.
Morphology: This is the study of the structure of words, including the way in which they are formed from smaller units of meaning called morphemes. It originally comes from “morph,” the shape or form, and “ology,” the study of something. Morphology is important because it helps us understand how words are formed and how they relate to each other. It also helps us understand how words change over time and how they are related to other words in a language. For example, the word “unkindness” consists of three separate morphemes: the prefix “un-,” the root “kind,” and the suffix “-ness.”
Pragmatics: This is the study of how language is used in a social context. Pragmatics is important because it helps us understand how language works in real-world situations, and how language can be used to convey meaning and achieve specific purposes. For example, if you offer to buy your friend a McDonald’s burger, a large fries, and a large drink, your friend may reply "no" because he is concerned about becoming fat. Your friend may simply mean the burger meal is high in calories, but the conversation can also imply he may be fat in a social context.

Now, let’s understand NLG.

NLG

While NLU is concerned with reading for a computer to comprehend, NLG is about writing for a computer to write. The term generation in NLG refers to an NLP model generating meaningful words or even articles. Today, when you compose an email or type a sentence in an app, it presents possible words to complete your sentence or performs automatic correction. These are applications of NLG. As a result, the term generative AI is coined for generative models including language, voice, image, and video generation. ChatGPT and GPT-4 from OpenAI are probably the most famous examples of generative AI. I will briefly introduce ChatGPT, GPT-4, and other open source products, such as gpt.h2o.ai, and present the HuggingFace.co open source community.

ChatGPT and GPT-4

You can enter a prompt for ChatGPT to generate a poem, a prompt, a story, and so on. With its use of the reinforcement learning from human feedback (RLHF) technique, ChatGPT is designed to respond to questions in a way that sounds natural and human-like, making it easier for people to communicate with the model. It is also able to remember the conversations prior to the current conversation. If you ask ChatGPT, “Should I wear a blue shirt or a white shirt for tomorrow’s outdoor company meeting when it is likely to be hot and sunny?”, it will formulate a response by inferring a sequence of words that are likely to come next. It answers, “When it is hot and sunny, you may want to wear a white shirt.” If you reply, “How about on a gloomy day?”, ChatGPT understands that you mean to ask, in continuation to your prior question, about what to wear for tomorrow’s outdoor company meeting. It does not take the second question as an independent question and answer it randomly. This ability to remember and contextualize inputs is what gives ChatGPT the ability to carry on some semblance of a human conversation rather than give naïve, one-off answers. Hence, having the memory for long context is the key for next-generation language models such as GPT-4.

Generative Pre-trained Transformer 4 (GPT-4) was released by OpenAI on March 14, 2023. It is a transformer-based model that can predict the next word or token. It can answer questions, summarize text, translate text to other languages, and generate code, blog posts, stories, conversations, and other content types. The ability to remember and contextualize inputs, known as the context window is key for a language model. The context window of GPT-4 has been increased to roughly 8,000 words. Given the English conversation speed is about 120 words per minute and 7,200 words per hour, a context window of 8,000 words is more than an hour’s conversation.

What is the token or transformer mentioned in the previous paragraph? A token is the NLP term referring to a word separated from a sentence, or tokenized. A transformer is the milestone language model invented in 2017 by Vaswani et al [2]. In Chapter 2, Text Representation, I will explain how tokenization works in text preprocessing. In Chapter 13, The Ensemble LDA for Model Stability, I will give high-level explanations of a transformer model and others including the BERTopic model that builds on it.

Gpt.h2o.ai

GPT-H2O.ai is an LLM by H2O.ai. It is aimed at enterprise-grade LLMs such as ChatGPT, but is open source. I let gpt.h2o.ai define itself by asking “What is gpt.h2o.ai?” It answered: “gpt.h2o.ai is a large language model that was trained on a combination of web crawl data and Reddit comments. The model can generate human-like text and has been used for a variety of tasks, including generating code, answering questions, and summarizing text.”

Let’s look at its storytelling capability. I entered: “Write a fable on a lion that rescued a boy when the boy was under the attack of aliens.” It returned the following interesting story:

“Once upon a time, there was a young boy named Jack who loved to explore the woods near his home. One day, while he was out on one of his adventures, he came across a strange sight. There, in the middle of the forest, was a spaceship. It was unlike anything Jack had ever seen before. As he approached the ship, he heard a strange noise coming from inside. Jack peered through the window and saw a group of aliens inside. They were holding a young boy captive and seemed to be doing something to him. Jack knew he had to act quickly. Just then, a lion appeared out of nowhere. The lion roared loudly, scaring the aliens away. The boy was freed and the spaceship took off into the sky. The boy thanked the lion for saving him and asked how he knew to come to his rescue. The lion explained that it had been watching over him for some time and knew that he was in danger. The boy and the lion became friends. Jack never forgot the day that he saw the lion save the boy from the aliens.”

HuggingFace.co

New research on LLMs has burgeoned over the past few years. New LLMs, datasets, and solutions become hard to keep up with and create a new knowledge barrier for anyone who wants to enter the world of generative AI. Hugging Face provides a platform called HuggingFace.co that allows everyone to access open source LLMs, academic papers, and various datasets that trained the LLMs on its platform. You can also share your LLMs and datasets.

You may ask why LLMs can answer questions, write book reports, draft notes, or summarize documents. An important factor is the data on which they were trained or with which they were fine-tuned. These large-scale datasets include “News/Wikipedia,” “Web crawling,” “Questions and Answers (Q&A),” “Books,” and “Reading comprehension.” In “Large Language Model Datasets” [3], I have given a more detailed review of the datasets that trained prominent LLMs such as GPT-2, GPT-3, GPT-4, and so on.

I trust you have gained a better understanding of NLU and NLG. Next, I will introduce the NLP techniques covered by Gensim.