The Handbook of NLP with Gensim

You're reading from The Handbook of NLP with Gensim: Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data

Product type: Paperback
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781803244945
Length: 310 pages
Edition: 1st Edition
Author: Chris Kuo

NLU + NLG = NLP

NLP is an umbrella term that covers natural language understanding (NLU) and natural language generation (NLG). We’ll go through both in the next sections.

NLU

Many languages, such as English, German, and Chinese, have been developing for centuries and continue to evolve. Humans use language artfully in various social contexts. Now we are asking a computer to understand human language, and what is rudimentary to us may not be apparent to a computer. Linguists have contributed much to computers’ understanding of language in terms of syntax, semantics, phonology, morphology, and pragmatics.

NLU focuses on understanding the meaning of human language. It takes text or speech as input and analyzes its syntax, semantics, phonology, morphology, and pragmatics. Let’s briefly go over each one:

  • Syntax: This is the study of how words are arranged to form phrases, clauses, and sentences, including word order and the use of punctuation.
  • Semantics: This is about the possible meanings of a sentence based on the meanings of its words and the interactions between them. It is concerned with the interpretation of language, rather than its form or structure. For example, the word “table” as a noun can refer to “a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs” or to a table of data, such as a data frame, in a programming language.

Let’s elaborate more on semantics with two jokes. The first example is as follows:

Patient: “Doctor, doctor! I’ve broken my arm in three places!”

Doctor: “Well, stop going to those places, then.”

The patient uses the word “places” to mean the spots on the arm; the doctor uses “places” to mean physical locations.

The second example is as follows:

My coworker: “Do you ever think about working from home?”

Me: “I don’t even think about work at work!”

The first “work” in my reply means the tasks of the job. The second, in “at work,” means one’s place of employment.

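NLU systems can tell apart the two meanings of a word in such jokes with the help of a technique called word embedding, which represents words as numeric vectors. Embeddings that take the surrounding words into account can assign different vectors to the two uses of the same word. We will learn more about this in Chapter 2, Text Representation.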
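To make the idea concrete, here is a minimal sketch of telling the two uses of “places” apart by looking at the surrounding words. The two-dimensional vectors are hand-picked for illustration and are not learned embeddings; the names TOY_VECTORS, context_vector, BODY_SENSE, and LOCATION_SENSE are hypothetical.

```python
import math

# Hand-picked 2-d toy vectors, NOT learned embeddings (assumption for illustration).
# Axis 0 loosely stands for "body/injury", axis 1 for "travel/location".
TOY_VECTORS = {
    "broken": (0.9, 0.1),
    "arm": (1.0, 0.0),
    "stop": (0.2, 0.8),
    "going": (0.1, 0.9),
}

# Reference directions for the two senses of "places" (also hand-picked).
BODY_SENSE = (1.0, 0.0)
LOCATION_SENSE = (0.0, 1.0)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def context_vector(words):
    """Average the toy vectors of the context words that have one."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    return tuple(sum(dim) / len(known) for dim in zip(*known))


# Words surrounding "places" in each speaker's sentence.
patient_context = context_vector(["broken", "my", "arm", "in", "three"])
doctor_context = context_vector(["stop", "going", "to", "those"])

for label, vec in [("patient", patient_context), ("doctor", doctor_context)]:
    body = cosine(vec, BODY_SENSE)
    location = cosine(vec, LOCATION_SENSE)
    sense = "spot on the body" if body > location else "physical location"
    print(f"'places' in the {label}'s sentence -> {sense}")
```

Real embedding models learn vectors with hundreds of dimensions from large corpora; the averaging used here only illustrates why context can separate the two senses.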

  • Phonology: This is the study of the sound system of a language, including the sounds of speech (phonemes), how they combine to form words, and how they are organized into larger units such as syllables and stress patterns. A phoneme is the smallest unit of sound in a language that can change the meaning of a word. For example, the sounds represented by the letters “p” and “b” in English are distinct phonemes. Consider the words “pat” and “bat”: the only difference between them is the initial sound, yet their meanings are different.
  • Morphology: This is the study of the structure of words, including the way in which they are formed from smaller units of meaning called morphemes. The term comes from the Greek “morph,” meaning shape or form, and “-ology,” meaning the study of something. Morphology is important because it helps us understand how words are formed, how they relate to other words in a language, and how they change over time. For example, the word “unkindness” consists of three separate morphemes: the prefix “un-,” the root “kind,” and the suffix “-ness.”
  • Pragmatics: This is the study of how language is used in a social context. Pragmatics is important because it helps us understand how language works in real-world situations, and how language can be used to convey meaning and achieve specific purposes. For example, if you offer to buy your friend a McDonald’s burger, large fries, and a large drink, your friend may reply “no” because he is concerned about gaining weight. He may simply mean that the meal is high in calories, but in a social context the reply can also be taken to imply that he considers himself overweight.
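The “unkindness” and “pat”/“bat” examples above can be sketched in a few lines of Python. This is a deliberately naive illustration: the affix lists are tiny and hand-written, the helper names split_morphemes and is_minimal_pair are hypothetical, and letters stand in for phonemes, which only works for simple spellings like these.

```python
# Tiny, hand-written affix lists (assumption: real morphological analyzers
# rely on far larger lexicons and rules).
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ful", "ly")


def split_morphemes(word):
    """Return (prefix, root, suffix); a missing part is an empty string."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    root = rest[:len(rest) - len(suffix)]
    return prefix, root, suffix


def is_minimal_pair(a, b):
    """True when two equal-length words differ in exactly one position.

    Letters are used as a stand-in for phonemes (a simplification).
    """
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


print(split_morphemes("unkindness"))  # ('un', 'kind', 'ness')
print(is_minimal_pair("pat", "bat"))  # True
```

A purely string-based split like this will mis-segment words such as “under,” which is one reason practical systems use dictionaries or learned models.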

Now, let’s understand NLG.

NLG

While NLU is about a computer reading and comprehending language, NLG is about a computer writing it. The term generation in NLG refers to an NLP model generating meaningful words, sentences, or even whole articles. Today, when you compose an email or type a sentence in an app, it suggests possible words to complete your sentence or performs automatic correction. These are applications of NLG. The broader term generative AI has been coined for generative models that cover language, voice, image, and video generation. ChatGPT and GPT-4 from OpenAI are probably the most famous examples of generative AI. I will briefly introduce ChatGPT, GPT-4, and an open source product, gpt.h2o.ai, and then present the HuggingFace.co open source community.

ChatGPT and GPT-4

You can enter a prompt for ChatGPT to generate a poem, a story, and so on. With its use of the reinforcement learning from human feedback (RLHF) technique, ChatGPT is designed to respond to questions in a way that sounds natural and human-like, making it easier for people to communicate with the model. It is also able to remember the earlier turns of the current conversation. If you ask ChatGPT, “Should I wear a blue shirt or a white shirt for tomorrow’s outdoor company meeting when it is likely to be hot and sunny?”, it will formulate a response by inferring a sequence of words that are likely to come next. It answers, “When it is hot and sunny, you may want to wear a white shirt.” If you reply, “How about on a gloomy day?”, ChatGPT understands that you are asking, in continuation of your prior question, what to wear for tomorrow’s outdoor company meeting. It does not treat the second question as an independent question and answer it in isolation. This ability to remember and contextualize inputs is what lets ChatGPT carry on some semblance of a human conversation rather than give naïve, one-off answers. Hence, the capacity to hold a long context is key for next-generation language models such as GPT-4.
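One common way to give a model this kind of conversational memory is to send the earlier turns along with each new message. The sketch below illustrates that idea only; it is not OpenAI’s implementation, and build_model_input is a hypothetical helper that calls no real API.

```python
def build_model_input(history, new_message):
    """Join the earlier turns and the new message into one text for a model.

    `history` is a list of (speaker, text) pairs. This helper is hypothetical
    and only illustrates why a follow-up question can be read in context.
    """
    turns = history + [("user", new_message)]
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


history = [
    ("user", "Should I wear a blue shirt or a white shirt for tomorrow's "
             "outdoor company meeting when it is likely to be hot and sunny?"),
    ("assistant", "When it is hot and sunny, you may want to wear a white shirt."),
]
follow_up = "How about on a gloomy day?"

# The follow-up arrives together with the earlier question and answer,
# so "a gloomy day" can be tied back to the outdoor company meeting.
print(build_model_input(history, follow_up))
```

Because the whole exchange is passed in again on every turn, the amount of text a model can accept at once limits how much of the conversation it can take into account.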

Generative Pre-trained Transformer 4 (GPT-4) was released by OpenAI on March 14, 2023. It is a transformer-based model that can predict the next word or token. It can answer questions, summarize text, translate text to other languages, and generate code, blog posts, stories, conversations, and other content types. The amount of text a model can take into account at once, known as the context window, is key to its ability to remember and contextualize inputs. The context window of GPT-4 at launch was 8,192 tokens, which is roughly 6,000 English words. Given that English conversation runs at about 120 words per minute, or 7,200 words per hour, a context window of that size covers roughly 50 minutes of conversation.
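The arithmetic that relates a context window to conversation time can be checked with a short sketch. The window sizes passed in below are illustrative inputs, and the 0.75 words-per-token ratio is a common rule of thumb for English rather than an exact figure; the function names are hypothetical.

```python
WORDS_PER_MINUTE = 120   # conversational speaking rate quoted in the text
WORDS_PER_TOKEN = 0.75   # rough rule of thumb for English (assumption)


def minutes_of_conversation(window_words, words_per_minute=WORDS_PER_MINUTE):
    """How many minutes of speech fit into a window of `window_words` words."""
    return window_words / words_per_minute


def tokens_to_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Approximate number of English words represented by `tokens` tokens."""
    return tokens * words_per_token


print(WORDS_PER_MINUTE * 60)                                      # 7200 words per hour
print(round(minutes_of_conversation(8000), 1))                    # 66.7 minutes for 8,000 words
print(round(minutes_of_conversation(tokens_to_words(8192)), 1))   # 51.2 minutes for 8,192 tokens
```

Whether a window is quoted in words or in tokens changes the answer noticeably, so it is worth checking the unit before comparing models.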

What is the token or transformer mentioned in the previous paragraph? A token is the NLP term for a unit of text, such as a word or a piece of a word, that has been split out of a sentence, or tokenized. A transformer is the milestone neural network architecture for language models introduced in 2017 by Vaswani et al. [2]. In Chapter 2, Text Representation, I will explain how tokenization works in text preprocessing. In Chapter 13, The Ensemble LDA for Model Stability, I will give high-level explanations of a transformer model and others including the BERTopic model that builds on it.
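As a first impression of tokenization, here is a minimal word-level sketch using only Python’s standard library. It is an illustration, not Gensim’s tokenizer and not the one used by GPT-4, which splits text into subword units; the tokenize function is a hypothetical helper.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and return its runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


tokens = tokenize("Well, stop going to those places, then.")
print(tokens)       # ['well', 'stop', 'going', 'to', 'those', 'places', 'then']
print(len(tokens))  # 7
```

Punctuation is discarded here, and contractions such as “don’t” would be split at the apostrophe, which is one reason practical tokenizers are more elaborate.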

Gpt.h2o.ai

gpt.h2o.ai is a large language model (LLM) by H2O.ai. It is positioned as an enterprise-grade alternative to LLMs such as ChatGPT, but it is open source. I let gpt.h2o.ai define itself by asking “What is gpt.h2o.ai?” It answered: “gpt.h2o.ai is a large language model that was trained on a combination of web crawl data and Reddit comments. The model can generate human-like text and has been used for a variety of tasks, including generating code, answering questions, and summarizing text.”

Let’s look at its storytelling capability. I entered: “Write a fable on a lion that rescued a boy when the boy was under the attack of aliens.” It returned the following interesting story:

“Once upon a time, there was a young boy named Jack who loved to explore the woods near his home. One day, while he was out on one of his adventures, he came across a strange sight. There, in the middle of the forest, was a spaceship. It was unlike anything Jack had ever seen before. As he approached the ship, he heard a strange noise coming from inside. Jack peered through the window and saw a group of aliens inside. They were holding a young boy captive and seemed to be doing something to him. Jack knew he had to act quickly. Just then, a lion appeared out of nowhere. The lion roared loudly, scaring the aliens away. The boy was freed and the spaceship took off into the sky. The boy thanked the lion for saving him and asked how he knew to come to his rescue. The lion explained that it had been watching over him for some time and knew that he was in danger. The boy and the lion became friends. Jack never forgot the day that he saw the lion save the boy from the aliens.”

HuggingFace.co

Research on LLMs has burgeoned over the past few years. The stream of new LLMs, datasets, and solutions is hard to keep up with and creates a knowledge barrier for anyone who wants to enter the world of generative AI. Hugging Face provides a platform, HuggingFace.co, that gives everyone access to open source LLMs, academic papers, and the various datasets on which the LLMs were trained. You can also share your own LLMs and datasets.

You may ask why LLMs can answer questions, write book reports, draft notes, or summarize documents. An important factor is the data on which they were trained or with which they were fine-tuned. These large-scale datasets include “News/Wikipedia,” “Web crawling,” “Questions and Answers (Q&A),” “Books,” and “Reading comprehension.” In “Large Language Model Datasets” [3], I have given a more detailed review of the datasets that trained prominent LLMs such as GPT-2, GPT-3, GPT-4, and so on.

I trust you have gained a better understanding of NLU and NLG. Next, I will introduce the NLP techniques covered by Gensim.
