
The Handbook of NLP with Gensim: Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data


Introduction to NLP

“Why do we need NLP?” You may ask this question, having witnessed the advancement of natural language processing (NLP) in recent years. Let’s see how NLP helped a well-established investment firm named "Harmony Investments." For decades, Harmony Investments had been renowned for its astute financial strategies and portfolio management, ranging from stocks and bonds to real estate and alternative investments. However, the sheer volume and variety of data sources, including news articles, earnings reports, social media posts, and financial statements, made it nearly impossible to analyze all the information manually. The firm's analysts were spending an excessive amount of time collecting and reviewing data.

Recognizing the need for a more efficient and data-driven approach, the firm partnered with a leading AI solutions provider to implement NLP-driven solutions in its business operations:

  • It used NLP algorithms to review news articles, press releases, and social media platforms in real time, enabling the firm to react swiftly.
  • It used NLP tools that automatically summarized lengthy earnings reports, reducing the time analysts spent on manual document review.
  • It used NLP-powered sentiment analysis to gauge public sentiment surrounding specific stocks or market segments.

With these tools, analysts had more time for strategic research and developing innovative investment strategies. As a result, Harmony Investments not only retained its reputation as a leading investment firm but also attracted new clients and expanded its portfolio.

Joe is a data scientist who is new to NLP. He and his data analyst colleague, Jacob, are interested in learning the NLP techniques that can deliver the benefits just discussed. They have certainly heard of ChatGPT and all the news about large language models (LLMs). They want to learn NLP systematically, from concepts to practice, and are looking for a textbook that can bridge them to LLMs without diving into LLMs first. If you are like Joe or Jacob, then this book is for you.

A fundamental step in NLP, enabling computers to understand texts, is text representation, which converts a collection of text documents into numerical values. Each document is represented as a vector in a high-dimensional space, where each dimension corresponds to a unique word in the entire corpus. This helps computers understand what words mean and how they relate to each other in sentences. This book starts with bag-of-words (BoW), bag-of-N-grams, and term frequency-inverse document frequency (TF-IDF). A more advanced form of text representation is word embeddings: dense vector representations of words that capture semantic and syntactic relationships between words based on their context in a large dataset. With word embedding techniques such as Word2Vec, words with similar meanings have similar vector representations.
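To make the idea concrete, here is a minimal, library-free sketch of a bag-of-words representation; Gensim provides a production version of this, which we will use later in the book:

```python
from collections import Counter

def bag_of_words(documents):
    """Represent each document as a vector of word counts over the corpus vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["the cat sat", "the cat ate the fish"])
print(vocab)  # ['ate', 'cat', 'fish', 'sat', 'the']
print(vecs)   # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Note that each document vector has one dimension per vocabulary word, which is why these vectors become high-dimensional on real corpora.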

Topic modeling is a significant NLP subject. It classifies documents into topics for document retrieval, categorization, tagging, or annotation. This book gives deep insight into the milestone topic modeling technique Latent Dirichlet Allocation (LDA). Another milestone topic modeling technique is BERTopic. Let me briefly describe the development history of Bidirectional Encoder Representations from Transformers (BERT). The seminal paper “Attention Is All You Need” by Vaswani et al. [2] enabled many transformer-based word embeddings and LLMs. One of these word embedding models is BERT. Can we do topic modeling to classify documents based on BERT word embeddings? That’s the origin of BERTopic. I have included BERTopic in this book together with LDA so you get to see the differences. This will provide a bridge to transformer-based NLP techniques.

This book is a practical handbook with code snippets. I will cover many techniques in the Gensim library. Gensim is an open source Python library for topic modeling, document clustering, and other unsupervised learning tasks on collections of textual documents. It provides a high-level interface for building and training a variety of models. Gensim stands for generate similar. It finds the similarities between documents to summarize texts or to classify documents into topics.

In this chapter, we will cover the following topics:

  • Introduction to natural language processing
  • NLU + NLG = NLP
  • Gensim and its NLP modeling techniques
  • Topic modeling with BERTopic
  • Common NLP Python modules included in this book

After completing this chapter, you will get to know the development history of NLP. You will be able to explain the key NLP techniques that Gensim covers. You will also understand other popular NLP Python libraries that are often used together.

Introduction to natural language processing

NLP is based on 50 years of rich research into linguistics and processing algorithms. It is a branch of computer science and artificial intelligence (AI) that uses computer algorithms to analyze, understand, and generate human language data. The algorithms process human language to “understand” its full meaning. NLP has a wide range of applications, including the following:

  • Text mining: Extracting information from large amounts of text data, such as documents, emails, and social media posts.
  • Information retrieval: Searching for relevant information in large text databases. In this book, you will learn many techniques for information retrieval.
  • Question answering: Answering questions posed in natural language.
  • Machine translation: Translating text from one language to another.
  • Sentiment analysis: Identifying the tone and emotion of text data.
  • Natural language generation (NLG): Generating text that mimics human language.

As I said before, NLP has a long development history. Let’s look into it briefly.

NLU + NLG = NLP

NLP is an umbrella term that covers natural language understanding (NLU) and NLG. We’ll go through both in the next sections.

NLU

Many languages, such as English, German, and Chinese, have been developing for hundreds of years and continue to evolve. Humans can use languages artfully in various social contexts. Now, we are asking a computer to understand human language. What’s very rudimentary to us may not be so apparent to a computer. Linguists have contributed much to the development of computers’ understanding in terms of syntax, semantics, phonology, morphology, and pragmatics.

NLU focuses on understanding the meaning of human language. It extracts text or speech input and then analyzes the syntax, semantics, phonology, morphology, and pragmatics in the language. Let’s briefly go over each one:

  • Syntax: This is about the study of how words are arranged to form phrases and clauses, as well as the use of punctuation, order of words, and sentences.
  • Semantics: This is about the possible meanings of a sentence based on the interactions between words in the sentence. It is concerned with the interpretation of language, rather than its form or structure. For example, the word “table” as a noun can refer to “a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs” or a data frame in a computer language.

Let’s elaborate more on semantics with two jokes. The first example is as follows:

Patient: “Doctor, doctor! I’ve broken my arm in three places!”

Doctor: “Well, stop going to those places, then.”

The patient uses the word places to mean the spots on the arm; the doctor uses places to mean physical locations.

The second example is as follows:

My coworker: “Do you ever think about working from home?”

Me: “I don’t even think about work at work!”

The first work in my reply refers to the tasks of my job; the second work, in at work, means my place of employment.

NLU can understand the two meanings of a word in such jokes through a technique called word embedding. We will learn more about this in Chapter 2, Text Representation.

  • Phonology: This is about the study of the sound system of a language, including the sounds of speech (phonemes), how they are combined to form words (morphology), and how they are organized into larger units such as syllables and stress patterns. For example, the sounds represented by the letters “p” and “b” in English are distinct phonemes. A phoneme is the smallest unit of sound in a language that can change the meaning of a word. Consider the words “pat” and “bat.” The only difference between these two words is the initial sound, but their meanings are different.
  • Morphology: This is the study of the structure of words, including the way in which they are formed from smaller units of meaning called morphemes. It originally comes from “morph,” the shape or form, and “ology,” the study of something. Morphology is important because it helps us understand how words are formed and how they relate to each other. It also helps us understand how words change over time and how they are related to other words in a language. For example, the word “unkindness” consists of three separate morphemes: the prefix “un-,” the root “kind,” and the suffix “-ness.”
  • Pragmatics: This is the study of how language is used in a social context. Pragmatics is important because it helps us understand how language works in real-world situations, and how language can be used to convey meaning and achieve specific purposes. For example, if you offer to buy your friend a McDonald’s burger, large fries, and a large drink, your friend may reply "no" out of concern about gaining weight. Literally, the reply may simply mean that the meal is high in calories, but in its social context, it can also imply that your friend is worried about their weight.

Now, let’s understand NLG.

NLG

While NLU is concerned with reading (a computer comprehending text), NLG is concerned with writing (a computer producing text). The term generation in NLG refers to an NLP model generating meaningful words or even articles. Today, when you compose an email or type a sentence in an app, it suggests possible words to complete your sentence or performs automatic correction. These are applications of NLG. The term generative AI has been coined for generative models covering language, voice, image, and video generation. ChatGPT and GPT-4 from OpenAI are probably the most famous examples of generative AI. I will briefly introduce ChatGPT, GPT-4, and other open source products, such as gpt.h2o.ai, and present the HuggingFace.co open source community.

ChatGPT and GPT-4

You can enter a prompt for ChatGPT to generate a poem, a story, and so on. With its use of the reinforcement learning from human feedback (RLHF) technique, ChatGPT is designed to respond to questions in a way that sounds natural and human-like, making it easier for people to communicate with the model. It is also able to remember earlier parts of the conversation. If you ask ChatGPT, “Should I wear a blue shirt or a white shirt for tomorrow’s outdoor company meeting when it is likely to be hot and sunny?”, it will formulate a response by inferring a sequence of words that are likely to come next. It answers, “When it is hot and sunny, you may want to wear a white shirt.” If you reply, “How about on a gloomy day?”, ChatGPT understands that you mean to ask, in continuation of your prior question, about what to wear for tomorrow’s outdoor company meeting. It does not take the second question as an independent question and answer it randomly. This ability to remember and contextualize inputs is what gives ChatGPT the ability to carry on some semblance of a human conversation rather than give naïve, one-off answers. Hence, having memory for long context is key for next-generation language models such as GPT-4.

Generative Pre-trained Transformer 4 (GPT-4) was released by OpenAI on March 14, 2023. It is a transformer-based model that can predict the next word or token. It can answer questions, summarize text, translate text to other languages, and generate code, blog posts, stories, conversations, and other content types. The ability to remember and contextualize inputs, known as the context window, is key for a language model. The context window of GPT-4 has been increased to roughly 8,000 words. Given that English conversational speed is about 120 words per minute, or 7,200 words per hour, a context window of 8,000 words covers more than an hour of conversation.
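The back-of-the-envelope arithmetic above is easy to verify:

```python
WORDS_PER_MINUTE = 120        # typical English conversational speed
CONTEXT_WINDOW_WORDS = 8_000  # approximate GPT-4 context window, in words

words_per_hour = WORDS_PER_MINUTE * 60
minutes_of_conversation = CONTEXT_WINDOW_WORDS / WORDS_PER_MINUTE
print(words_per_hour)                      # 7200
print(round(minutes_of_conversation, 1))   # 66.7 -- just over an hour
```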

What are the token and transformer mentioned in the previous paragraph? A token is the NLP term for a unit of text, typically a word, separated from a sentence through a process called tokenization. A transformer is the milestone language model architecture introduced in 2017 by Vaswani et al. [2]. In Chapter 2, Text Representation, I will explain how tokenization works in text preprocessing. In Chapter 13, The Ensemble LDA for Model Stability, I will give high-level explanations of the transformer model and others, including the BERTopic model that builds on it.

Gpt.h2o.ai

gpt.h2o.ai is an LLM offering by H2O.ai. It aims to provide enterprise-grade LLM capabilities comparable to ChatGPT, but is open source. I let gpt.h2o.ai define itself by asking, “What is gpt.h2o.ai?” It answered: “gpt.h2o.ai is a large language model that was trained on a combination of web crawl data and Reddit comments. The model can generate human-like text and has been used for a variety of tasks, including generating code, answering questions, and summarizing text.”

Let’s look at its storytelling capability. I entered: “Write a fable on a lion that rescued a boy when the boy was under the attack of aliens.” It returned the following interesting story:

“Once upon a time, there was a young boy named Jack who loved to explore the woods near his home. One day, while he was out on one of his adventures, he came across a strange sight. There, in the middle of the forest, was a spaceship. It was unlike anything Jack had ever seen before. As he approached the ship, he heard a strange noise coming from inside. Jack peered through the window and saw a group of aliens inside. They were holding a young boy captive and seemed to be doing something to him. Jack knew he had to act quickly. Just then, a lion appeared out of nowhere. The lion roared loudly, scaring the aliens away. The boy was freed and the spaceship took off into the sky. The boy thanked the lion for saving him and asked how he knew to come to his rescue. The lion explained that it had been watching over him for some time and knew that he was in danger. The boy and the lion became friends. Jack never forgot the day that he saw the lion save the boy from the aliens.”

HuggingFace.co

New research on LLMs has burgeoned over the past few years. New LLMs, datasets, and solutions are hard to keep up with, creating a knowledge barrier for anyone who wants to enter the world of generative AI. Hugging Face provides a platform, HuggingFace.co, that gives everyone access to open source LLMs, academic papers, and the various datasets used to train those LLMs. You can also share your own LLMs and datasets.

You may ask why LLMs can answer questions, write book reports, draft notes, or summarize documents. An important factor is the data on which they were trained or with which they were fine-tuned. These large-scale datasets include “News/Wikipedia,” “Web crawling,” “Questions and Answers (Q&A),” “Books,” and “Reading comprehension.” In “Large Language Model Datasets” [3], I have given a more detailed review of the datasets that trained prominent LLMs such as GPT-2, GPT-3, GPT-4, and so on.

I trust you have gained a better understanding of NLU and NLG. Next, I will introduce the NLP techniques covered by Gensim.

Gensim and its NLP modeling techniques

Gensim is actively maintained and supported by a community of developers and is widely used in academic research and industry applications. It covers many important techniques that form the workhorse of today’s NLP. That’s one of the reasons why I have developed this book to help data scientists.

Last year, I was at a company’s year-end party. The ballroom was filled with people standing in groups with their drinks. I walked around and listened for conversation topics where I could chime in. I heard one group talking about the FIFA World Cup 2022 and another group talking about stock markets. I joined the stock markets conversation. In that short moment, my mind had performed “word extractions,” “text summarization,” and “topic classifications.” These tasks are the core tasks of NLP and what Gensim is designed to do.

We perform serious text analyses in professional fields including legal, medical, and business. We organize similar documents into topics. Such work also demands “word extractions,” “text summarization,” and “topic classifications.” In the following sections, I will give you a brief introduction to the key models that Gensim offers so you will have a good overview. These models include the following:

  • BoW and TF-IDF
  • Latent semantic analysis/indexing (LSA/LSI)
  • Word2Vec
  • Doc2Vec
  • Text summarization
  • LDA
  • Ensemble LDA

BoW and TF-IDF

Texts can be represented as a bag of words, which records the count frequency of each word. Consider the following two phrases:

  • Phrase 1: All the stars we steal from the night sky
  • Phrase 2: Will never be enough, never be enough, never be enough for me

The BoW presents the word count frequency as shown in Figure 1.1. For example, the word the in the first sentence appears twice, so it is coded as 2; the word be in the second sentence appears three times, so it is coded as 3:

Figure 1.1 – BoW encoding (also presented in the next chapter)


BoW uses the word count to reflect the significance of a word. However, a raw count is not always a good measure of importance. Frequent words may not carry special meaning, depending on the type of document. For example, in clinical reports, the words physician, patient, doctor, and nurse appear frequently. The high frequency of these words may overshadow specific words such as bronchitis or stroke in a patient’s document. A better encoding system compares a word’s frequency in a document to its frequency throughout the corpus. TF-IDF is designed to reflect the importance of a word in a document by weighting its relevance against the corpus. We will learn the details of this in Chapter 2, Text Representation. At this moment, you just need to know that both BoW and TF-IDF are variations of text representation. They are the building blocks of NLP.
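As a rough sketch of the idea (Gensim's TfidfModel implements a refined version, covered in Chapter 2), a word's weight can be computed as its frequency within a document multiplied by the log of its rarity across the corpus. The toy corpus below mirrors the clinical-report example: "doctor" appears in every document, so its weight collapses to zero, while the distinctive "bronchitis" keeps a positive weight:

```python
import math

def tf_idf(term, document, corpus):
    """Toy TF-IDF: term frequency times inverse document frequency."""
    tf = document.count(term) / len(document)
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / n_docs_with_term)
    return tf * idf

corpus = [
    ["patient", "doctor", "reports", "bronchitis"],
    ["patient", "doctor", "annual", "checkup"],
    ["patient", "nurse", "doctor", "notes"],
]

print(tf_idf("doctor", corpus[0], corpus))                 # 0.0 -- appears in every document
print(round(tf_idf("bronchitis", corpus[0], corpus), 3))   # 0.275 -- rare, hence weighted up
```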

Although BoW and TF-IDF appear simple, they already have real-world applications in different fields. An important application of BoW and TF-IDF is to prevent spam emails from going to the inbox folder of an email account. Spam emails are ubiquitous, unavoidable, and quickly fill up the spam folder. BoW or TF-IDF will help to distinguish the characteristics of a spam email from regular emails.

LSA/LSI

Suppose you were a football fan searching a library system that could only perform exact keyword matching, using the keywords famous World Cup players. The old computer system would return all articles that contained famous, world, cup, or players, including many unrelated articles such as famous singer, 150 most famous poems, and world-renowned scientist. This is terrible, isn’t it? A simple keyword match cannot serve as a search engine.

Latent semantic analysis (LSA) was developed in the 1990s. It's an NLP solution that far surpasses naïve keyword matching and has become an important search engine algorithm. Prior to that, in 1988, an LSA-based information retrieval system was patented (US Patent #4839853, now expired) and named “latent semantic indexing,” so the technique is also called latent semantic indexing (LSI). Gensim and many other reports name LSA as LSI so as not to confuse LSA with LDA. In this book, I will adopt the same naming convention. In Chapter 6, Latent Semantic Indexing with Gensim, I will show you the code example to build an LSI model.
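Under the hood, LSA applies a truncated singular value decomposition (SVD) to the term-document matrix. The sketch below uses NumPy as an assumed stand-in for Gensim's LsiModel (built in Chapter 6); the term-document matrix is a made-up toy:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
# terms: ["world", "cup", "players", "famous", "poems"]
A = np.array([
    [1, 1, 0],   # world
    [1, 1, 0],   # cup
    [1, 0, 0],   # players
    [0, 1, 1],   # famous
    [0, 0, 1],   # poems
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # number of latent "topics" to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a k-dimensional vector
print(doc_vectors.shape)                     # (3, 2): three documents, two latent dimensions
```

Queries and documents are compared in this low-dimensional latent space, which is why LSA can match by meaning rather than by exact keyword overlap.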

You can search with keywords such as the following:

Crude prices inflation the economy outlook earnings

This can return relevant news articles. One of the results is as follows:

A huge jump in wholesale prices sent stocks falling yesterday as investors worried that rising oil prices were taking a toll on the overall economy. (Data source: AG news data)

Notice it searches by meaning but not by word matching.

Word2Vec

The Word2Vec technique developed by Mikolov et al. [4] in 2013 was a significant milestone in NLP. Its idea was ground-breaking: it embeds words or phrases from a text corpus as dense, continuous-valued vectors, hence the name word-to-vector. These vector representations capture semantic relationships and contextual information between words. Its applications are prevalent in many recommendation systems. Figure 1.2 shows words that are close to the word iron, including gunpowder, metals, and steel; words far from iron include organic, sugar, and grain:

Figure 1.2 – An overview of Word2Vec (also presented in Chapter 7)


Also, the relative distance of words measures the similarity of meanings. Word2Vec enables us to measure and visualize the similarities or dissimilarities of words or concepts. This is a fantastic innovation.
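The "distance" between words is typically measured as the cosine similarity between their vectors. Here is a minimal sketch using made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: "iron" and "steel" point in similar directions
iron  = [0.9, 0.1, 0.3]
steel = [0.8, 0.2, 0.4]
sugar = [0.1, 0.9, 0.2]

print(round(cosine_similarity(iron, steel), 3))  # close to 1: similar meanings
print(round(cosine_similarity(iron, sugar), 3))  # much smaller: dissimilar meanings
```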

Can you see how this idea can also apply to movie recommendations? Each movie can be considered a word in someone’s watching history. I Googled the words movie recommendations, and it returned many movies under “Top picks for you”:

Figure 1.3 – An overview of Word2Vec as a movie recommendation system (also presented in Chapter 7)


Doc2Vec

Word2Vec represents a word with a vector. Can we represent a sentence or a paragraph with a vector? Doc2Vec is designed to do so. Doc2Vec transforms articles into vectors and enables semantic search for related articles. Doc2Vec has enabled many commercial products. For example, when you search for a job on LinkedIn.com or Indeed.com, you see similar job postings presented next to your target job posting; this can be done with Doc2Vec. In Chapter 8, Doc2Vec with Gensim, you will build a real Doc2Vec model with code examples.

LDA

When documents are tagged by topic, we can retrieve the documents easily. In the old days, if you went to a library for books of a certain genre, you used the indexing system to find them. Now, with all the digital content, documents can be tagged systematically by topic modeling techniques.

The library example is an easy one compared to the flood of social media posts, job posts, emails, news articles, and tweets. Topic models can tag digital content for effective searching and retrieval. LDA is an important topic modeling technique with many commercial use cases. Figure 1.4 shows a snapshot of the LDA model output that we will build in Chapter 11, LDA Modeling:

Figure 1.4 – pyLDAvis (also presented in Chapter 11)


Each bubble represents a topic. The distance between any two bubbles represents the difference between the two topics. The red bars on top of the blue bars represent the estimated frequency of a word for a chosen topic. Documents on the same topic are similar in their content. If you are reading a document that belongs to Topic 75 and want to read more related articles, LDA can return other articles on Topic 75.

Ensemble LDA

The goal of topic modeling for a set of documents is to find topics that are reliable and reproducible. If you replicate the same modeling process for the same documents, you expect to produce the same set of topics. However, past experiments have shown that while most runs of the same model produce the same set of topics, some runs can produce extra topics. This creates a serious issue in practice: Which model outcome is the correct one to use? This issue seriously limits the applications of LDA. So, we turn to the ensemble method from machine learning. An ensemble is a technique that combines multiple individual models to improve predictive accuracy and generalization performance. Ensemble LDA builds many models to identify a core set of topics that is consistently reliable and reproducible. In Chapter 13, The Ensemble LDA for Model Stability, I will explain the algorithm with visual aids in more detail. We will also build our own model with code examples.

Topic modeling with BERTopic

BERTopic is a topic modeling algorithm based on BERT word embeddings. In Chapter 14, LDA and BERTopic, we will learn the key components of BERTopic and build our own model. In addition, BERTopic has its own visualization functions, similar to pyLDAvis, as seen in Figure 1.4. We will learn to use all the visualization functions as well.

Figure 1.5 shows the top words for eight topics:


Figure 1.5 – An overview of topic modeling results by BERTopic (also presented in Chapter 14)


I trust these introductions have given you a strong appetite to dive into each chapter and apply the models discussed in your future work. Now, let's get familiar with the terminology commonly used in NLP.

Common NLP Python modules included in this book

This book includes a few Python modules for the best learning outcomes. If an NLP task can be performed by other libraries, such as scikit-learn or NLTK, I will show you the code examples for comparison. The libraries included in this book are detailed in the following sections.

spaCy

spaCy is by far the best production-level, open source library for NLP. It makes many processing tasks easy, with reliable code and outcomes. If you work with a large volume of texts for text preprocessing, spaCy is an excellent choice. It offers a simple, concise Python API, with its performance-critical parts implemented in Cython for speed.

It can perform a wide range of NLP operations well. These NLP operations include the following tasks:

  • Tokenization: This breaks text into individual words or tokens. To a computer, a sentence is just a string of characters. The string has to be separated into words.
  • Part-of-speech (PoS) tagging: This assigns grammatical labels to each word in a sentence. For example, the sentence “She loves the beautiful flower” has a pronoun (“she”), a verb (“loves”), an adjective (“beautiful”), and a noun (“flower”). The labeling for the pronoun, verb, adjective, and noun is called PoS tagging.
  • Named entity recognition (NER): This identifies named entities such as names, organizations, locations, and so on. For example, in the sentence “I went to New York City on July 4th,” the named entities would be “New York City” (a place) and “July 4th” (a date). It is worth mentioning that spaCy also offers transformer-based pipelines whose NER components build on the BERT architecture. As we will learn about BERT in this book, it is helpful to be aware of this.
  • Lemmatization: This reduces words to their base or dictionary form. We will learn more about lemmatization in Chapter 3, Text Wrangling and Preprocessing.
  • Rule-based matching: This can find sequences of words based on user-defined rules.
  • Word vectors: These represent words as numerical vectors. Once two words are represented as vectors, they can be compared in the vector space. Word embedding, or vectorization, is an important step in NLP, and spaCy provides the functions to do it. We will learn about the concept and practice of word vectorization in Chapter 7, Using Word2Vec.
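To see why tokenization is less trivial than splitting on spaces, compare a naive split with a slightly smarter rule that separates punctuation. This is only a conceptual sketch; spaCy's tokenizer handles many more cases, such as abbreviations and language-specific rules:

```python
import re

sentence = "She loves the beautiful flower, doesn't she?"

naive = sentence.split()
# Match words (optionally with an internal apostrophe), or single punctuation marks
smarter = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(naive)    # punctuation stuck to words: [..., "flower,", "doesn't", "she?"]
print(smarter)  # punctuation split off: [..., "flower", ",", "doesn't", "she", "?"]
```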

spaCy can be easily integrated with other libraries such as Gensim and NLTK. That’s why in many code examples you see that spaCy, Gensim, and NLTK are used together.

These are just some of the main capabilities of spaCy, and it offers many more features and functionalities for NLP tasks.

NLTK

NLTK is an open source Python library for natural language processing. It provides a suite of tools for working with text data, including tokenization, PoS tagging, and NER. It provides interfaces to over 50 corpora and lexical resources, such as WordNet. NLTK also includes a number of pre-trained models for tasks such as sentiment analysis and topic modeling. It is widely used in academia and industry for research and development in NLP. NLTK can perform a range of NLP tasks too, including PoS, NER, sentiment analysis, text classification, and text summarization.

Summary

This chapter provided a landscape view of the NLP topics covered in this book. We learned that NLP comprises two complementary fields, NLU and NLG. Then, we surveyed the NLP techniques that are covered by Gensim. The main techniques include BoW, TF-IDF, LSA/LSI, Word2Vec, Doc2Vec, LDA, and Ensemble LDA. We were also introduced to BERTopic modeling. Finally, we learned about two other popular NLP Python libraries, spaCy and NLTK.

As we all know, a computer operates on zeros and ones but cannot comprehend the great works of Shakespeare. So, how do ChatGPT and other language models understand language? The very first step is to convert words to numerical values. The next chapter will teach you about text representation.

Questions

  1. Describe natural language processing (NLP).
  2. What is natural language understanding (NLU)?
  3. What is natural language generation (NLG)?
  4. List some of the NLP modeling techniques used by Gensim.
  5. List some of the most used NLP Python modules.
  6. Once you have answered the previous questions, let’s access https://chat.openai.com/ to search for answers. This time, let’s key in a question, called a “prompt,” to get answers. You are encouraged to experiment with ChatGPT with variations of the questions. For example, you can test the following for Question 1:
    1. “Please describe natural language processing.”
    2. “Please describe NLP to a high schooler.”
    3. “Please describe NLP in one paragraph.”
    4. “Please describe NLP with an analogy.”

References

  1. Wei, Low De (2022, December 2). This AI Chatbot Is Blowing People’s Minds. Here’s What It’s Been Writing. Bloomberg.com. https://www.bloomberg.com/news/articles/2022-12-02/chatgpt-openai-s-new-essay-writing-chatbot-is-blowing-people-s-minds?leadSource=uverify%20wall
  2. Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv, abs/1706.03762.
  3. Kuo, Chris, (2023) Large Language Model Datasets, May 9, 2023, https://dataman-ai.medium.com/large-language-model-datasets-95df319a110
  4. Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations. https://arxiv.org/abs/1301.3781
  5. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.
  6. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T.J., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. ArXiv, abs/2005.14165.

Key benefits

  • Advance your NLP skills with this comprehensive guide covering detailed explanations and code practices
  • Build real-world topic modeling pipelines and fine-tune hyperparameters to deliver optimal results
  • Explore real-world industrial applications of topic modeling in medical, legal, and other fields
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Navigating the terrain of NLP research and applying it practically can be a formidable task made easy with The Handbook of NLP with Gensim. This book demystifies NLP and equips you with hands-on strategies spanning healthcare, e-commerce, finance, and more to enable you to leverage Gensim in real-world scenarios. You’ll begin by exploring motives and techniques for extracting text information like bag-of-words, TF-IDF, and word embeddings. This book will then guide you on topic modeling using methods such as Latent Semantic Analysis (LSA) for dimensionality reduction and discovering latent semantic relationships in text data, Latent Dirichlet Allocation (LDA) for probabilistic topic modeling, and Ensemble LDA to enhance topic modeling stability and accuracy. Next, you’ll learn text summarization techniques with Word2Vec and Doc2Vec to build the modeling pipeline and optimize models using hyperparameters. As you get acquainted with practical applications in various industries, this book will inspire you to design innovative projects. Alongside topic modeling, you’ll also explore named entity handling and NER tools, modeling procedures, and tools for effective topic modeling applications. By the end of this book, you’ll have mastered the techniques essential to create applications with Gensim and integrate NLP into your business processes.

Who is this book for?

This book is for data scientists and professionals who want to become proficient in topic modeling with Gensim. NLP practitioners can use this book as a code reference, while students or those considering a career transition will find this a valuable resource for advancing in the field of NLP. This book contains real-world applications for biomedical, healthcare, legal, and operations, making it a helpful guide for project managers designing their own topic modeling applications.

What you will learn

  • Convert text into numerical values such as bag-of-words, TF-IDF, and word embeddings
  • Use various NLP techniques with Gensim, including Word2Vec, Doc2Vec, LSA, FastText, LDA, and Ensemble LDA
  • Build topic modeling pipelines and visualize the results of topic models
  • Implement text summarization for legal, clinical, or other documents
  • Apply core NLP techniques in healthcare, finance, and e-commerce
  • Create efficient chatbots by harnessing Gensim's NLP capabilities

Product Details

Publication date : Oct 27, 2023
Length: 310 pages
Edition : 1st
Language : English
ISBN-13 : 9781803244945



Packt Subscriptions

See our plans and pricing
€18.99 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

€189.99 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or video every month to keep
  • PLUS own as many other DRM-free eBooks or videos as you like for just €5 each
  • Exclusive print discounts

€264.99 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or video every month to keep
  • PLUS own as many other DRM-free eBooks or videos as you like for just €5 each
  • Exclusive print discounts

Frequently bought together


Natural Language Understanding with Python: €37.99
The Handbook of NLP with Gensim: €37.99
Generative AI with LangChain: €37.99
Total: €113.97

Table of Contents

23 Chapters
Part 1: NLP Basics
Chapter 1: Introduction to NLP
Chapter 2: Text Representation
Chapter 3: Text Wrangling and Preprocessing
Part 2: Latent Semantic Analysis/Latent Semantic Indexing
Chapter 4: Latent Semantic Analysis with scikit-learn
Chapter 5: Cosine Similarity
Chapter 6: Latent Semantic Indexing with Gensim
Part 3: Word2Vec and Doc2Vec
Chapter 7: Using Word2Vec
Chapter 8: Doc2Vec with Gensim
Part 4: Topic Modeling with Latent Dirichlet Allocation
Chapter 9: Understanding Discrete Distributions
Chapter 10: Latent Dirichlet Allocation
Chapter 11: LDA Modeling
Chapter 12: LDA Visualization
Chapter 13: The Ensemble LDA for Model Stability
Part 5: Comparison and Applications
Chapter 14: LDA and BERTopic
Chapter 15: Real-World Use Cases
Assessments
Index
Other Books You May Enjoy

Customer reviews

Top Reviews

Rating distribution: 5.0 out of 5 (6 ratings)
5 star: 100% | 4 star: 0% | 3 star: 0% | 2 star: 0% | 1 star: 0%

Om S, Nov 27, 2023 (5 stars)
Dive into the world of Natural Language Processing (NLP) with 'The Handbook of NLP with Gensim,' a practical guide to unlocking the power of topic modeling. Learn essential techniques, from converting text into numerical values to advanced methods like Word2Vec and Doc2Vec. This resource provides hands-on strategies for building impactful applications. Whether you're a seasoned practitioner or new to NLP, gain practical insights and real-world skills effortlessly. Explore the depths of NLP with Gensim, making complex topics accessible. Elevate your understanding with clear explanations and code practices. Master the art of topic modeling, fine-tune hyperparameters, and excel in various industries. This guide is your key to advancing your NLP journey and becoming a leader in the field.
Amazon Verified review

H2N, Nov 29, 2023 (5 stars)
A great guide on NLP for both beginners and seasoned professionals, especially in the era of ChatGPT and GPT-4. It offers a good exploration of NLP, from traditional methods to modern LLMs, highlighting key developments like Word2Vec and LSA. The book pairs theoretical knowledge with practical applications, covering NLP concepts and model building using Python libraries like Gensim and spaCy, with a focus on Transformer-based BERTopic modeling. Emphasizing practical implementation for model deployment, it concludes with diverse NLP use cases, making it a comprehensive resource for anyone interested in mastering NLP.
Amazon Verified review

Albert, Nov 21, 2023 (5 stars)
During my pursuit of a master's degree at Columbia University, I had the privilege of studying anomaly detection under Professor Chris Kou. This course equipped me with the knowledge of utilizing both supervised and unsupervised machine learning techniques to identify potential anomalies within massive datasets. The applications covered ranged from loan defaults and credit card fraud to prescription fraud, showcasing the diverse practical scenarios where anomaly detection plays a crucial role. Now, it brings me great joy to see my professor, Chris Kou, authoring a new book titled "Handbook of NLP with Gensim: Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data." This book is a valuable resource for those seeking a deep understanding of natural language processing and aiming to unveil hidden patterns, themes, and valuable insights in textual data using the Gensim library. Chris Kou has always been an authority in the field, and his new book undoubtedly provides readers with profound and practical guidance, helping them comprehend and apply natural language processing techniques effectively. This handbook is not just an introduction to tools and technologies; it's a guide that empowers readers to navigate the sea of information, discover endless value, and gain insights. If you are interested in learning how to extract information, interpret themes, and uncover latent insights from textual data, then "Handbook of NLP with Gensim" is an indispensable companion.
Amazon Verified review

Chris, Nov 22, 2023 (5 stars)
I would love to discuss the book further with readers and present more use cases. I still remember the midnight when I was writing the chapter on text representation while listening to the song "Never Enough" from The Greatest Showman. I used the lyrics to show readers how accessible NLP can be.
Amazon Verified review

Fi, Dec 31, 2023 (5 stars)
As a former student of Professor Chris Kou at Columbia University, I am thrilled to witness the release of his latest masterpiece, "Handbook of NLP with Gensim." Having had the privilege of studying anomaly detection under Professor Kou's guidance, I knew I was in for a treat, and this book exceeded my already high expectations. This comprehensive guide transcends the theoretical realm, providing a hands-on approach to navigating the complex landscape of Natural Language Processing (NLP). Professor Kou skillfully demystifies NLP, making it accessible to both beginners and seasoned practitioners. The book not only equips readers with a profound understanding of the Gensim library but also empowers them with practical strategies applicable across diverse industries. The real-world applications presented in the book, spanning healthcare, finance, and e-commerce, provide a bridge between theory and industry practice. Whether you're a data scientist looking to enhance your topic modeling skills or a professional considering a career transition into NLP, this handbook is an indispensable resource.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use toward owning content.

How can I cancel my subscription?

To cancel your subscription, simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - where you will see the ‘Cancel subscription’ button in the grey box containing your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy DRM-free books, the same way that you would pay for a book. Your credits can be found on the subscription homepage - subscription.packtpub.com - by clicking on the ‘My Library’ dropdown and selecting ‘Credits’.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.

We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most.

You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.