Gensim and its NLP modeling techniques
Gensim is actively maintained and supported by a community of developers and is widely used in academic research and industry applications. It covers many of the important techniques that serve as the workhorses of today’s NLP. That’s one of the reasons why I wrote this book for data scientists.
Last year, I was at a company’s year-end party. The ballroom was filled with people standing in groups with their drinks. I walked around and listened for conversation topics where I could chime in. I heard one group talking about the FIFA World Cup 2022 and another talking about the stock market. I joined the stock market conversation. In that short moment, my mind had performed “word extraction,” “text summarization,” and “topic classification.” These are the core tasks of NLP, and they are exactly what Gensim is designed to do.
We perform serious text analyses in professional fields such as law, medicine, and business, and we organize similar documents into topics. Such work also demands “word extraction,” “text summarization,” and “topic classification.” In the following sections, I will give you a brief introduction to the key models that Gensim offers so that you have a good overview. These models include the following:
- BoW and TF-IDF
- Latent semantic analysis/indexing (LSA/LSI)
- Word2Vec
- Doc2Vec
- Text summarization
- LDA
- Ensemble LDA
BoW and TF-IDF
Texts can be represented as a bag of words (BoW), which records how often each word appears. Consider the following two phrases:
- Phrase 1: All the stars we steal from the night sky
- Phrase 2: Will never be enough, never be enough, never be enough for me
BoW encodes each phrase by its word counts, as shown in Figure 1.1. For example, the word the appears twice in the first phrase, so it is coded as 2; the word be appears three times in the second phrase, so it is coded as 3:
Figure 1.1 – BoW encoding (also presented in the next chapter)
BoW uses the word count to reflect the significance of a word. However, a high count does not necessarily mean a word is important. Frequent words may carry no special meaning, depending on the type of document. For example, in clinical reports, the words physician, patient, doctor, and nurse appear frequently, and their high frequency can overshadow specific words such as bronchitis or stroke in a patient’s document. A better encoding compares a word’s frequency in a document with its frequency throughout the corpus. TF-IDF (term frequency-inverse document frequency) is designed to reflect the importance of a word in a document by weighting its frequency in that document against how common it is across the corpus. We will learn the details in Chapter 2, Text Representation. For the moment, you just need to know that both BoW and TF-IDF are forms of text representation. They are the building blocks of NLP.
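To make this concrete, the following is a minimal sketch of both representations in Gensim, using the two phrases above with simplified tokenization (lowercasing and whitespace splitting are assumptions for illustration):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# The two example phrases, tokenized naively for illustration
phrases = [
    "all the stars we steal from the night sky".split(),
    "will never be enough never be enough never be enough for me".split(),
]

# Map each unique token to an integer id
dictionary = Dictionary(phrases)

# BoW: each document becomes a list of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(p) for p in phrases]
print(bow_corpus[0])  # "the" appears with a count of 2

# TF-IDF: re-weight the counts by how rare each token is across the corpus
tfidf = TfidfModel(bow_corpus)
print(tfidf[bow_corpus[0]])
```

With Gensim’s default weighting, a token that appeared in every document would receive an IDF of zero and vanish from the TF-IDF vectors; here the two phrases share no tokens, so every token keeps a positive weight.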
Although BoW and TF-IDF appear simple, they already have real-world applications in many fields. One important application is spam filtering, which keeps spam emails out of the inbox folder of an email account. Spam emails are ubiquitous and unavoidable, and BoW or TF-IDF features help a classifier distinguish the characteristics of spam from those of regular emails.
LSA/LSI
Suppose you were a football fan searching a library system with the keywords famous World Cup players, and the system could only do exact keyword matching. The old computer system would return every article that contained famous, world, cup, or players, including many unrelated ones such as famous singer, 150 most famous poems, and world-renowned scientist. This is terrible, isn’t it? A simple keyword match cannot serve as a search engine.
Latent semantic analysis (LSA), developed around 1990, is an NLP solution that far surpasses naïve keyword matching, and it became an important search engine algorithm. In 1988, an LSA-based information retrieval system was patented (US Patent #4839853, now expired) under the name “latent semantic indexing,” so the technique is also called latent semantic indexing (LSI). Gensim and many other sources refer to LSA as LSI so as not to confuse LSA with LDA. In this book, I will adopt the same naming convention. In Chapter 6, Latent Semantic Indexing with Gensim, I will show you the code to build an LSI model.
With an LSI model, you can search with keywords such as famous World Cup players and get back relevant news articles even when they do not contain those exact words. Notice that it searches by meaning, not by word matching.
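As a preview, here is a minimal sketch of meaning-based retrieval with Gensim’s LsiModel on a hypothetical three-document corpus (the documents, topic count, and tokenization are all illustrative assumptions):

```python
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity

# Toy corpus; a real search engine would index many thousands of documents
docs = [
    "famous world cup players and their careers",
    "the most famous poems of the century",
    "world cup final match report",
]
tokens = [d.lower().split() for d in docs]

dictionary = Dictionary(tokens)
bow = [dictionary.doc2bow(t) for t in tokens]

# Project the BoW vectors into a low-dimensional latent semantic space
lsi = LsiModel(bow, id2word=dictionary, num_topics=2)
index = MatrixSimilarity(lsi[bow])

# Rank documents by cosine similarity to the query in LSI space
query = dictionary.doc2bow("famous world cup players".split())
print(list(index[lsi[query]]))  # one similarity score per document
```

Even a document that shares only some of the query words can rank highly if it lies close to the query in the latent space.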
Word2Vec
The Word2Vec technique, developed by Mikolov et al. [4] in 2013, was a significant milestone in NLP. Its idea was groundbreaking: it embeds words or phrases from a text corpus as dense, continuous-valued vectors, hence the name word-to-vector. These vector representations capture semantic relationships and contextual information between words, and they are prevalent in many recommendation systems. Figure 1.2 shows words that are close to the word iron, including gunpowder, metals, and steel; words far from iron include organic, sugar, and grain:
Figure 1.2 – An overview of Word2Vec (also presented in Chapter 7)
The relative distance between word vectors also measures similarity of meaning, so Word2Vec enables us to measure and visualize the similarities or dissimilarities of words and concepts. This is a fantastic innovation.
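Here is a minimal sketch of training a Word2Vec model in Gensim. The tiny corpus below is made up for illustration; meaningful neighbors require training on millions of words:

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; real training needs a large body of text
sentences = [
    ["iron", "steel", "metals", "forge"],
    ["sugar", "grain", "organic", "farm"],
    ["iron", "metals", "gunpowder", "cannon"],
]

# Train 50-dimensional word vectors with a small context window
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

# Words near "iron" in the learned vector space
print(model.wv.most_similar("iron", topn=3))
```

A call like `most_similar` underlies views such as Figure 1.2: it ranks other words by cosine similarity to iron.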
Can you see how this idea can also apply to movie recommendations? Each movie can be considered a word in someone’s watching history. I Googled the words movie recommendations, and it returned many movies under “Top picks for you”:
Figure 1.3 – An overview of Word2Vec as a movie recommendation system (also presented in Chapter 7)
Doc2Vec
Word2Vec represents a word with a vector. Can we represent a sentence or a paragraph with a vector? Doc2Vec is designed to do exactly that. It transforms articles into vectors and enables semantic search for related articles, and this capability powers many commercial products. For example, when you search for a job on LinkedIn.com or Indeed.com, you see similar job postings presented next to your target job posting; document embeddings such as Doc2Vec make this possible. In Chapter 8, Doc2Vec with Gensim, you will build a real Doc2Vec model with code examples.
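The following is a minimal, hypothetical sketch of the Doc2Vec workflow in Gensim: tag each training document, train the model, and then embed a new piece of text to find its nearest documents (the job postings below are invented for illustration):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Invented toy documents; a real model needs a much larger corpus
docs = [
    "senior data scientist with nlp experience",
    "machine learning engineer for search ranking",
    "pastry chef for a downtown bakery",
]

# Each document gets a unique tag so it can be looked up later
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, seed=0)

# Embed a new query document and retrieve the most similar training docs
query_vec = model.infer_vector("data scientist nlp".split())
print(model.dv.most_similar([query_vec], topn=2))
```

This nearest-neighbor lookup over document vectors is the pattern behind features like “similar job postings.”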
LDA
When documents are tagged by topic, we can retrieve the documents easily. In the old days, if you went to a library for books of a certain genre, you used the indexing system to find them. Now, with all the digital content, documents can be tagged systematically by topic modeling techniques.
The library example is an easy case compared with the flood of social media posts, job posts, emails, news articles, and tweets we face today. Topic models can tag such digital content for effective search and retrieval. LDA is an important topic modeling technique with many commercial use cases. Figure 1.4 shows a snapshot of the LDA model output that we will build in Chapter 11, LDA Modeling:
Figure 1.4 – pyLDAvis (also presented in Chapter 11)
Each bubble represents a topic, and the distance between two bubbles represents how different the two topics are. For each word, the blue bar shows its overall frequency in the corpus, and the red bar overlaid on it shows the word’s estimated frequency within the selected topic. Documents on the same topic are similar in their content. If you are reading a document that belongs to Topic 75 and want to read more related articles, LDA can return other articles on Topic 75.
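As a preview of Chapter 11, here is a minimal sketch of training an LDA model in Gensim on a toy corpus (the documents and the choice of two topics are illustrative assumptions):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus with two rough themes: sports and finance
texts = [
    ["world", "cup", "players", "match", "goal"],
    ["stock", "market", "shares", "trading", "price"],
    ["players", "goal", "match", "team"],
    ["market", "price", "stock", "earnings"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit a two-topic LDA model; random_state makes the run repeatable
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

print(lda.print_topics())                  # top words per topic
print(lda.get_document_topics(corpus[0]))  # topic mixture of the first document
```

A visualization like Figure 1.4 is built from outputs like these: the topic-word distributions and the document-topic mixtures.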
Ensemble LDA
The goal of topic modeling for a set of documents is to find topics that are reliable and reproducible: if you replicate the same modeling process on the same documents, you expect to produce the same set of topics. However, past experiments have shown that while repeated runs of the same model recover a common core of topics, individual runs can also produce extra, spurious topics. This creates a serious issue in practice: which model outcome is the correct one to use? This issue seriously limits the applications of LDA. Here, the ensemble method from machine learning comes to the rescue. In machine learning, an ensemble combines multiple individual models to improve predictive accuracy and generalization performance. Ensemble LDA builds many models and identifies the core set of topics that appears reliably and reproducibly across all of them. In Chapter 13, The Ensemble LDA for Model Stability, I will explain the algorithm with visual aids in more detail. We will also build our own model with code examples.
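As a preview, here is a minimal sketch using Gensim’s EnsembleLda class, reusing the toy corpus from the LDA example above (all parameter values are illustrative assumptions):

```python
from gensim.corpora import Dictionary
from gensim.models import EnsembleLda

texts = [
    ["world", "cup", "players", "match", "goal"],
    ["stock", "market", "shares", "trading", "price"],
    ["players", "goal", "match", "team"],
    ["market", "price", "stock", "earnings"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Train several base LDA models and keep only topics that recur across them
ensemble = EnsembleLda(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,   # topics requested per base model
    num_models=4,   # number of base LDA models in the ensemble
    passes=5,
)

# Convert the stable core topics back into a regular Gensim LDA model;
# on a corpus this tiny, no topic may recur, in which case this returns None
stable_lda = ensemble.generate_gensim_representation()
if stable_lda is not None:
    print(stable_lda.print_topics())
```

Only topics that show up consistently across the base models survive into the final representation, which is what makes the result reproducible.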