The Handbook of NLP with Gensim

Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data

Product type: Paperback
Published: Oct 2023
Publisher: Packt
ISBN-13: 9781803244945
Length: 310 pages
Edition: 1st Edition
Author: Chris Kuo

Introduction to NLP

“Why do we need NLP?” You may be asking this question as you witness the rapid advancement of natural language processing (NLP) in recent years. Let’s see how NLP helped a well-established investment firm named "Harmony Investments." For decades, Harmony Investments had been renowned for its astute financial strategies and its management of portfolios ranging from stocks and bonds to real estate and alternative investments. However, the sheer volume and variety of data sources, including news articles, earnings reports, social media posts, and financial statements, made it nearly impossible to analyze all the information manually. The firm's analysts were spending an excessive amount of time collecting and reviewing data.

Recognizing the need for a more efficient, data-driven approach, the firm partnered with a leading AI solutions provider to bring NLP into its business operations. NLP algorithms reviewed news articles, press releases, and social media posts in real time, which enabled the firm to react swiftly to new information. NLP tools automatically summarized lengthy earnings reports, which reduced the time analysts spent on manual document review. NLP-powered sentiment analysis gauged public sentiment surrounding specific stocks or market segments. With these tasks automated, analysts had more time for strategic research and for developing innovative investment strategies. As a result, Harmony Investments not only retained its reputation as a leading investment firm but also attracted new clients and expanded its portfolio.

Joe is a data scientist who is new to NLP. He and his data analyst colleague, Jacob, want to acquire the NLP techniques that deliver the benefits just described. They have certainly heard of ChatGPT and all the news about large language models (LLMs). They want to learn NLP systematically, from concepts to practice, and are looking for a textbook that can serve as a bridge to LLMs without diving into LLMs first. If you are like Joe or Jacob, then this book is for you.

A fundamental step in getting computers to understand text is text representation, which converts a collection of text documents into numerical values. In the simplest representations, each document is a vector in a high-dimensional space, where each dimension corresponds to a unique word in the entire corpus. This book starts with these count-based representations: bag-of-words (BoW), bag-of-N-grams, and term frequency-inverse document frequency (TF-IDF). Word embedding techniques are an advance on these representations. Word embeddings are dense vector representations of words that capture semantic and syntactic relationships between words based on their context in a large dataset. Embedding techniques such as Word2Vec create continuous vectors in which words with similar meanings have similar vectors. This helps computers capture what words mean and how they relate to each other in sentences.
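To make the idea of a document vector concrete, here is a minimal sketch of bag-of-words and TF-IDF in plain Python. It is an illustration under stated assumptions and not the book's own code: the toy corpus, the whitespace tokenizer, and the helper names `bow_vector` and `tfidf_vector` are invented for this example, and the TF-IDF weighting shown (relative term frequency times the natural log of N/df) is only one of several common variants. Gensim offers its own classes for these representations (for example `corpora.Dictionary` and `models.TfidfModel`); the sketch avoids them so that it runs with the standard library alone.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "stocks rise as earnings beat forecasts",
    "bonds fall as stocks rise",
    "earnings reports lift market sentiment",
]

# Naive whitespace tokenization (real pipelines clean the text first)
tokenized = [doc.split() for doc in docs]

# Vocabulary: one vector dimension per unique word in the corpus
vocab = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens):
    """Bag-of-words: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


bow = [bow_vector(tokens) for tokens in tokenized]

# Document frequency: the number of documents that contain each word
n_docs = len(tokenized)
df = {word: sum(1 for tokens in tokenized if word in tokens) for word in vocab}


def tfidf_vector(tokens):
    """TF-IDF (one variant): relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return [
        (counts[word] / len(tokens)) * math.log(n_docs / df[word])
        for word in vocab
    ]


tfidf = [tfidf_vector(tokens) for tokens in tokenized]

for doc, vector in zip(docs, bow):
    print(doc, "->", vector)
```

Each document becomes a vector with one entry per vocabulary word. In the TF-IDF vectors, a word that appears in only one document (such as "beat") receives a higher weight than a word shared by several documents (such as "stocks").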

Topic modeling is a significant NLP subject. It assigns documents to topics for document retrieval, categorization, tagging, or annotation. This book gives particular attention to the milestone topic modeling technique, Latent Dirichlet Allocation (LDA). Another milestone topic modeling technique is BERTopic, which builds on Bidirectional Encoder Representations from Transformers (BERT). Let me briefly describe how BERT came about. The seminal paper “Attention Is All You Need” by Vaswani et al. [2] introduced the transformer architecture, which enabled many transformer-based word embeddings and LLMs. BERT is one of these models. Can we do topic modeling by grouping documents based on BERT embeddings? That’s the origin of BERTopic. I have included BERTopic in this book together with LDA so you get to see the differences. This will also provide a bridge to transformer-based NLP techniques.

This book is a practical handbook with code snippets. I will cover many techniques in the Gensim library. Gensim is an open source Python library for topic modeling, document clustering, and other unsupervised learning tasks on collections of text documents. It provides a high-level interface for building and training a variety of models. The name Gensim stands for “generate similar.” The library finds similarities between documents in order to summarize texts or to classify documents into topics.
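As a rough illustration of what "similarity between documents" can mean, the sketch below computes cosine similarity between word-count vectors in plain Python. This is an assumption-laden example and not Gensim's implementation: the function name `cosine_similarity`, the three sample sentences, and the use of raw word counts are all choices made for this illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Sample sentences, invented for illustration
doc_a = "the fund bought technology stocks".split()
doc_b = "the fund sold technology stocks".split()
doc_c = "heavy rain flooded the village".split()

sim_ab = cosine_similarity(doc_a, doc_b)  # share four of five words
sim_ac = cosine_similarity(doc_a, doc_c)  # share only "the"

print(f"a vs b: {sim_ab:.2f}")
print(f"a vs c: {sim_ac:.2f}")
```

The two finance sentences score much higher than the finance and weather pair. Note that the score only reflects word overlap: "bought" and "sold" have opposite meanings, yet the first two sentences still look very similar. Limitations of this kind are part of what motivates the richer representations covered later in the book.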

In this chapter, we will cover the following topics:

  • Introduction to natural language processing
  • NLU + NLG = NLP
  • Gensim and its NLP modeling techniques
  • Topic modeling with BERTopic
  • Common NLP Python modules included in this book

After completing this chapter, you will know the development history of NLP. You will be able to explain the key NLP techniques that Gensim covers. You will also be familiar with other popular NLP Python libraries that are often used together with it.
