You're reading from The Handbook of NLP with Gensim: Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data

Product type: Paperback

Published in: Oct 2023

Publisher: Packt

ISBN-13: 9781803244945

Length: 310 pages

Edition: 1st Edition

Author (1):

Chris Kuo

Table of Contents (24) Chapters

Preface

1. Part 1: NLP Basics

2. Chapter 1: Introduction to NLP

3. Chapter 2: Text Representation

4. Chapter 3: Text Wrangling and Preprocessing

5. Part 2: Latent Semantic Analysis/Latent Semantic Indexing

6. Chapter 4: Latent Semantic Analysis with scikit-learn

7. Chapter 5: Cosine Similarity

8. Chapter 6: Latent Semantic Indexing with Gensim

9. Part 3: Word2Vec and Doc2Vec

10. Chapter 7: Using Word2Vec

11. Chapter 8: Doc2Vec with Gensim

12. Part 4: Topic Modeling with Latent Dirichlet Allocation

13. Chapter 9: Understanding Discrete Distributions

14. Chapter 10: Latent Dirichlet Allocation

15. Chapter 11: LDA Modeling

16. Chapter 12: LDA Visualization

17. Chapter 13: The Ensemble LDA for Model Stability

18. Part 5: Comparison and Applications

19. Chapter 14: LDA and BERTopic

20. Chapter 15: Real-World Use Cases

21. Assessments

22. Index

23. Other Books You May Enjoy

What this book covers

Chapter 1, Introduction to NLP, is an introductory chapter that explains the development from Natural Language Understanding (NLU) and Natural Language Generation (NLG) to NLP. It gives a brief overview of the core techniques, including text preprocessing, LSA/LSI, Word2Vec, Doc2Vec, LDA, Ensemble LDA, and BERTopic, and it presents the open source NLP libraries Gensim, scikit-learn, and spaCy.

Chapter 2, Text Representation, starts with the basic step of text representation. It explains the progression from one-hot encoding to bag-of-words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), and it demonstrates how to perform BoW and TF-IDF with Gensim, scikit-learn, and NLTK.
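
As a rough illustration of the two representations named for Chapter 2, the sketch below builds bag-of-words counts and weights them with one common TF-IDF variant (raw term frequency times log(N/df)) in plain Python. The toy corpus and the function names are hypothetical and are not taken from the book. Gensim and scikit-learn apply their own smoothing and normalization, so their numbers will generally differ.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each token to its raw count in one document."""
    return Counter(tokens)


def tf_idf(corpus):
    """Weight each term by tf * log(N / df), one common TF-IDF variant."""
    n_docs = len(corpus)
    bows = [bag_of_words(doc) for doc in corpus]
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for bow in bows for term in bow)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bow.items()}
        for bow in bows
    ]


# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["topic", "models", "find", "themes"],
    ["word", "vectors", "find", "neighbors"],
]
weights = tf_idf(corpus)
```

A term that occurs in every document, such as "find" here, gets a weight of zero under this variant, while terms unique to one document get a positive weight.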

Chapter 3, Text Wrangling and Preprocessing, presents the essential text preprocessing tasks: (a) tokenization, (b) lowercase conversion, (c) stop word removal, (d) punctuation removal, (e) stemming, and (f) lemmatization. It guides you through performing these tasks with Gensim, spaCy, and NLTK.
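
A minimal sketch of tasks (a) through (d) from the Chapter 3 summary, using only the Python standard library, might look like the following. The stop word list is a hypothetical, deliberately tiny stand-in for the lists shipped with Gensim, spaCy, or NLTK. Stemming and lemmatization are left out because they need a library or a dictionary.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The topics of a corpus are hidden, and LDA finds them!")
```

Splitting on whitespace is the simplest possible tokenizer. Library tokenizers handle contractions, hyphens, and non-ASCII punctuation that this sketch ignores.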

Chapter 4, Latent Semantic Analysis with scikit-learn, presents the theory of LSA/LSI. This chapter introduces Singular Value Decomposition (SVD), Truncated SVD, and the application of Truncated SVD to LSA/LSI. It uses scikit-learn to illustrate explicitly how Truncated SVD leads to LSA/LSI.

Chapter 5, Cosine Similarity, is dedicated to explaining this fundamental measure in NLP. Cosine similarity, among other metrics such as Euclidean distance or Manhattan distance, measures the similarity between embedded data in the vector space. This chapter also outlines applications of cosine similarity in image comparison and querying.
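
To make the three metrics named for Chapter 5 concrete, here is a small plain-Python sketch. The vectors are hypothetical toy values chosen so that the results are easy to check by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line (L2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # orthogonal to a
```

Here `a` and `b` point in the same direction, so their cosine similarity is 1 even though their Euclidean and Manhattan distances are nonzero. This insensitivity to vector length is one reason cosine similarity is commonly used to compare documents of different sizes.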

Chapter 6, Latent Semantic Indexing with Gensim, builds an LSA/LSI model with Gensim. This chapter introduces the coherence score, which is used to determine the optimal number of topics. It shows how to score new documents with cosine similarity as part of an information retrieval tool.

Chapter 7, Using Word2Vec, introduces the milestone Word2Vec technique and its two neural network architectural variations: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). It illustrates the concept and operation of word embedding in the vector space. It guides you through building a Word2Vec model and preparing it as part of an information retrieval tool. It visualizes the word vectors of a Word2Vec model with t-SNE and TensorBoard (by TensorFlow). This chapter ends with comparisons of Word2Vec with Doc2Vec, GloVe, and fastText.

Chapter 8, Doc2Vec with Gensim, presents the evolution from Word2Vec to Doc2Vec. It details the two neural network architectural variations: Paragraph Vector with Distributed Bag-of-Words (PV-DBOW) and Paragraph Vector with Distributed Memory (PV-DM). It guides you through building a Doc2Vec model and preparing it as part of an information retrieval tool.

Chapter 9, Understanding Discrete Distributions, introduces the Bernoulli, binomial, multinomial, beta, and Dirichlet distributions. Because the more complex distributions generalize the simpler ones, this sequence helps you understand the Dirichlet distribution. The fact that ‘Dirichlet’ is in the name of LDA signals its significance, and this chapter prepares you for LDA in the next chapter.
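
The way each distribution in the Chapter 9 summary builds on a simpler one can be sketched with Python's standard `random` module. This is an illustrative sketch, not the book's code: the function names, the seed, and the parameter values are hypothetical. The Dirichlet sampler relies on the standard result that independent Gamma(alpha, 1) draws, once normalized to sum to 1, follow a Dirichlet distribution.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def bernoulli(p):
    """One trial: 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def binomial(n, p):
    """Number of successes in n independent Bernoulli trials."""
    return sum(bernoulli(p) for _ in range(n))


def multinomial(n, probs):
    """Counts per category over n draws; extends binomial beyond two outcomes."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[k] += 1
    return counts


def dirichlet(alphas):
    """Sample a probability vector by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def beta(a, b):
    """The beta distribution is the two-category case of the Dirichlet."""
    return dirichlet([a, b])[0]


topic_mix = dirichlet([0.5, 0.5, 0.5])    # hypothetical 3-topic mixture
word_counts = multinomial(20, topic_mix)  # 20 draws from that mixture
```

The last two lines mirror, in miniature, the pairing that LDA depends on: a Dirichlet draw yields a probability vector, and a multinomial draw uses that vector to produce counts.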

Chapter 10, Latent Dirichlet Allocation, presents the LDA algorithm, including the structural design of LDA, generative modeling, and Variational Expectation-Maximization.

Chapter 11, LDA Modeling, demonstrates how to build an LDA model, perform hyperparameter tuning, and determine the optimal number of topics. You will learn the steps to apply an LDA model to score new documents as part of an information retrieval tool.

Chapter 12, LDA Visualization, presents visualization for LDA. This chapter starts with the design thinking behind presenting the rich content of a topic model, then shows how to use pyLDAvis for visualization.

Chapter 13, The Ensemble LDA for Model Stability, investigates the root causes of the instability of LDA. It explains the Ensemble approach for LDA and the use of Checkback DBSCAN, a clustering algorithm, to deliver a stable set of topics.

Chapter 14, LDA and BERTopic, presents the BERTopic modeling technique, which uses a BERT-based language model for embeddings, UMAP for dimensionality reduction of the embeddings, HDBSCAN for topic clustering, c-TF-IDF for the word representation of topics, and Maximal Marginal Relevance (MMR) to fine-tune that word representation. It guides you through BERTopic modeling, visualization, and scoring new documents for topics.

Chapter 15, Real-World Use Cases, presents seven NLP projects in healthcare, medicine, law, finance, and social media. These solutions show how the code notebooks in this book can be applied to similar tasks or adapted to your own future applications.
