You're reading from The Handbook of NLP with Gensim Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data

Product type Paperback

Published in Oct 2023

Publisher Packt

ISBN-13 9781803244945

Length 310 pages

Edition 1st Edition

Author (1):

Chris Kuo

Table of Contents

Preface

1. Part 1: NLP Basics

2. Chapter 1: Introduction to NLP

3. Chapter 2: Text Representation

4. Chapter 3: Text Wrangling and Preprocessing

5. Part 2: Latent Semantic Analysis/Latent Semantic Indexing

6. Chapter 4: Latent Semantic Analysis with scikit-learn

7. Chapter 5: Cosine Similarity

8. Chapter 6: Latent Semantic Indexing with Gensim

9. Part 3: Word2Vec and Doc2Vec

10. Chapter 7: Using Word2Vec

11. Chapter 8: Doc2Vec with Gensim

12. Part 4: Topic Modeling with Latent Dirichlet Allocation

13. Chapter 9: Understanding Discrete Distributions

14. Chapter 10: Latent Dirichlet Allocation

15. Chapter 11: LDA Modeling

16. Chapter 12: LDA Visualization

17. Chapter 13: The Ensemble LDA for Model Stability

18. Part 5: Comparison and Applications

19. Chapter 14: LDA and BERTopic

20. Chapter 15: Real-World Use Cases

21. Assessments

22. Index

23. Other Books You May Enjoy

Performing word embedding with BoW and TF-IDF

Let’s start with BoW and TF-IDF. We learned how to prepare both in Chapter 2, Text Representation. BoW records how many times each word occurs in a document, while its variation, TF-IDF, reweights those counts to reflect how important a word is to a document within a corpus.
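To make the distinction concrete, here is a minimal pure-Python sketch on a three-document toy corpus. The corpus, the variable names, and the weighting formula (raw count times the natural log of N/df) are assumptions chosen for illustration. Gensim's own TF-IDF model applies its own defaults for weighting and normalization, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# BoW: raw word counts per document.
bow = [Counter(doc) for doc in docs]

# Document frequency: in how many documents does each word appear?
df = Counter(token for doc in docs for token in set(doc))
n_docs = len(docs)

def tfidf(doc_counts):
    """One common TF-IDF variant: count * ln(N / df)."""
    return {t: c * math.log(n_docs / df[t]) for t, c in doc_counts.items()}

weights = [tfidf(counts) for counts in bow]

for counts, w in zip(bow, weights):
    print(dict(counts), {t: round(v, 3) for t, v in w.items()})
```

A word such as "the", which occurs in every document, gets a weight of zero under this variant, while a word unique to one document gets the highest weight in it.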

We will first use Gensim’s Dictionary class to build and manage a dictionary of terms (words or tokens). It creates a mapping between the unique terms in a corpus and their integer IDs. This mapping is the vocabulary on which the BoW is built:

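from gensim.corpora import Dictionary

# Dictionary() with no arguments starts empty; pass it the tokenized documents (a list of token lists) to populate it
gensim_dictionary = Dictionary()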

Let’s examine the dictionary object, gensim_dictionary. How many unique words does it contain? Its length gives the number of unique words:

len(gensim_dictionary)

The output is 40360, so the dictionary holds 40,360 unique words.

Now, we will create the BoW.

BoW

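We create the BoW by calling the dictionary’s .doc2bow() method on each tokenized document. For each document, it returns a list of (token ID, count) pairs: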

bow_corpus = [gensim_dictionary.doc2bow...
