What this book covers
Chapter 1, Introduction to NLP, is an introductory chapter that traces the development from Natural Language Understanding (NLU) and Natural Language Generation (NLG) to NLP. It briefly introduces the core techniques, including text preprocessing, LSA/LSI, Word2Vec, Doc2Vec, LDA, Ensemble LDA, and BERTopic. It presents the open source NLP libraries Gensim, scikit-learn, and spaCy.
Chapter 2, Text Representation, starts with the basic step of text representation. It explains the motivation for moving from one-hot encoding to Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). It demonstrates how to perform BoW and TF-IDF with Gensim, scikit-learn, and NLTK.
Chapter 3, Text Wrangling and Preprocessing, presents the essential text preprocessing tasks: (a) tokenization, (b) lowercase conversion, (c) stop word removal, (d) punctuation removal, (e) stemming, and (f) lemmatization. It guides you through performing these preprocessing tasks with Gensim, spaCy, and NLTK.
Chapter 4, Latent Semantic Analysis with scikit-learn, presents the theory of LSA/LSI. This chapter introduces Singular Value Decomposition (SVD), Truncated SVD, and the application of Truncated SVD to LSA/LSI. This chapter uses scikit-learn to illustrate explicitly the transition from Truncated SVD to LSA/LSI.
Chapter 5, Cosine Similarity, is dedicated to explaining this fundamental measure in NLP. Cosine similarity, like other metrics such as Euclidean distance or Manhattan distance, measures the similarity between embedded data in the vector space. This chapter also outlines the applications of cosine similarity to image comparison and querying.
Chapter 6, Latent Semantic Indexing with Gensim, builds an LSA/LSI model with Gensim. This chapter introduces the coherence score, which is used to determine the optimal number of topics. It shows how to score new documents with cosine similarity as part of an information retrieval tool.
Chapter 7, Using Word2Vec, introduces the milestone Word2Vec technique and its two neural network architectural variations: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). It illustrates the concept and operation of word embedding in the vector space. It guides you to build a Word2Vec model and prepare it as part of an information retrieval tool. It visualizes the word vectors of a Word2Vec model with t-SNE and TensorBoard (by TensorFlow). This chapter ends with a comparison of Word2Vec with Doc2Vec, GloVe, and FastText.
Chapter 8, Doc2Vec with Gensim, presents the evolution from Word2Vec to Doc2Vec. It details the two neural network architectural variations: Paragraph Vector with Distributed Bag-of-Words (PV-DBOW) and Paragraph Vector with Distributed Memory (PV-DM). It guides you to build a Doc2Vec model and prepare it as part of an information retrieval tool.
Chapter 9, Understanding Discrete Distributions, introduces a family of distributions, including the Bernoulli, binomial, multinomial, beta, and Dirichlet distributions. Because the more complex distributions generalize the simpler ones, this sequence helps you build up to the Dirichlet distribution. The fact that ‘Dirichlet’ is in the name of LDA tells us its significance. This chapter prepares you for LDA in the next chapter.
Chapter 10, Latent Dirichlet Allocation, presents the LDA algorithm, including the structural design of LDA, generative modeling, and Variational Expectation-Maximization.
Chapter 11, LDA Modeling, demonstrates how to build an LDA model, perform hyperparameter tuning, and determine the optimal number of topics. You will learn the steps to apply an LDA model to score new documents as part of an information retrieval tool.
Chapter 12, LDA Visualization, presents the visualization for LDA. This chapter starts with design thinking about how to present the rich content of a topic model. Then it shows how to use pyLDAvis for visualization.
Chapter 13, The Ensemble LDA for Model Stability, investigates the root causes of the instability of LDA. It explains the Ensemble approach for LDA and the use of Checkback DBSCAN, a clustering algorithm, to deliver a stable set of topics.
Chapter 14, LDA and BERTopic, presents the BERTopic modeling technique that uses an LLM-based BERT algorithm for word embeddings, UMAP for dimensionality reduction of the embeddings, HDBSCAN for topic clustering, c-TF-IDF for the word representation of topics, and Maximal Marginal Relevance (MMR) to fine-tune the word representation of topics. It guides you through BERTopic modeling, visualization, and scoring new documents for topics.
Chapter 15, Real-World Use Cases, presents seven NLP projects in the healthcare, medical, legal, finance, and social media domains. By learning these NLP solutions, you will be motivated to apply the code notebooks from this book to similar tasks or to your own future applications.