What this book covers
Chapter 1, Learning NLP Basics, introduces the very basics of NLP. The recipes in this chapter show the basic preprocessing steps that are required for further NLP work. We show how to tokenize text, or divide it into sentences and words; assign parts of speech to individual words; lemmatize them, or get their canonical forms; and remove stopwords.
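As a taste of these preprocessing steps, here is a minimal sketch using spaCy, one of the libraries the recipes rely on; the sample sentence is illustrative, and the en_core_web_sm model is assumed to have been downloaded:

```python
import spacy

# Load spaCy's small English model
# (install first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet. They slept all day.")

# Divide the text into sentences
for sent in doc.sents:
    print(sent.text)

# Part-of-speech tags and lemmas for each word, with stopwords removed
for token in doc:
    if not token.is_stop and not token.is_punct:
        print(token.text, token.pos_, token.lemma_)
```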
Chapter 2, Playing with Grammar, shows how to get grammatical information from text. This information could be useful in determining relationships between different entities mentioned in the text. We start by showing how to determine whether a noun is singular or plural. We then show how to get a dependency parse that shows relationships between words in a sentence. Then, we demonstrate how to get noun chunks, or nouns with their dependent words, such as adjectives. After that, we look at parsing out the subjects and objects of a sentence. Finally, we show how to use a regular-expression-style matcher to extract grammatical phrases in a sentence.
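For illustration, here is a minimal spaCy sketch of a dependency parse, noun chunks, and a regular-expression-style Matcher; the sentence and the adjective-plus-noun pattern are assumptions made for this example:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumped over the lazy dog.")

# Dependency parse: each word points to its syntactic head
for token in doc:
    print(token.text, token.dep_, token.head.text)

# Noun chunks: nouns together with their dependent words
print([chunk.text for chunk in doc.noun_chunks])

# A regular-expression-style pattern: one or more adjectives followed by a noun
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ", "OP": "+"}, {"POS": "NOUN"}]])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```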
Chapter 3, Representing Text – Capturing Semantics, looks at different ways of representing text for further processing in NLP models. Since computers cannot deal with words directly, we need to encode them in vector form. In order to demonstrate the effectiveness of different encoding methods, we first create a simple classifier and then use it with each encoding. We look at the following encoding methods: bag-of-words, N-gram model, TF-IDF, word embeddings, BERT, and OpenAI embeddings. We also show how to build your own bag-of-words model and demonstrate how to create a simple retrieval-augmented generation (RAG) solution.
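As a brief illustration of one of these encodings, the following sketch fits a TF-IDF model with scikit-learn on a toy corpus; the texts are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "Stock markets fell sharply on Monday.",
]

# Fit a TF-IDF vocabulary on the corpus and encode each text as a vector
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
print(vectors.shape)  # (number of texts, vocabulary size)
```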
Chapter 4, Classifying Texts, shows various ways of carrying out text classification, one of the most common NLP tasks. First, we show how to preprocess the dataset in order to prepare it for classification. Then, we demonstrate different classifiers: a rule-based classifier, an unsupervised K-means classifier, an SVM trained for classification, a spaCy model trained for text classification, and, finally, OpenAI GPT models used to classify texts.
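A minimal sketch of one such classifier, assuming a toy training set: TF-IDF features feeding a linear SVM in scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "great movie, loved every minute",
    "wonderful acting and a moving story",
    "terrible plot, a waste of time",
    "boring and far too long",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Encode the texts with TF-IDF and train a linear SVM on top
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)
print(classifier.predict(["what a wonderful film"]))
```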
Chapter 5, Getting Started with Information Extraction, shows how to extract information from text, another very important NLP task. We start off by using regular expressions for simple information extraction. We then look at how to use the Levenshtein distance to handle misspellings. Next, we show how to extract characteristic keywords from different texts. We look at how to extract named entities using spaCy and how to train a custom spaCy NER model. Finally, we show how to fine-tune a BERT NER model.
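For a flavor of both approaches, here is a minimal sketch that extracts e-mail addresses with a regular expression and named entities with spaCy's pre-trained English model; the sample text is invented:

```python
import re
import spacy

text = "Contact jane.doe@example.com. Apple was founded by Steve Jobs in California."

# Simple information extraction with a regular expression (e-mail addresses)
print(re.findall(r"[\w.+-]+@[\w.-]+\.\w+", text))

# Named entity recognition with spaCy's pre-trained English model
nlp = spacy.load("en_core_web_sm")
for ent in nlp(text).ents:
    print(ent.text, ent.label_)
```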
Chapter 6, Topic Modeling, shows how to determine the topics of texts using various unsupervised methods, including LDA, community detection with BERT embeddings, K-means clustering, and BERTopic. Finally, we use contextualized topic models, which work with multilingual models and inputs.
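As a minimal illustration of unsupervised topic modeling, the following sketch fits an LDA model with scikit-learn on a toy corpus; the texts and the choice of two topics are assumptions made for the example:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the economy and the stock markets",
    "football match and the final score",
    "interest rates and inflation data",
    "the team won the championship game",
]

# Bag-of-words counts feed the LDA topic model
counts = CountVectorizer(stop_words="english").fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)
print(lda.transform(counts))  # topic distribution for each document
```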
Chapter 7, Visualizing Text Data, focuses on using various tools to create informative visualizations of text data and processing results. We create graphical representations of the dependency parse, parts of speech, and named entities. We also create a confusion matrix plot and word clouds. Finally, we use pyLDAvis and BERTopic to visualize the topics discovered in a text corpus.
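A minimal sketch of one such visualization, using spaCy's built-in displacy renderer; the sentence is illustrative, and outside Jupyter, render() returns the markup as a string:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alphabet acquired a startup in London.")

# Render the dependency parse as SVG and the named entities as HTML
svg = displacy.render(doc, style="dep")
html = displacy.render(doc, style="ent")
```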
Chapter 8, Transformers and Their Applications, provides an introduction to Transformers. This chapter begins by demonstrating how to transform text into a format suitable for internal processing by a Transformer model. It then explores techniques for text classification using pre-trained Transformer models. Additionally, the chapter delves into text generation with Transformers, explaining how to tweak the generation parameters to produce coherent and natural-sounding text. Finally, it covers the application of Transformers in language translation.
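For a taste of what the chapter covers, here is a minimal sketch using the Hugging Face transformers pipeline API; the default sentiment model and the gpt2 checkpoint are assumptions made for this example and are downloaded on first use:

```python
from transformers import pipeline

# Text classification with a pre-trained transformer
# (the default sentiment model is downloaded on first use)
classifier = pipeline("sentiment-analysis")
print(classifier("This book is a pleasure to read."))

# Text generation; max_new_tokens and temperature are typical knobs to tweak
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=20))
```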
Chapter 9, Natural Language Understanding, covers NLP techniques that help infer the information contained in a piece of text. This chapter begins with a discussion on question-answering in both open and closed domains, followed by methods for answering questions from document sources using extractive and abstractive approaches. Subsequent sections cover text summarization and sentence entailment. The chapter concludes with explainability techniques, which demonstrate how models make classification decisions and how different parts of the text contribute to the assigned class labels.
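A minimal sketch of extractive question answering with a transformers pipeline; the question and context are invented, and the default model is downloaded on first use:

```python
from transformers import pipeline

# Extractive question answering: the answer is a span copied from the context
qa = pipeline("question-answering")
result = qa(
    question="Where is the Eiffel Tower located?",
    context="The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
)
print(result["answer"])
```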
Chapter 10, Generative AI and Large Language Models, introduces open-source Large Language Models (LLMs) such as Mistral and Llama, demonstrating how to use prompts to generate text based on simple human-defined requirements. It further explores techniques for generating Python code and SQL statements from natural language instructions. Finally, it presents methods for using a sophisticated closed-source LLM from OpenAI to orchestrate custom task agents. These agents collaborate to answer complex questions that require web searches and basic arithmetic to arrive at a final answer.
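As a hedged sketch of prompting an open-source LLM, the following assumes the mistralai/Mistral-7B-Instruct-v0.2 checkpoint and enough GPU memory to load it; it is an illustration, not the book's exact setup:

```python
from transformers import pipeline

# Illustrative only: assumes the Mistral weights can be downloaded
# and that enough GPU memory is available to load them
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Write a Python function that reverses a string."
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])
```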
To get the most out of this book
You will need an understanding of the Python programming language and how to manage and install packages for it. Knowledge of Jupyter Notebook would be useful, though it is not required. For package management, knowledge of poetry is recommended, though you can also make the examples work with pip. For recipes to be able to use GPUs (if present in the system), ensure that the latest GPU device drivers are installed, along with the CUDA/cuDNN dependencies.
| Software/hardware covered in the book | Operating system requirements |
| --- | --- |
| Python 3.10 | Windows, macOS, or Linux |
| Poetry | Windows, macOS, or Linux |
| Jupyter Notebook (optional) | Windows, macOS, or Linux |
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.