Fundamentals of text embedding with NLTK and Gensim
In this section, we will go through the fundamentals of text embedding: tokenizing a book, embedding the tokens, and exploring the vector space we created.
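Before opening the notebook, the three steps can be previewed with a toy sketch that uses only the Python standard library. This is not the notebook's method, which relies on NLTK and Gensim: the corpus string, the regular-expression tokenizer, and the co-occurrence count vectors are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus; the notebook works on a full book instead.
text = "The cat sat on the mat. The dog sat on the rug."

# Step 1, tokenize: a simple regex stands in for a real tokenizer.
tokens = re.findall(r"[a-z]+", text.lower())

# Step 2, embed: describe each token by counting its neighbors
# within a small window (a sparse co-occurrence vector).
window = 2
vectors = defaultdict(Counter)
for i, token in enumerate(tokens):
    start = max(0, i - window)
    end = min(len(tokens), i + window + 1)
    for j in range(start, end):
        if j != i:
            vectors[token][tokens[j]] += 1

# Step 3, explore: cosine similarity compares two tokens in the vector space.
def cosine(a, b):
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print("cat vs dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat vs on: ", cosine(vectors["cat"], vectors["on"]))
```

In this toy corpus, "cat" and "dog" appear in similar contexts, so their vectors end up closer to each other than "cat" is to "on". Trained embeddings rest on the same intuition, with dense learned vectors in place of raw counts.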
Open Embedding_with_NLKT_Gensim.ipynb in the chapter directory of the GitHub repository.
We will first install the libraries we need.
Installing libraries
The program first installs the Natural Language Toolkit (NLTK):
!pip install --upgrade nltk -qq
import nltk
NLTK will take us down to the token level, as in Chapter 10, Investigating the Role of Tokenizers in Shaping Transformer Models.
We’ll use the punkt sentence tokenizer:
nltk.download('punkt')
The program installs gensim for the similarity tools:
!pip install gensim -qq
import gensim
print(gensim.__version__)
The output is the installed version, which may differ in your environment:
4.3.2
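Since the install commands do not pin versions, it can be useful to record which versions are present before going further. The following is a minimal sketch, assuming Python 3.8 or later; the helper name installed_version is hypothetical and is not part of the notebook. It reads package metadata instead of importing each library, so it also reports a missing package without raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Report the two libraries installed in this section.
for name in ("nltk", "gensim"):
    print(name, installed_version(name))
```

A None result for either name suggests that the corresponding pip command did not complete in the current environment.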
The first step is to read the file.
1. Reading the text file
The program downloads a file containing...