Creating your own embedding using gensim
We will create an embedding using a small text corpus called text8. The text8 dataset is the first 10^8 bytes of the Large Text Compression Benchmark, which consists of the first 10^9 bytes of English Wikipedia [7]. The text8 dataset is accessible from within the gensim API as an iterable of tokens, essentially a list of tokenized sentences. To download the text8 corpus, create a Word2Vec model from it, and save it for later use, run the following few lines of code (available in create_embedding_with_text8.py in the source code for this chapter):
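import os
import gensim.downloader as api
from gensim.models import Word2Vec
# Download text8 on first use; later calls load it from the local gensim cache
dataset = api.load("text8")
# Train with default parameters
model = Word2Vec(dataset)
# Create the output directory if needed, then save the model
os.makedirs("data", exist_ok=True)
model.save("data/text8-word2vec.bin")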
This will train a Word2Vec model on the text8 dataset and save it as a binary file. The Word2Vec constructor accepts many parameters, but we will just use the defaults. In this case, it trains a CBOW model (sg=0) with...
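To make the "iterable of tokens" input format described above concrete, here is a minimal sketch in plain Python that needs neither gensim nor a download. It assumes a corpus like text8 is a flat stream of lowercase words without sentence boundaries, and groups that stream into fixed-length token lists. The helper name chunk_tokens, the sample string, and the chunk length of 5 are hypothetical choices made for illustration, not part of the gensim API or of the chapter's source code.

```python
from typing import Iterable, Iterator, List


def chunk_tokens(tokens: Iterable[str], max_len: int = 5) -> Iterator[List[str]]:
    """Group a flat token stream into lists of at most max_len tokens.

    Hypothetical helper for illustration; max_len=5 is an arbitrary choice.
    """
    chunk: List[str] = []
    for token in tokens:
        chunk.append(token)
        if len(chunk) == max_len:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


# A tiny stand-in corpus: lowercase words, no punctuation, no sentence breaks.
raw_text = (
    "anarchism originated as a term of abuse first used "
    "against early working class radicals"
)

# Each element of `sentences` is a list of string tokens.
sentences = list(chunk_tokens(raw_text.split(), max_len=5))

for sentence in sentences:
    print(sentence)
```

A list shaped like sentences (a list of lists of string tokens) is the kind of input that Word2Vec accepts in place of dataset in the listing above, although a corpus this small would not yield meaningful vectors.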