Training your own embeddings model
We can now train our own word2vec model on a corpus. word2vec is a shallow neural network that learns to predict a word when given a sentence with that word blanked out; the vector representation of each word in the training vocabulary emerges as a byproduct of this training. For this task, we will continue using the Rotten Tomatoes reviews. The dataset is fairly small, so the resulting vectors will not be as good as those trained on a larger collection.
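The blanked-out-word objective described above corresponds to word2vec's CBOW (continuous bag-of-words) architecture; gensim also implements the skip-gram variant, which instead predicts the surrounding context from a word. A minimal sketch on a made-up toy corpus (not part of this recipe) illustrates the interface:

from gensim.models import Word2Vec

toy_corpus = [["the", "movie", "was", "great"], ["the", "film", "was", "dull"]]
# sg=0 selects CBOW (predict a word from its context); sg=1 would select skip-gram
model = Word2Vec(sentences=toy_corpus, vector_size=10, window=2, min_count=1, sg=0)
print(model.wv["movie"])  # the learned 10-dimensional vector for "movie"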
Getting ready
We will use the gensim package for this task. It should be installed as part of the poetry environment.
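If gensim is missing, it can be added with the following command, assuming a poetry-managed project:

poetry add gensim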
How to do it…
We will load the dataset, train the model on it, and then test how the model performs:
- Import the necessary packages and functions:
import gensim
from gensim.models import Word2Vec
from datasets import load_dataset
from gensim import utils
- Load the training data and check its length:
train_dataset = load_dataset("rotten_tomatoes", split="train")
print(len(train_dataset))
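- Tokenize the reviews and train the model. The following is a minimal sketch, assuming gensim's simple_preprocess for tokenization; the hyperparameter values are illustrative defaults rather than tuned choices:

# simple_preprocess lowercases each review and strips punctuation
sentences = [utils.simple_preprocess(review) for review in train_dataset["text"]]
# vector_size is the embedding dimensionality, window the context size on each
# side of the target word, and min_count drops words that occur only once
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2, workers=4)

- Test the model by asking for the nearest neighbors of a word from the review vocabulary (the query word movie here is an arbitrary choice):

print(model.wv.most_similar("movie", topn=5))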