Training your own Word2Vec model with CBOW and Skip-Gram
I will show you how to build the model step by step. The steps are as follows:
- Load the data.
- Preprocess the text.
- Build the model.
- Save and load the model.
- Use the model.
Here, text preprocessing consists only of tokenization. Figures 7.5 and 7.8 explained how text data is prepared as the model's inputs and outputs; Gensim handles that data preparation internally, so all we need to do is tokenize the text.
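As a quick illustration of what tokenization produces, here is a minimal pure-Python sketch that lowercases text and keeps alphabetic runs of at least two characters; Gensim's own `simple_preprocess` utility behaves similarly (this sketch is for illustration, not the exact tokenizer used later in the chapter):

```python
import re

def tokenize(text):
    # Lowercase the text and keep alphabetic runs of length >= 2,
    # mirroring what a typical word-level tokenizer produces.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= 2]

print(tokenize("The quick brown Fox jumps over 2 lazy dogs!"))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dogs']
```

Each news article becomes a list of tokens like this, and Gensim consumes a list of such token lists when training Word2Vec.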
Load the data
Let’s continue with the same AG News articles to save your learning time. The training set contains 120,000 news articles. Let’s load it:
import pandas as pd
import numpy as np

# Show full column contents (pandas >= 1.0 uses None instead of -1)
pd.set_option('display.max_colwidth', None)
path = "/content/gdrive/My Drive/data/gensim"
train = pd.read_csv(path + "/ag_news_train.csv")
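After loading, it is worth checking the shape and columns before preprocessing. Since we cannot run against the real CSV here, the following sketch uses a tiny hypothetical two-row stand-in; the actual column names in your `ag_news_train.csv` may differ, so treat them as assumptions:

```python
import pandas as pd

# Hypothetical stand-in for ag_news_train.csv; column names are
# assumptions and may differ from the real file.
train = pd.DataFrame({
    "Class Index": [3, 4],
    "Title": ["Wall St. Bears Claw Back", "Oil prices slip"],
    "Description": ["Short-sellers see green again.", "Crude futures ease."],
})

print(train.shape)          # number of (rows, columns)
print(list(train.columns))  # inspect column names before tokenizing
```

With the real file, `train.shape[0]` should report 120,000 rows; the text column you tokenize is whichever one holds the article body.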
Text preprocessing
Let’s apply the same preprocessing procedure. The tokenized...