Training your own embeddings model
We can now train our own word2vec model on a corpus. For this task, we will use the top 20 Project Gutenberg books, which include The Adventures of Sherlock Holmes. We use several books because training a model on just one book produces suboptimal results; the more text we provide, the better the resulting word vectors.
Getting ready
You can download the dataset for this recipe from Kaggle: https://www.kaggle.com/currie32/project-gutenbergs-top-20-books. The dataset includes files in RTF format, so you will have to save them as text. We will use the same package, gensim, to train our custom model.
We will use the pickle module to save the model to disk. It is part of the Python standard library, so no separate installation is required.
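As a quick illustration of how pickle persists a Python object, here is a minimal sketch; the file name `model.pkl` and the placeholder object are assumptions for demonstration, but the same `dump`/`load` calls work on a trained gensim model:

```python
import pickle

# Stand-in for a trained model object (hypothetical example data).
model = {"weights": [0.1, 0.2, 0.3]}

# Serialize the object to disk in binary mode.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Deserialize it back into an equivalent Python object.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # → True
```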
How to do it…
We will read in all 20 books and use the text to create a word2vec model. Make sure all the books are located in one directory. Let's get started:
- Import the necessary packages...