Doc2Vec modeling with Gensim
To recap what I described in the Preface, this book uses a sample of AG’s corpus of news articles. This dataset is a smaller collection of news articles sampled from four topics: “World”, “Sports”, “Business”, and “Sci/Tech”. It has been used extensively in NLP modeling projects and is available through Kaggle, PyTorch, Hugging Face, and TensorFlow. The data has four classes: class “1” is “World”, class “2” is “Sports”, class “3” is “Business”, and class “4” is “Sci/Tech”. I will walk you through the following tasks:
- Text preprocessing for Doc2Vec
- Modeling
- Saving the model
- Saving the training data
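Before moving on, it can help to keep the four class codes described above in a small lookup table, so that numeric labels can be turned back into readable topic names whenever you inspect the data. The following is only a minimal sketch: the names `AG_NEWS_LABELS`, `label_name`, and `sample_rows` are my own for illustration, and the sample rows are made-up placeholders, not records from the dataset.

```python
# AG News class codes as described in the text (1-based integers).
AG_NEWS_LABELS = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


def label_name(class_id):
    """Return the topic name for a class id given as an int or a numeric string."""
    return AG_NEWS_LABELS[int(class_id)]


# Hypothetical (class id, text) rows, for illustration only.
sample_rows = [(3, "Example business headline"), ("4", "Example sci/tech headline")]

# Replace each numeric code with its topic name.
named_rows = [(label_name(class_id), text) for class_id, text in sample_rows]
print(named_rows)
```

Accepting both integers and numeric strings is a deliberate choice here, since class columns read from a CSV file may arrive as either type.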
So let’s start with the first step.
Text preprocessing for Doc2Vec
Let’s apply the preprocessing procedure to the dataset:
import pandas...