Data augmentation using word2vec
One way to regularize a model and improve its performance is to train on more data. Collecting more data is not always easy or even possible, but generating synthetic data can be an affordable alternative. That is what we'll do in this recipe.
Getting ready
With word2vec embeddings, it is fairly easy to retrieve the words most similar to a given word within a vocabulary. This makes it possible to generate new, synthetic sentences whose semantic meaning stays close to the original.
In this recipe, we'll see how to generate such sentences using word2vec and a few parameters. As an example, we will apply the technique to a single sentence, and then propose how to integrate it into a full training pipeline.
The only required libraries are numpy and gensim, both of which can be installed with pip install numpy gensim.
How to do it…
Here are the steps to complete this recipe:
- The first step is to import the necessary libraries –...