Text augmentation strategies
We discussed augmentation strategies for computer vision problems extensively in the previous chapter. By contrast, similar approaches for textual data are less well explored (as evidenced by the fact that there is no single package comparable to albumentations). In this section, we demonstrate some of the possible approaches to augmenting text data.
Basic techniques
As usual, it is informative to examine the basic approaches first, focusing on random changes and synonym handling. A systematic study of the basic approaches is provided in Wei and Zou (2019) at https://arxiv.org/abs/1901.11196.
We begin with synonym replacement. Replacing certain words with their synonyms produces text that is close in meaning to the original, but slightly perturbed. The synonyms come from WordNet (see the project page at https://wordnet.princeton.edu/ if you are interested in more details):
def get_synonyms(word):
    synonyms = set()
    ...
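To make the idea of synonym replacement concrete without depending on an external lexical resource, the following is a minimal sketch that uses only the Python standard library. The TOY_SYNONYMS table and the replace_synonyms function are hypothetical names introduced here for illustration; they are not part of the original code, and a real pipeline would look synonyms up in WordNet rather than in a hand-written dictionary.

```python
import random

# Hypothetical, hand-written synonym table (an assumption for illustration);
# a real pipeline would query a lexical resource such as WordNet instead.
TOY_SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}


def replace_synonyms(sentence, n=1, synonyms=TOY_SYNONYMS, seed=None):
    """Replace up to n words that have an entry in the synonym table."""
    rng = random.Random(seed)
    words = sentence.split()
    # Positions of the words we know how to replace
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    # Pick the positions to perturb in random order
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


print(replace_synonyms("the quick dog is happy", n=1, seed=0))
```

The sketch splits on whitespace only, so punctuation attached to a word (for example "happy.") prevents a match; proper tokenization would be needed for real text. The parameter n controls how strongly the sentence is perturbed: small values keep the augmented text close to the original.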