Data-level techniques for text classification
Data imbalance, in which some classes in a dataset are underrepresented, is not confined to image or structured data. In NLP, imbalanced datasets can produce biased models that perform well on the majority class but frequently misclassify the underrepresented ones. Several strategies have been devised to address this challenge.
In NLP, data augmentation can boost model performance, especially with limited training data. Table 7.3 categorizes the various data augmentation techniques for text data:
| Level | Method | Description | Example techniques |
| --- | --- | --- | --- |
| Character level | Noise | Introducing randomness at the character level | Jumbling characters |
| | Rule-based | ... |
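To make the character-level noise row concrete, here is a minimal sketch of jumbling characters, using only Python's standard library. The function names `jumble_word` and `jumble_text`, the `prob` and `seed` parameters, and the choice to keep the first and last character of each word fixed are all illustrative assumptions, not something the table prescribes. In an imbalanced setting, one plausible use is to apply such noise only to minority-class examples to create additional training samples.

```python
import random


def jumble_word(word, rng):
    """Shuffle the interior characters of a word.

    The first and last characters stay in place (an assumption made here
    so the noisy word remains recognizable). Words of three characters
    or fewer are returned unchanged.
    """
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def jumble_text(text, prob=0.3, seed=None):
    """Jumble each whitespace-separated word with probability `prob`.

    `prob` and `seed` are hypothetical parameters for this sketch.
    Note that splitting on whitespace collapses repeated spaces.
    """
    rng = random.Random(seed)
    words = text.split()
    noisy = [jumble_word(w, rng) if rng.random() < prob else w for w in words]
    return " ".join(noisy)


if __name__ == "__main__":
    sample = "the service at this restaurant was absolutely wonderful"
    # Generate a few noisy variants of one (hypothetical) minority-class example.
    for i in range(3):
        print(jumble_text(sample, prob=0.5, seed=i))
```

Passing a fixed `seed` makes the augmentation reproducible, which helps when comparing models trained on the same augmented set. Whether this kind of noise improves a classifier depends on the tokenizer and model, so it is worth validating on held-out data.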