Using weakly supervised labels to improve IMDb sentiment analysis
Sentiment analysis of movie reviews on the IMDb website is a standard task for classification-type Natural Language Processing (NLP) models. We used this data in Chapter 4 to demonstrate transfer learning with GloVe and BERT embeddings. The IMDb dataset has 25,000 training examples and 25,000 testing examples. It also includes 50,000 unlabeled reviews. In previous attempts, we ignored these unlabeled data points. Adding more training data generally improves the accuracy of a model. However, hand labeling would be a time-consuming and expensive exercise. Instead, we'll use Snorkel-powered labeling functions to label these reviews automatically and see if the accuracy of the predictions can be improved on the testing set.
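To make the idea of a labeling function concrete, here is a minimal sketch in plain Python. It is an illustration only, not the labeling functions developed in this chapter: the keyword sets and the name `keyword_vote` are hypothetical, and the function is left undecorated so that it runs without any dependencies. In Snorkel, a function like this would be wrapped with the `@labeling_function` decorator from `snorkel.labeling` and would receive a data point rather than a raw string. The convention of returning -1 to abstain follows Snorkel.

```python
import re

# Label values; -1 for "abstain" follows Snorkel's convention.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical keyword lists, chosen only for illustration.
POSITIVE_WORDS = {"excellent", "wonderful", "brilliant"}
NEGATIVE_WORDS = {"terrible", "awful", "boring"}


def keyword_vote(review: str) -> int:
    """Vote on a review's sentiment using simple keyword matching."""
    # Lowercase the text and collect the distinct words it contains.
    words = set(re.findall(r"[a-z']+", review.lower()))
    pos_hits = len(words & POSITIVE_WORDS)
    neg_hits = len(words & NEGATIVE_WORDS)
    if pos_hits > neg_hits:
        return POSITIVE
    if neg_hits > pos_hits:
        return NEGATIVE
    # No keywords found, or an even split: decline to label.
    return ABSTAIN


if __name__ == "__main__":
    for text in ["An excellent, wonderful film.", "Boring and awful.", "It was a movie."]:
        print(text, "->", keyword_vote(text))
```

A single heuristic like this is noisy and abstains on most reviews. The weak supervision approach relies on combining many such imperfect votes into one label per review.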
Pre-processing the IMDb dataset
Previously, we used the tensorflow_datasets package to download and manage the dataset. However, we need lower-level access to the data to enable writing the labeling functions. Hence, the...