Weakly supervised labeling with Snorkel
The IMDb dataset has 50,000 unlabeled reviews. This is double the size of the training set, which has 25,000 labeled reviews. As explained in the previous section, we have reserved 23,000 records from the training data, in addition to the unsupervised set, for weakly supervised labeling. Labeling records in Snorkel is performed via labeling functions. Each labeling function can return one of the possible labels or abstain from labeling. All the code for this section can be found in the notebook titled snorkel-labeling.ipynb. Since this is a binary classification problem, corresponding constants are defined, followed by a sample labeling function:
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
from snorkel.labeling.lf import labeling_function
@labeling_function()
def time_waste(x):
    if not isinstance(x.review, str):
        return ABSTAIN
    ex1 = "time waste"
    ex2 = "waste of time"
    if ex1 in x.review.lower() or ex2...
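To make the label-or-abstain convention concrete, here is a minimal sketch in plain Python that does not depend on Snorkel. The function name excellent_keyword, the phrase "excellent", and the sample rows are hypothetical and chosen only for illustration; they are not taken from the notebook. In Snorkel, a function like this would additionally be wrapped with the @labeling_function() decorator shown above.

```python
from types import SimpleNamespace

# Same label constants as in the text.
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1


def excellent_keyword(x):
    """Hypothetical rule: vote POSITIVE on a phrase match, otherwise abstain."""
    # Guard against missing or non-string reviews, as the sample function does.
    if not isinstance(x.review, str):
        return ABSTAIN
    # Case-insensitive substring match on an illustrative phrase.
    if "excellent" in x.review.lower():
        return POSITIVE
    # No evidence either way: abstain rather than guess NEGATIVE.
    return ABSTAIN


# Stand-in records exposing a .review attribute, like a DataFrame row would.
rows = [
    SimpleNamespace(review="An EXCELLENT film from start to finish."),
    SimpleNamespace(review="It was fine, I suppose."),
    SimpleNamespace(review=None),
]

votes = [excellent_keyword(r) for r in rows]
print(votes)
```

Note that the function never returns NEGATIVE: the absence of a positive phrase is not evidence of a negative review, so abstaining leaves that decision to other labeling functions.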