Using active learning to reduce labeling time
Now that we've set up a labeling workflow, we need to think about scale. If our dataset has more than 5,000 records, Ground Truth can likely learn to label much of it for us. (Automatic labeling requires at least 1,250 records, but 5,000 or more is a good rule of thumb.) This happens in an iterative process, as shown in the following diagram:
When you create a labeling job with automatic labeling enabled, Ground Truth selects a random sample of the input data for manual labeling. If at least 90% of these items are labeled without error, Ground Truth splits the labeled data into training and validation sets, trains a model, and computes a confidence threshold on the validation set. It then attempts to label the remaining data automatically: labels that score above the threshold are accepted, while labels below it are referred to workers for human review. This process repeats, retraining the model on each batch of human-verified labels, until the entire dataset is labeled.
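If you're working with the API rather than the console, automatic labeling is enabled through the LabelingJobAlgorithmsConfig parameter of create_labeling_job. The following is a minimal sketch of such a call with boto3: the bucket names, IAM role, and work team ARN are hypothetical placeholders, and the algorithm specification and Lambda ARNs are region-specific values you should confirm against the Ground Truth documentation.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical placeholders: replace the bucket, role, and work team
# ARNs with values from your own account and region.
sm.create_labeling_job(
    LabelingJobName="image-classification-auto",
    LabelAttributeName="label",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-bucket/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/output/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://my-bucket/labels.json",
    # This parameter turns on automatic labeling. The ARN selects the
    # built-in model for the task type (image classification here);
    # check the documentation for the value in your region.
    LabelingJobAlgorithmsConfig={
        "LabelingJobAlgorithmSpecificationArn": (
            "arn:aws:sagemaker:us-east-1:027400017018:"
            "labeling-job-algorithm-specification/image-classification"
        )
    },
    # Human workers still label the initial sample and any items that
    # fall below the confidence threshold.
    HumanTaskConfig={
        "WorkteamArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:"
            "workteam/private-crowd/my-team"
        ),
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        "PreHumanTaskLambdaArn": (
            "arn:aws:lambda:us-east-1:432418664414:function:PRE-ImageMultiClass"
        ),
        "TaskTitle": "Classify images",
        "TaskDescription": "Select the category that best describes the image",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": (
                "arn:aws:lambda:us-east-1:432418664414:"
                "function:ACS-ImageMultiClass"
            )
        },
    },
)
```

Apart from LabelingJobAlgorithmsConfig, the call is identical to a fully manual labeling job, so you can add automatic labeling to an existing workflow without changing the rest of its configuration.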