So far in this book, we have looked at AI as a supervised learning framework that learns from vast amounts of labeled data. For instance, if you are training an image classifier on a dataset such as MNIST, you need a label for each image indicating which digit it represents. Similarly, if you are training a machine translation system, you need a parallel corpus of aligned sentence pairs, where each pair consists of a sentence in a source language and its translation in a target language. Given such labeled data, it is possible to build an effective deep learning-based AI system today.
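To make the role of labels concrete, here is a minimal sketch of supervised classification using NumPy. It uses a synthetic stand-in for a dataset like MNIST (the tiny feature vectors and class centers are illustrative assumptions, not real MNIST data): note that the model can only be fit because every training example arrives paired with a label.

```python
import numpy as np

# Hypothetical stand-in for MNIST: each "image" is a small feature
# vector, and every training example carries a label (its digit class).
rng = np.random.default_rng(0)

def make_labeled_data(n_per_class, centers):
    """Generate (features, labels) pairs scattered around fixed class centers."""
    xs, ys = [], []
    for label, center in enumerate(centers):
        xs.append(center + 0.1 * rng.standard_normal((n_per_class, len(center))))
        ys.append(np.full(n_per_class, label))
    return np.vstack(xs), np.concatenate(ys)

centers = np.eye(3)                      # three toy "digit" classes
X_train, y_train = make_labeled_data(50, centers)

# Supervision enters here: the per-class centroids can only be computed
# because y_train tells us which class each example belongs to.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in range(3)])

def predict(x):
    """Assign x to the class with the nearest centroid."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
```

Without the `y_train` array, the centroid computation (and hence the classifier) is impossible, which is exactly the dependence on annotation discussed next.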
However, one of the core challenges that remains for the mass-scale deployment and industrialization of such systems is the requirement for high-quality labeled data. Obtaining raw data is cheap, but curating and annotating it is expensive because it requires manual intervention. One of the grand...