Understanding DL data challenges
In this section, we will discuss the data challenges at each stage of the DL life cycle, as illustrated in Figure 1.3. Essentially, DL is data-centric AI, unlike symbolic AI, where human knowledge can be encoded without large amounts of data. Data challenges in DL are pervasive across all stages of the full life cycle:
- Data collection/cleaning/annotation: One of DL's first successes began with ImageNet (https://www.image-net.org/), where millions of images were collected and annotated according to the English nouns in the WordNet database (https://wordnet.princeton.edu/). This led to the successful development of pretrained DL models for computer vision, such as VGG-NETS (https://pytorch.org/hub/pytorch_vision_vgg/), which perform state-of-the-art image classification and are widely used in industrial and business applications. The main challenge of this kind of large-scale data collection and annotation is unknown bias, which is hard to measure during the process (https://venturebeat.com/2020/11/03/researchers-show-that-computer-vision-algorithms-pretrained-on-imagenet-exhibit-multiple-distressing-biases/). Another example comes from the sales engagement platform Outreach (https://www.outreach.io/), where we may want to classify a potential buyer's sentiment. For instance, we might start by collecting email messages from 100 paying organizations to train a DL model. Later, we would need to collect email messages from more organizations, either to meet an accuracy requirement or to expand language coverage (for example, from English only to other languages such as German and French). These repeated iterations of data collection and annotation generate many datasets. There is a tendency to version a dataset simply by hardcoding a version number into its filename, as follows:
MyCoolAnnotatedData-v1.0.csv
MyCoolAnnotatedData-v2.0.csv
MyCoolAnnotatedData-v3.0.csv
MyCoolAnnotatedData-v4.0.csv
This seems to work until changes are required in one of the vX.0 datasets, for example, to correct annotation errors or to remove email messages because of customer churn. Also, what happens if we need to combine several datasets, or perform some data cleaning and transformation to train a new DL model? What if we need to use data augmentation to artificially generate some datasets? Evidently, simply changing the names of the files is neither scalable nor sustainable (a sketch of a content-based alternative appears after this list).
- Model development: We need to understand that any bias in the data we use to train or pretrain a DL model will be reflected in the model's predictions. While we do not focus on de-biasing data in this book, we must treat data versioning and data provenance as first-class artifacts when training and serving a DL model so that we can track all model experiments. When fine-tuning a pretrained model for our use cases, as we did earlier, we also need to track the version of the fine-tuning dataset we use. In our previous example, we fine-tuned a variant of the BERT model on the IMDb review data. While we did not care about the versioning or source of the data in that first example, it matters for a practical, real-world application. In summary, DL models need to be linked to a particular version of a dataset in a scalable way (a sketch of recording such a link appears after this list). We will provide solutions to this problem in this book.
- Model deployment and serving in production: This stage deploys the model into the production environment to serve online traffic. DL model serving latency is particularly important to collect at this stage, as it might lead you to adjust the hardware environment used for inference (a sketch of collecting per-request latency appears after this list).
- Model validation and A/B testing: The data we collect at this stage is mostly user behavior metrics in the online experimentation environment (https://www.slideshare.net/pavel/ab-testing-ai-global-artificial-intelligence-conference-2019). Online data traffic also needs to be characterized to understand whether the model's input differs statistically between offline and online experimentation (a sketch of such a check appears after this list). Only if the A/B test passes and validates that the model indeed works better than its previous version in terms of user behavior metrics do we roll it out to all users in production.
- Monitoring and feedback loops: At this stage, data needs to be collected continuously to detect data drift and concept drift. For example, in the buyer sentiment classification example discussed earlier, if buyers start to use terminology that was not encountered in the training data, the performance of the model could suffer (a sketch of a simple vocabulary-based drift signal appears after this list).
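To illustrate the data collection challenge, here is a minimal sketch that versions a dataset by its content rather than by its filename: a SHA-256 hash of the file is recorded in a simple JSON manifest alongside a human-readable label. The register_dataset helper and the manifest filename are illustrative assumptions, not the API of any particular tool:

```python
import hashlib
import json
from pathlib import Path

def content_hash(path: str, chunk_size: int = 8192) -> str:
    """Compute a SHA-256 hash of a dataset file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_dataset(path: str, label: str, manifest: str = "dataset_manifest.json") -> dict:
    """Append the dataset's hash and label to a JSON manifest (illustrative only)."""
    entry = {"path": path, "label": label, "sha256": content_hash(path)}
    manifest_path = Path(manifest)
    records = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    records.append(entry)
    manifest_path.write_text(json.dumps(records, indent=2))
    return entry

# Any edit to the file (fixing annotations, removing a churned customer's emails)
# yields a new hash, so every variant stays identifiable without new filenames:
# register_dataset("MyCoolAnnotatedData.csv", label="after-churn-cleanup")
```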
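For the model development stage, the following is a minimal sketch of linking a fine-tuning run to a specific dataset version, assuming MLflow is installed and a default local tracking store is acceptable. The parameter and tag names, the placeholder hash, and the placeholder metric are assumptions for illustration only:

```python
import mlflow

# Illustrative values; in practice these would come from the data-versioning step above.
dataset_path = "MyCoolAnnotatedData.csv"
dataset_sha256 = "9f86d081884c7d65..."  # truncated example hash

with mlflow.start_run(run_name="bert-sentiment-finetune"):
    # Record exactly which dataset version this model was trained on.
    mlflow.log_param("dataset_path", dataset_path)
    mlflow.set_tag("dataset_sha256", dataset_sha256)
    # ... fine-tuning code would run here ...
    mlflow.log_metric("val_accuracy", 0.91)  # placeholder value
```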
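For deployment and serving, the next sketch shows one simple way to collect per-request serving latency; model.predict is a stand-in for whatever inference entry point is actually deployed:

```python
import time

def timed_predict(model, inputs, latencies):
    """Run inference and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    outputs = model.predict(inputs)  # stand-in for the real inference call
    latencies.append((time.perf_counter() - start) * 1000.0)
    return outputs

def latency_summary(latencies):
    """Simple percentile summary that can guide hardware adjustments."""
    ordered = sorted(latencies)
    return {
        "p50_ms": ordered[len(ordered) // 2],
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
    }
```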
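For model validation and A/B testing, this sketch checks whether a single numeric input feature differs statistically between offline and online traffic, assuming SciPy is available; the feature values here are synthetic stand-ins:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one input feature observed offline vs. online.
rng = np.random.default_rng(42)
offline_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
online_feature = rng.normal(loc=0.3, scale=1.0, size=1000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the online
# input distribution differs from what the model saw during offline experimentation.
statistic, p_value = stats.ks_2samp(offline_feature, online_feature)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.4f}")
```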
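Finally, for monitoring and feedback loops, this sketch computes a naive vocabulary-based drift signal for the buyer sentiment example: the fraction of incoming tokens never seen in the training data. The whitespace tokenization, toy vocabulary, and alert threshold are deliberate simplifications:

```python
def out_of_vocab_rate(messages, training_vocab):
    """Fraction of tokens in new messages that never appeared in the training data."""
    tokens = [tok.lower() for msg in messages for tok in msg.split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in training_vocab)
    return unseen / len(tokens)

training_vocab = {"pricing", "demo", "contract", "renewal"}  # toy vocabulary
incoming = ["can we sync on the new usage-based pricing tiers?"]
rate = out_of_vocab_rate(incoming, training_vocab)
if rate > 0.3:  # arbitrary alert threshold, for illustration only
    print(f"Possible data drift: out-of-vocabulary rate = {rate:.2f}")
```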
In summary, data tracking and observability are major challenges in all stages of the DL life cycle.