Managing risks
Deep learning systems are exposed to a multitude of risks from inception all the way to adoption. Usually, the people assigned to a deep learning project are responsible only for a specific stage of the machine learning life cycle, such as data preparation or model development. This can be detrimental when the work of one stage creates problems at a later stage, which often happens when the team members involved have little sense of the bigger picture. Risks in a deep learning system generally arise from interactions between stages of the machine learning life cycle and follow the principle of garbage in, garbage out. Making sure everyone building the system feels accountable for what the system as a whole eventually outputs, rather than just for their individual stage, is one of the foundational keys to managing risk in a machine learning system.
But what are the risks? Let’s start with something that can be handled way before anything tangible is made – something that happens when a use case is evaluated for worthiness.
Ethical and regulatory risks
Deep learning can be applied in practically any industry, but some of the hardest industries to get deep learning adopted in are highly regulated ones, such as the medical and financial industries. The regulations imposed on these industries ultimately determine what a deep learning system can or cannot do there. Regulations are mostly introduced by governments and most commonly involve ethical and legal considerations. In these highly regulated industries, audits may be conducted monthly, weekly, or even daily to make sure companies remain compliant with the regulations imposed. One of the main reasons certain industries are regulated more aggressively is that the repercussions of actions taken in them bear a heavy cost to the well-being of people or the country. Deep learning systems need to be built so that they comply with these regulations; otherwise, they risk being decommissioned, never being adopted at all, or, worst of all, attracting a heavy fine from regulators.
At times, deep learning models can perform far better than their human counterparts but, on the other hand, no deep learning model is perfect. Everybody knows that humans make mistakes; another reality we need to accept is that machines undoubtedly make mistakes too. The highest risk arises when humans trust machines so much that they hand them 100% of the decision-making. So, how can we account for these mistakes, and who will be responsible for them?
Let’s see what we humans do in situations without machines. When the stakes are high, important decisions always go through a hierarchy of approvals before they become final. These hierarchies of approvals signal the need to make important decisions accountable and reliable; the more approvals a decision collects, the more confident we can be in making it. Examples of important decisions that commonly require a hierarchy of approvals include hiring an employee, setting an insurance premium, or deciding whether to invest money in a certain business. With that context in mind, deep learning systems need similar workflows for obtaining approvals when a model makes a predictive decision in a high-stakes use case. These workflows can surface any form of insight, and an explanation of the predictions can make it easier for domain experts to judge the validity of the decisions. Adding this human touch makes the deep learning system considerably more ethical and trustworthy, enough to be part of a high-stakes decision workflow.
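The following is a minimal sketch of what such an approval gate could look like in code. The threshold, the review queue, and the function names are illustrative assumptions rather than a prescribed implementation; the idea is simply that low-confidence or high-stakes predictions are routed to a human reviewer instead of being acted on automatically.

```python
# A minimal sketch of a human-in-the-loop approval gate. The model output,
# threshold, and review queue are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Holds predictions that must be approved by a domain expert."""
    pending: list = field(default_factory=list)

    def submit(self, case_id, prediction, confidence):
        self.pending.append({"case_id": case_id,
                             "prediction": prediction,
                             "confidence": confidence})

def route_prediction(case_id, probabilities, queue, threshold=0.95):
    """Auto-accept only highly confident predictions; send everything else
    to a human reviewer for approval."""
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    if confidence >= threshold:
        return {"case_id": case_id, "decision": label, "approved_by": "auto"}
    queue.submit(case_id, label, confidence)
    return {"case_id": case_id, "decision": None, "approved_by": "pending_review"}

queue = ReviewQueue()
print(route_prediction("case-001", {"benign": 0.62, "malignant": 0.38}, queue))
print(route_prediction("case-002", {"benign": 0.02, "malignant": 0.98}, queue))
```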
Let’s take a medical industry use case – for instance, predicting lung disease from X-ray scans. From an ethical standpoint, it is not right for a deep learning model to have the complete power to strip away a person’s hope by predicting an extremely harmful disease such as end-stage lung cancer. If such an extreme disease is misdiagnosed, patients may grieve unnecessarily or spend an unnecessary amount of money on expensive tests to verify the claim. Having an approval workflow in the deep learning system that lets doctors use these results as an assistive method would resolve the ethical concerns of using an automated decisioning system.
Business context mismatch
Reiterating the previous point from the Defining success section, aligning the chosen deep learning input data and target label with the business context, and with how the target predictions are consumed, is what makes deep learning systems adoptable. The risk here is that the business value is either not properly defined or not properly matched by the deep learning system. Even when the target is appropriate for the business context, how the predictions are consumed holds the key to deriving value. Failing to match the business context in any way simply risks the rejection of the system.
Understanding the business context involves understanding who the targeted user groups are and who they are not. A key step in this process is documenting and building user group personas. User personas hold information about users’ workflows, history, statistics, and concerns, and should provide the context needed to build a system that aligns with the specific needs of potential users. Don’t be afraid to conduct market research and properly validate the needs of the targeted users while building these personas. After all, it takes a lot of effort to build a deep learning system, and it would be a shame to waste time building one that nobody wants to use.
Data collection and annotation risks
Your machine learning model will only be as good as the quality of your data. Deep learning requires a substantially larger quantity of data than traditional machine learning. Additionally, for new deep learning use cases, the data is very often not available off the shelf and must be manually collected and meticulously annotated. Making sure the data is collected and annotated in a way that upholds quality is a very hard job.
The strategies and methods that are used to collect and annotate data for deep learning vary across use cases. Sometimes, the data collection and annotation process can happen at the same time. This can happen either when there are natural labels or when a label is prespecified before data is collected:
- Natural labels are labels that become available naturally after some time passes. For example, the gender of a baby, revealed at birth, can serve as the label for an earlier ultrasound image; another use case is the eventual sale price of a house as the label, with a range of images of the property as the input data.
- Prespecified labels are labels that are determined before the input data is collected. For example, this can be the gender label attached to speech data collected in a controlled environment just for building a machine learning model, or the age a person reports in a survey before a facial photo is taken for the same purpose.
These two types of labels are relatively safe from risks since the labels are likely to be accurate.
A final method of data collection and annotation, and one that poses a substantial amount of risk, is when labels are annotated after the data has been collected. Post-annotation requires some form of domain knowledge about the characteristics of the desired label and does not always yield the ground truth due to human evaluation errors. Unconscious labeling errors happen when the labeler simply makes a mistake by accident. Conscious labeling errors, on the other hand, happen when the labeler actively decides to assign a wrong label with intent. This intent can stem from a strongly or loosely rooted belief tied to a certain pattern in the data, or simply be an error made purposely for some reason. The risk here is the potential for labels to be mistakenly annotated. Fortunately, there are a few strategies that can be used to mitigate these risks; these will be introduced in the following paragraphs.
Unconscious labeling errors are hard to circumvent as humans are susceptible to a certain amount of error. However, different people have varying levels of focus, and careful selection of labelers can be a viable option. If labelers are paid specifically to annotate your data, another viable option is to periodically slip data that has already been labeled in secret in among the data that still needs labeling. This strategy allows you to evaluate the performance of individual labelers and compensate them according to the accuracy of their annotation work (yes, we can use the same metrics that are used to evaluate machine learning models). As a positive side effect, labelers are also incentivized to perform better in their annotation work.
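As a concrete illustration, here is a minimal sketch of scoring labelers against gold-standard items that were secretly mixed into their queue. The data structures and the use of scikit-learn’s accuracy_score are assumptions for illustration; any classification metric could be substituted.

```python
# A minimal sketch of scoring labelers against "secret" gold-standard items
# that were mixed into their labeling queue. The data structures below are
# illustrative assumptions.
from sklearn.metrics import accuracy_score

# item_id -> trusted label, known only to the project team
gold_labels = {"img_014": "cat", "img_207": "dog", "img_391": "cat"}

# labeler -> {item_id: submitted label}
submissions = {
    "labeler_a": {"img_014": "cat", "img_207": "dog", "img_391": "dog"},
    "labeler_b": {"img_014": "cat", "img_207": "dog", "img_391": "cat"},
}

for labeler, labels in submissions.items():
    shared = [item for item in gold_labels if item in labels]
    y_true = [gold_labels[item] for item in shared]
    y_pred = [labels[item] for item in shared]
    print(f"{labeler}: accuracy on gold items = {accuracy_score(y_true, y_pred):.2f}")
```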
Conscious labeling errors are the most dangerous risk here because of their potential to affect the quality of the entire dataset. When a wrong consensus about the pattern of data associated with a target label is applied throughout the labeling process, the problem with the labels won’t be discovered until the later stages of the machine learning life cycle. Machine learning models are only equipped with techniques to learn the patterns required to map the input data to the provided target labels, and they will likely perform very well even when those labels are wrong. As long as there is a pattern that the labeler enforced during labeling, right or wrong, machine learning models will do their best to learn the patterns needed to perform that exact input-to-output mapping. So, how do we mitigate this risk? Making sure labeling ideologies are defined properly with the help of domain experts plays an important part in the mitigation process, but this strategy alone has varying levels of effectiveness, depending on the number of domain experts involved, their level of expertise, the number of labelers, and the degree of anomalies in, or quality of, the collected data.
Even domain experts can be wrong at times. Take doctors as an example – how many times have you heard of a doctor giving the wrong prescription or the wrong medical assessment? How about the times when you misheard someone’s speech and had to guess what the other person just said? In that last example, the domain expert is you, and the domain is speech comprehension. Additionally, when more than one labeler is involved and a labeling team forms, one of the most prominent risks is a mismatch between the labeling ideologies different labelers prefer. Sometimes, this happens because a certain label can exist in several digital variations, or because confusing patterns deter the analytical capabilities of the labeler. Sometimes, it happens because of the inherent bias the labeler or domain expert holds toward a certain label or input data. Bias in the data creates bias in the model, which in turn creates ethical issues that erode trust in the decisions of a machine learning model. When there is a lack of trust in a machine learning model, the project will fail to be adopted and will lose its business value. The topics of bias, fairness, and trust will be discussed more extensively in Chapter 13, Exploring Bias and Fairness, which elaborates on their origins and ways to mitigate them.
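One common way to surface such mismatches is to measure inter-annotator agreement. The following minimal sketch uses Cohen’s kappa from scikit-learn on two labelers’ annotations of the same items; the labels shown are illustrative.

```python
# A minimal sketch of measuring agreement between two labelers with
# Cohen's kappa; low agreement flags mismatched labeling ideologies.
# The label sequences below are illustrative.
from sklearn.metrics import cohen_kappa_score

labeler_a = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
labeler_b = ["positive", "negative", "positive", "positive", "negative", "negative"]

kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~1.0 = strong agreement, ~0.0 = chance-level
```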
A dataset whose labels are a mixture of correct and incorrect annotations is said to have noisy labels. Fortunately, there are methods in deep learning, such as weakly supervised learning, that can help you work around noisy labels. However, remember that it’s always better to fix the issue at the source – when the data is collected and annotated – rather than afterward. Let’s dive into another strategy we can use to make the labeling process less risky. The data labeling process for deep learning projects usually involves a software tool that allows the specific desired labels to be annotated. Software tools make annotation work faster and easier, and using a collaboration-based labeling tool can make annotation work more error-proof. A good collaboration-based labeling tool allows labelers to align their findings about the data with each other in some way, which promotes a common labeling ideology. For instance, automatically alerting the entire labeling team when an easily misinterpreted pattern is identified can prevent more data from being mislabeled and ensure that all related, previously labeled data gets re-reviewed.
As a final point here – not a risk – unlabeled data presents huge untapped potential for machine learning. Although it lacks the specific labels needed to achieve a particular goal, there are inherent relationships within the data that can be learned. In Chapter 9, Exploring Unsupervised Deep Learning, we will explore how to use unsupervised learning to leverage this data as a foundation for subsequent supervised learning, a workflow more widely known as semi-supervised learning.
Next, we will go through the risks related to security in the consumption of data by machine learning models.
Data security risk
Security, in the context of machine learning, relates to preventing the unauthorized usage of data, protecting the privacy of data, and preventing unwanted events or attacks relating to the usage of data. The risk is that compromised data security can result in failure to comply with regulatory standards, a deterioration of the model’s performance, or the corruption and disruption of business processes. In this section, we will go through four types of data security risk, namely sensitive data handling, data licensing, software licensing, and adversarial attacks.
Sensitive data handling
Some data is more sensitive than other data and can be linked to regulatory risks. Sensitive data gives rise to data privacy regulations that govern the usage of personal data in different jurisdictions. The specifics vary, but these regulations generally revolve around lawfully protecting rights over the usage of personal data, requiring consent for any actions taken on such data, and setting the terms under which it may be used. Examples of such regulations are the General Data Protection Regulation (GDPR), which covers the European Union; the Personal Data Protection Act (PDPA), which covers Singapore; the California Consumer Privacy Act (CCPA), which covers only the state of California in the United States; and the Consumer Data Protection Act (CDPA), which covers the state of Virginia in the United States. This means that you can’t just collect data that is categorized as personal, annotate it, build a model, and deploy it without adhering to these regulations, as doing so would be considered a crime in some of these jurisdictions. Other than requesting consent, one of the common methods used to mitigate this risk is to anonymize the data so that no single person can be identified from it. However, anonymization has to be done reliably so that key general information is retained while any possibility of reconstructing a person’s identity is reliably removed. Making sure sensitive and personal data is handled properly goes a long way toward building trust in the decisions of a machine learning model. Always exercise extreme caution when handling sensitive and personal data to ensure the longevity of your machine learning project.
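As an illustration, the following minimal sketch pseudonymizes direct identifiers with a salted hash before the data is used for model building. Note the hedges: salted hashing is pseudonymization rather than full anonymization, it does not by itself guarantee compliance with any specific regulation, and the field names and salt handling shown here are assumptions for illustration only.

```python
# A minimal sketch of pseudonymizing direct identifiers before data is used
# for model building. Salted hashing is pseudonymization, not full
# anonymization, and does not by itself guarantee regulatory compliance;
# the field names and salt handling are illustrative assumptions.
import hashlib
import os

SALT = os.environ.get("PSEUDONYM_SALT", "replace-with-a-secret-salt")

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted, irreversible hash."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

record = {"patient_name": "Jane Doe", "national_id": "S1234567D",
          "age": 42, "diagnosis_code": "J18.9"}

safe_record = {
    "patient_ref": pseudonymize(record["national_id"]),  # stable join key
    "age": record["age"],                                 # retained general info
    "diagnosis_code": record["diagnosis_code"],
}
print(safe_record)
```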
Data and software licensing
Deep learning projects in particular require a lot of data to build a good-quality model. Publicly available datasets help shorten the time and complexity of a project by partially removing the cost and time needed to collect and label data. However, most datasets, like software, come with licenses that govern how the associated data can be used. The most important criterion of a data license, with respect to machine learning models for business problems, is whether it allows commercial usage. As most business use cases are considered commercial due to the profits derived from them, datasets with a license that prevents commercial usage cannot be used. Examples of data licenses with terms that prevent commercial usage are all derivatives of the Creative Commons Attribution-NonCommercial (CC-BY-NC) license. Similarly, open source software also poses a risk to deep learning projects. Always triple-check the licenses before using any publicly available data or software in your project. Using data or code whose terms prevent commercial usage in a commercial business project puts your project at risk of being fined or sued for license infringement.
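As a first-pass aid, the following minimal sketch lists installed Python packages whose declared license metadata suggests non-commercial terms. Package metadata can be missing or imprecise, so treat this as an assumption-laden starting point, not a substitute for reading the actual license text or seeking legal review.

```python
# A minimal sketch of auditing the declared licenses of installed Python
# packages. Metadata can be missing or imprecise, so this is only a first
# pass, not a substitute for reading the actual license terms.
from importlib.metadata import distributions

flagged_terms = ("non-commercial", "noncommercial", "cc-by-nc")

for dist in distributions():
    name = dist.metadata.get("Name", "unknown")
    license_field = (dist.metadata.get("License") or "UNKNOWN").strip()
    if any(term in license_field.lower() for term in flagged_terms):
        print(f"REVIEW: {name} declares license: {license_field}")
```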
Adversarial attacks
When an application of machine learning is widely known, it exposes itself to targeted attacks meant to maliciously derail and manipulate the decisions of a model. This brings us to a type of attack that can affect a deployed deep learning model, called an adversarial attack. Adversarial attacks involve manipulating the input data in a certain way to affect the predictions of the machine learning model. The most common adversarial attacks rely on adversarial examples – inputs modified from the actual input data so that they appear to maintain their legitimacy as input data but are capable of skewing the model’s prediction. The level of risk varies across use cases and depends on how much a user can interact with the system that eventually passes the input data to the machine learning model. One of the most widely known adversarial examples for a deep learning model is an optimized image that looks like random color noise and, when overlaid on another image to perturb its pixel values, generates an adversarial example. The perturbed image looks visually untouched to a human but is capable of producing an erroneous misclassification. The following figure shows an example of this perturbation, taken from a tutorial in Chapter 14, Analyzing Adversarial Performance. It depicts a neural network called ResNet50, trained to classify facial identity, that correctly predicts the identity of a given facial image. However, when the image is combined with a noise image array that was strategically and automatically generated using only access to the predicted class probabilities, the model mispredicts the identity, even though the combined image looks visually identical to the original:
Figure 1.13 – Example of a potential image-based adversarial attack
Theoretically, adversarial attacks can be made using any type of data and are not limited to images. As the stakes of machine learning use cases grow higher, it becomes more likely that attackers will be willing to invest in research teams that produce novel adversarial examples.
Some adversarial examples, and the methods that generate them, rely on access to the deep learning model itself so that a targeted adversarial example can be crafted. This makes it considerably harder for a potential attacker to confuse a model created by someone else without access to that model. However, many businesses use publicly available pre-trained models as-is and apply transfer learning to reduce the amount of work needed to satisfy a business use case. Any publicly available pre-trained model is also available to attackers, allowing them to build adversarial examples targeted at that specific model. Examples of such pre-trained models include all the publicly available ImageNet pre-trained convolutional neural network models and weights.
So, how do we attempt to mitigate this risk?
One of the methods we can use to mitigate the risk of an adversarial attack is to train deep learning models on adversarial examples generated by the known set of methods from public research. By training with known adversarial examples, the model learns to ignore the adversarial information and remains unfazed by such examples at validation and inference time. Evaluating different variations of adversarial examples also helps set expectations for when the model will fail.
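The following is a minimal sketch of this idea using the fast gradient sign method (FGSM), one of the earliest published attack methods, in PyTorch. The toy model, epsilon value, and 50/50 clean/adversarial loss mix are illustrative assumptions, not the exact recipe used later in this book.

```python
# A minimal sketch of adversarial training with FGSM in PyTorch, assuming an
# image classifier with inputs in [0, 1]; epsilon, the model, and the
# optimizer settings are illustrative, not a prescribed recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_example(model, images, labels, epsilon=0.03):
    """Generate FGSM adversarial examples by perturbing inputs along the
    sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, images, labels, epsilon=0.03):
    """One training step on a mix of clean and adversarial examples."""
    model.train()
    adv_images = fgsm_example(model, images, labels, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(images), labels) \
         + 0.5 * F.cross_entropy(model(adv_images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny smoke test with a toy model and random data
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images, labels = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(adversarial_training_step(model, optimizer, images, labels))
```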
In this section, we took an in-depth look at the security issues involved in handling and consuming data for machine learning purposes. In Chapter 14, Analyzing Adversarial Performance, we will go through a more in-depth practical evaluation of adversarial attack techniques for different data modalities and how to mitigate them in the context of deep learning. Next, we will dive into another category of risk that lies at the core of the model development process.
Overfitting and underfitting
During the model development process – the process of training, validating, and testing a machine learning model – one of the most foundational risks to handle is overfitting or underfitting the model.
Overfitting is an event where a machine learning model becomes so biased toward the provided training examples and the patterns learned from them that it can only distinguish examples that belong to the training dataset while failing on examples in the validation and testing datasets.
Underfitting, on the other hand, is an event where a machine learning model fails to capture the patterns of the provided training, validation, and testing datasets.
Learning a generalizable input-to-output mapping is the key to building a valuable and usable machine learning model. However, there is no silver bullet for achieving a nicely fitted model, and very often it requires iterating between the data preparation, model development, and deliver model insights stages.
Here are some tips to prevent overfitting and ensure generalization in the context of deep learning:
- Augment your data in expectation of the eventual deployment conditions (a minimal sketch follows this list)
- Collect enough data to cover every possible variation of your targeted label
- Collaborate with domain experts and understand key indicators and patterns
- Use cross-validation methods to ensure the model gets evaluated fairly on unseen data
- Use simpler neural network architectures whenever possible
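Here is the augmentation sketch referenced in the first tip, using torchvision transforms for an image use case. The specific transforms and parameters are illustrative assumptions; the point is to mirror the variations you expect at deployment time, such as lighting changes, blur, and different framing.

```python
# A minimal sketch of an augmentation pipeline with torchvision, assuming an
# image use case; the transforms and their parameters are illustrative and
# should mirror the conditions expected at deployment time.
from torchvision import transforms

train_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # varied framing/zoom
    transforms.RandomHorizontalFlip(),                      # mirrored viewpoints
    transforms.ColorJitter(brightness=0.3, contrast=0.3),   # lighting changes
    transforms.GaussianBlur(kernel_size=3),                 # slight camera blur
    transforms.ToTensor(),
])

# Typically passed to the training dataset, for example:
# dataset = torchvision.datasets.ImageFolder("train/", transform=train_augmentation)
```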
Here are some tips to prevent underfitting:
- Evaluate a variety of different models
- Collaborate with domain experts and understand key indicators and patterns
- Make sure the data is clean and contains as few errors as possible
- Make sure there is enough data
- Start with a small set of data inputs when building your model and build your way up to more complex data to ensure the model can fit your data appropriately (a related sanity check is sketched after this list)
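Related to the last tip, a common sanity check is to confirm that the model can fit a tiny subset of the data before scaling up; if it cannot drive the loss down on a handful of examples, it is likely underfitting because it is too simple, buggy, or fed unclean data. The toy model and random data below are illustrative assumptions.

```python
# A minimal sketch of the "fit a tiny subset first" sanity check: the loss on
# a handful of examples should approach zero if the model has enough capacity
# and the training code is correct. The toy model and data are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tiny_x, tiny_y = torch.randn(16, 20), torch.randint(0, 2, (16,))  # a tiny subset

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(tiny_x), tiny_y)
    loss.backward()
    optimizer.step()

print(f"final loss on the tiny subset: {loss.item():.4f}")  # should approach 0
```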
In this section, we have discussed issues that arise while building a model. Next, we will discuss a type of risk that affects the built model after it has been trained.
Model consistency risk
One of the major traits of the machine learning process is that it is cyclical. Models get retrained constantly in the hope of finding better settings. These settings can be a different data partitioning method, a different data transformation method, a different model, or the same model with different model settings. Often, the models that are built need to be compared against each other fairly and equitably. Model consistency is the feature that ensures a fair comparison can be made between different model versions and performance metrics. Every single process before a model is obtained needs to be consistent so that when anybody executes the same processes with the same settings, the same model can be obtained and reproduced. We should reiterate that even when some processes require randomness while building the model, that randomness needs to be deterministic. This ensures the only difference between setups is the targeted settings and nothing else.
Model consistency doesn’t stop at the reproducibility of a model – it extends to the consistency of its predictions. The predictions should be the same when produced with a different batch size setting, and the same data input should always produce the same predictions. Inconsistent model predictions are a major red flag signaling that whatever the model produces is not representative of what you will get during deployment, and any derived model insights would be misinformation.
To combat model consistency risk, always make sure your code produces consistent results by seeding the random number generators whenever possible. Always validate model consistency, either manually or automatically, in the model development stage after you build the model.
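The following minimal sketch seeds the common random number generators and checks that predictions are consistent across batch sizes, assuming a PyTorch model. The exact set of seeds and determinism flags you need depends on your libraries and hardware, so treat this as a starting point.

```python
# A minimal sketch of seeding the relevant random number generators and
# checking prediction consistency across batch sizes. Depending on your
# hardware and libraries, additional flags or seeds may be required.
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(42)

model = torch.nn.Linear(8, 3)
model.eval()
inputs = torch.randn(32, 8)

with torch.no_grad():
    full_batch = model(inputs)                                       # batch size 32
    small_batches = torch.cat([model(b) for b in inputs.split(4)])   # batch size 4

# The same inputs should produce the same predictions regardless of batching
print(torch.allclose(full_batch, small_batches, atol=1e-6))
```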
Model degradation risk
Once you have built a model, verified it, and demonstrated its impact on the business, you take it into the model deployment stage and position it for use. One mistake is to think that this is the end of the deep learning project and that you can take your hands off and just let the model do its job. Sadly, most machine learning models degrade over time, to a degree that depends on how well your model generalized during the model development stage to the data available in the wild. A common scenario when a model gets deployed is that, initially, the model’s performance and the characteristics of the data received during deployment stay the same, but both change over time. Time has the potential to change the conditions of the environment and anything in the world. Machines grow old, seasons change, and people change; expecting that the conditions and variables surrounding the machine learning model will change allows you to make sure models stay relevant and impactful to the business.
How a model degrades can be grouped into three categories, namely data drift, concept drift, and model drift. Conceptually, drift can be pictured as a boat slowly drifting away from its ideal position. In the case of machine learning projects, instead of a boat, it’s the data or the model drifting away from its original perceived behavior or pattern. Let’s briefly go through these types of drift.
Data drift is a form of degradation associated with the input data the model needs to produce a prediction. When a deployed model experiences data drift, it means that the data received during deployment does not belong to the inherent distribution of the data the machine learning model was trained and validated on. If there were a way to validate the model on the new data supplied during deployment, data drift would potentially cause a shift from the original metric performance obtained during model validation. An example of data drift in the context of a deep learning model is a use case requiring the prediction of human actions outdoors: the original data consisted of people in summer clothing during the summer season, and now that winter has arrived, people are wearing winter clothing instead.
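For a single numeric feature, one simple way to flag data drift is a two-sample statistical test between a reference window and a recent deployment window. The following sketch uses SciPy’s Kolmogorov-Smirnov test; the feature, window sizes, and threshold are illustrative, and image or text inputs typically require drift checks on derived statistics or embeddings instead.

```python
# A minimal sketch of flagging data drift on a numeric feature with a
# two-sample Kolmogorov-Smirnov test; the windows and threshold are
# illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference window
deployment_feature = rng.normal(loc=0.6, scale=1.0, size=1_000)  # recent window

statistic, p_value = ks_2samp(training_feature, deployment_feature)
if p_value < 0.01:
    print(f"Possible data drift (KS statistic={statistic:.3f}, p={p_value:.2e})")
```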
Concept drift is a form of degradation associated with a change in how the input data relates to the target output data. In the planning stage, domain experts and machine learning practitioners collaborate to define input variables that can affect the targeted output data. This defined input and output setup is subsequently used to build a machine learning model. Sometimes, however, not all of the context that can affect the targeted output data is included as input variables, either because that data is unavailable or because of a lack of domain knowledge. This introduces a dependence on the conditions of the missing context associated with the data collected for machine learning. When the conditions of that missing context drift away from the baseline values present in the training and validation data, concept drift occurs, rendering the original concept irrelevant or shifted. In simpler terms, the same input no longer maps to the same output as in the training data. In the context of deep learning, consider a sentiment classification use case based on textual data. Suppose a comment or speech is graded as negative, neutral, or positive based on both the jurisdiction and the text itself – in some jurisdictions, people have different thresholds for what counts as negative, neutral, or positive and grade things differently. Training the model with only textual data would not allow it to generalize accurately across jurisdictions, so it would face concept drift whenever it is deployed in another jurisdiction.
Lastly, model drift is a form of degradation associated with operational, easily measurable metrics. Factors that fall under model drift include model latency, model throughput, and model error rates. Model drift metrics are generally easier to measure and track than the other two types of drift.
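A minimal sketch of tracking such operational metrics over a rolling window is shown below; the window size and alert thresholds are illustrative assumptions.

```python
# A minimal sketch of tracking operational model-drift metrics (latency and
# error rate) over a rolling window; the window size and thresholds are
# illustrative assumptions.
from collections import deque

class ModelDriftMonitor:
    def __init__(self, window_size=1000, latency_budget_ms=200.0, max_error_rate=0.02):
        self.latencies = deque(maxlen=window_size)
        self.errors = deque(maxlen=window_size)
        self.latency_budget_ms = latency_budget_ms
        self.max_error_rate = max_error_rate

    def record(self, latency_ms: float, failed: bool):
        self.latencies.append(latency_ms)
        self.errors.append(1 if failed else 0)

    def alerts(self):
        if not self.latencies:
            return {}
        avg_latency = sum(self.latencies) / len(self.latencies)
        error_rate = sum(self.errors) / len(self.errors)
        return {
            "latency_alert": avg_latency > self.latency_budget_ms,
            "error_rate_alert": error_rate > self.max_error_rate,
        }

monitor = ModelDriftMonitor()
monitor.record(latency_ms=180.0, failed=False)
monitor.record(latency_ms=320.0, failed=True)
print(monitor.alerts())
```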
One of the best workflows for mitigating these risks is to track metrics for all three types of drift and to have a clear path for building a new machine learning model, a process I call a drift reset. Now that we’ve covered a brief overview of model degradation, Chapter 16, Governing Deep Learning Models, and Chapter 17, Managing Drift Effectively in a Dynamic Environment, will go more in-depth into these risks and discuss practical ways to mitigate them in the context of deep learning.