Monitoring Deep Learning Endpoints in Production
Because development and production environments differ, it is difficult to guarantee the performance of deep learning (DL) models once they are deployed. Any deviation in model behavior must be detected within a reasonable time; otherwise, it can negatively affect downstream applications.
In this chapter, our goal is to explain existing solutions for monitoring DL model behavior in production. We will start by describing the benefits of monitoring and what it takes to keep the overall system running stably. Then, we will discuss popular tools for monitoring DL models and for alerting. Of the tools we introduce, we will get our hands dirty with CloudWatch. We will begin with the basics of CloudWatch and then discuss how to integrate it into endpoints running on SageMaker and on Elastic Kubernetes Service (EKS) clusters.
In this chapter, we’re going to cover the following...