Infrastructure monitoring and alerting
From an infrastructure perspective, the main dimensions of monitoring in ML systems are the same as those in traditional software systems.
To illustrate this, we will use the monitoring and alerting tools available in AWS CloudWatch and SageMaker to set up an example monitoring and alerting infrastructure. The same mechanism can be set up with tools such as Grafana and Prometheus, for on-premises and cloud deployments alike. These monitoring tools achieve similar goals and provide comparable features, so choose the one that best fits your environment and cloud provider.
AWS CloudWatch is a monitoring and observability service. It allows you to monitor your applications, respond to system-wide performance changes, optimize resource use, and get a unified view of operational health.
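As a minimal sketch of what such an alert could look like in code, the snippet below builds the arguments for a CloudWatch alarm on the ModelLatency metric that SageMaker publishes for real-time endpoints, and passes them to boto3's put_metric_alarm. The endpoint name, the alarm name, the threshold, and the evaluation window are all illustrative assumptions, not values taken from this chapter; the helper functions build_latency_alarm and create_alarm are hypothetical names introduced here.

```python
from typing import Any, Dict


def build_latency_alarm(
    endpoint_name: str,
    variant_name: str = "AllTraffic",
    threshold_us: float = 500_000.0,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    ModelLatency is reported in microseconds, so the assumed default
    threshold of 500_000 corresponds to 0.5 seconds.
    """
    return {
        "AlarmName": f"{endpoint_name}-high-model-latency",  # hypothetical naming scheme
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,              # aggregate the metric over 1-minute windows
        "EvaluationPeriods": 5,    # assumed: alarm after 5 consecutive breaches
        "Threshold": threshold_us,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an idle endpoint should not alarm
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**params)


if __name__ == "__main__":
    # "demo-endpoint" is a placeholder; replace it with a real endpoint name
    # and call create_alarm(alarm_params) to register the alarm in AWS.
    alarm_params = build_latency_alarm("demo-endpoint")
    print(alarm_params["AlarmName"])
```

Keeping the parameter construction separate from the AWS call makes the alarm definition easy to review and unit test without credentials. To notify someone when the alarm fires, an AlarmActions list containing an SNS topic ARN can be added to the parameters.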
At a higher level, we can split the infrastructure monitoring and alerting components...