Best practices for reliable ML workloads
Two considerations are at the core of a reliable system:
- First, the ability to recover from planned and unplanned disruptions
- Second, the ability to meet unpredictable increases in traffic demands
Ideally, the system should achieve both without affecting downstream applications and end consumers. In the following sections, we will discuss best practices for building reliable ML workloads using a combination of SageMaker and related AWS services.
Recovering from failure
For an ML workload, the ability to recover gracefully should be built into every step of the iterative ML process. A failure can occur in data storage, data processing, model training, or model hosting, and may result from a variety of events ranging from system failure to human error.
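One common building block for graceful recovery is retrying transient failures (throttling, timeouts, brief network errors) with exponential backoff and jitter, so a step such as a data download or a service call can survive short disruptions. Below is a minimal, generic sketch of that pattern in Python; it is not SageMaker-specific, and the helper name `with_retries` and its parameters are illustrative, not part of any AWS SDK:

```python
import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0,
                 retriable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient failures with exponential backoff and jitter.

    Non-retriable exceptions propagate immediately; retriable ones are
    re-raised only after max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Exponential backoff (1x, 2x, 4x, ...) with random jitter to
            # avoid many clients retrying in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

In practice, AWS SDKs such as boto3 provide built-in retry behavior for service calls, so a hand-rolled helper like this is mainly useful for wrapping your own flaky steps (for example, a custom data-loading function) rather than replacing the SDK's retries.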
For ML on SageMaker, all data (and model artifacts...