Utilizing Managed Spot Training and Checkpoints
Now that we have a better understanding of how to use the SageMaker Python SDK to train and deploy ML models, let’s look at a few additional options that can significantly reduce the cost of running training jobs. In this section, we will use the following SageMaker features and capabilities when training a second Image Classification model:
- Managed Spot Training
- Checkpointing
- Incremental Training
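As a rough preview, Managed Spot Training and checkpointing are usually switched on through a few keyword arguments of the SageMaker Python SDK `Estimator`: `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri`. The sketch below only builds and validates those arguments, so it runs without AWS access. The helper names `spot_training_kwargs` and `spot_savings_percent` and the bucket `example-bucket` are hypothetical and used for illustration only.

```python
def spot_training_kwargs(max_run, max_wait, checkpoint_s3_uri):
    """Build spot-related keyword arguments for a SageMaker Estimator.

    max_run:  maximum training time, in seconds.
    max_wait: maximum total time (waiting for spot capacity plus training),
              in seconds. It must not be smaller than max_run.
    """
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    if not checkpoint_s3_uri.startswith("s3://"):
        raise ValueError("checkpoint_s3_uri must be an S3 URI")
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings relative to the full training time, as a percentage."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical values: train for at most 1 hour, wait for at most 2 hours.
kwargs = spot_training_kwargs(
    max_run=3600,
    max_wait=7200,
    checkpoint_s3_uri="s3://example-bucket/checkpoints",  # assumed bucket
)

# With the SDK installed and configured, these could be passed along as:
# estimator = sagemaker.estimator.Estimator(
#     image_uri, role, instance_count=1, instance_type="...", **kwargs
# )
```

The savings helper mirrors how a completed job's training time and billable time can be compared: a job that trained for 1,000 seconds but was billed for 250 seconds would show savings of about 75%.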
In Chapter 2, Deep Learning AMIs, we mentioned that spot instances can be used to reduce the cost of running training jobs. Using spot instances instead of on-demand instances can reduce the overall cost by as much as 70% to 90%. So, why are spot instances cheaper? The trade-off is that these instances can be interrupted, and an interrupted training job with no checkpoint to resume from restarts from the beginning. If we were to train our models outside of SageMaker, we would have to prepare our own set of custom...