Managing long-running jobs with checkpointing and spot training
Training ML models at scale can be costly. Even with SageMaker's pay-as-you-go pricing for training instances, the cost of long-running deep learning jobs that use multiple expensive instances adds up quickly. SageMaker's fully managed spot training and checkpointing features allow us to manage and resume long-running jobs easily, helping us reduce training instance costs by up to 90% compared with on-demand instances.
SageMaker-managed spot training builds on the concept of spot instances from Amazon EC2. EC2 spot instances let you take advantage of unused instance capacity in an AWS Region at a much lower cost than regular on-demand instances. Spot instances are cheaper, but they can be interrupted when demand for instances from other AWS users rises. SageMaker-managed spot training manages the use of spot instances, including safe interruption and timely resumption of your training...
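To make the idea concrete, here is a minimal sketch of the settings that typically turn on managed spot training with checkpointing in the SageMaker Python SDK. The keyword names `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri` are estimator parameters in that SDK; the helper functions, the default time limits, and the bucket name are hypothetical and chosen only for illustration. The savings helper follows the way SageMaker reports spot savings, as one minus the ratio of billable seconds to total training seconds.

```python
def spot_training_kwargs(checkpoint_s3_uri, max_run=3600, max_wait=7200):
    """Build the spot-related keyword arguments for a SageMaker estimator.

    Hypothetical helper: the keys match SageMaker Python SDK estimator
    parameters, but the default values are illustrative assumptions.
    """
    # max_wait covers training time plus time spent waiting for spot
    # capacity, so it must not be smaller than max_run.
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,              # request spot capacity
        "max_run": max_run,                      # training time limit, in seconds
        "max_wait": max_wait,                    # training + waiting limit, in seconds
        "checkpoint_s3_uri": checkpoint_s3_uri,  # where checkpoints are synced
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved: (1 - billable time / training time) * 100."""
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical bucket name, used only for illustration.
kwargs = spot_training_kwargs("s3://example-bucket/checkpoints/")

# With the SageMaker SDK installed, these arguments would be passed along
# with the usual ones (image, role, instance type, and so on), for example:
#   estimator = sagemaker.estimator.Estimator(..., **kwargs)
```

The training script itself still has to write checkpoints to the local checkpoint directory and load the latest one on startup; otherwise an interrupted job restarts from the beginning rather than resuming.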