Summary
In this chapter, you learned how to use Amazon SageMaker Debugger to gain visibility into the training process, the training infrastructure, and the training framework. This visibility allows you to react to typical training issues, such as overfitting and problems with training loss, and to stop training jobs early instead of letting them run to completion only to produce sub-optimal models. Using recommendations from the deep profiling capabilities of Amazon SageMaker Debugger, you also learned how to reduce the training time and cost of your training jobs.
Using the debugger capabilities discussed in this chapter, you can continuously improve your training jobs by tuning the underlying ML framework parameters and the training infrastructure configuration for faster and more cost-effective ML training. In the next chapter, you will learn how to manage trained models at scale.