Chapter 7: Profile Training Jobs with Amazon SageMaker Debugger
Training machine learning (ML) models involves experimenting with multiple algorithms and their hyperparameters, typically while crunching through large volumes of data. Training a model that yields optimal results is both a time- and compute-intensive task. Reducing training time improves productivity and lowers overall training costs.
Distributed training, as we discussed in Chapter 6, Training and Tuning at Scale, goes a long way toward improving training times by using a scalable compute cluster. However, monitoring the training infrastructure to identify and debug resource bottlenecks is not trivial. Once a training job has been launched, the process becomes opaque, and you have little visibility into what is happening inside it. Equally non-trivial is monitoring training jobs in real time so you can detect suboptimal ones and stop them early, avoiding wasted training time and resources.
Amazon SageMaker Debugger addresses these challenges by giving you visibility into running training jobs: it captures system metrics and framework-level profiling data while a job runs, and its built-in rules can automatically flag resource bottlenecks and suboptimal training conditions as they occur.
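To make this concrete, the following is a minimal sketch of enabling SageMaker Debugger profiling on a training job using the SageMaker Python SDK. The training script name, source directory, instance type, and framework versions are placeholder assumptions for illustration; substitute your own.

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import (
    ProfilerConfig,
    FrameworkProfile,
    ProfilerRule,
    rule_configs,
)

# Capture system metrics (CPU, GPU, network, disk I/O) every 500 ms,
# plus framework-level metrics such as data loading and
# forward/backward pass timings.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(),
)

# Built-in rule that analyzes the collected profiler data and produces
# a report flagging issues such as low GPU utilization.
rules = [ProfilerRule.sagemaker(rule_configs.ProfilerReport())]

estimator = PyTorch(
    entry_point="train.py",             # hypothetical training script
    source_dir="src",                   # hypothetical source directory
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # example GPU instance
    framework_version="1.12",
    py_version="py38",
    profiler_config=profiler_config,
    rules=rules,
)

# Launch the job; pass your data channels to fit() if the script needs them.
estimator.fit()
```

While this job runs, Debugger monitors it in the background; the profiler rule evaluates the incoming metrics and surfaces its findings without you having to instrument the training script itself.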