Gaining insight into the training infrastructure and training framework
In this section, you will learn how to gain visibility into the resource utilization of both the training infrastructure and the training framework. You will also learn how to analyze and act on the recommendations produced by the deep profiler capability of SageMaker Debugger.
The Debugger profiler gives you visibility into the utilization of the infrastructure running ML training jobs on SageMaker. Debugger automatically monitors system resources such as CPU, GPU, network, I/O, and memory. It also collects metrics specific to the training framework, such as step duration, data loading, preprocessing, and operator runtime on CPU and GPU. To collect these framework metrics, you can profile the training job in its entirety or only selected portions of it.
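To make the choice between profiling the whole job and profiling only a portion concrete, the following is a minimal plain-Python sketch of a step-based profiling window. It is an illustration only and does not use the SageMaker API: the `ProfilingWindow` class, its `start_step` and `num_steps` fields, and the `profiled_steps` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProfilingWindow:
    """Hypothetical step range for framework profiling (not a SageMaker class)."""

    start_step: int = 0
    # None means "keep profiling until the training job ends".
    num_steps: Optional[int] = None

    def covers(self, step: int) -> bool:
        """Return True if framework metrics would be collected for this step."""
        if step < self.start_step:
            return False
        if self.num_steps is None:
            return True
        return step < self.start_step + self.num_steps


def profiled_steps(window: ProfilingWindow, total_steps: int) -> List[int]:
    """List the training steps (0-based) that fall inside the window."""
    return [step for step in range(total_steps) if window.covers(step)]


# Profile the job in its entirety.
whole_job = ProfilingWindow()
# Profile only a portion: three steps starting at step 5.
portion = ProfilingWindow(start_step=5, num_steps=3)

print(profiled_steps(whole_job, 4))   # [0, 1, 2, 3]
print(profiled_steps(portion, 10))    # [5, 6, 7]
```

Restricting profiling to a short window is a common way to limit the overhead and data volume of detailed framework metrics, while system metrics continue to be monitored for the whole job.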
In addition to collecting the system and framework metrics, behind the scenes, Debugger correlates these metrics automatically, which...