Monitoring and profiling training jobs with Amazon SageMaker Debugger
SageMaker Debugger includes a monitoring and profiling capability that lets us collect infrastructure and code performance information at a much finer time resolution than CloudWatch (as often as every 100 milliseconds). It also allows us to configure and trigger built-in or custom rules that watch for unwanted conditions in our training jobs.
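To make the sampling interval concrete, here is a minimal sketch that assembles the ProfilerConfig dictionary accepted by the low-level CreateTrainingJob API. The helper name build_profiler_config and the bucket path are hypothetical. The list of allowed intervals reflects the values documented for ProfilingIntervalInMilliseconds as far as we know, so it's worth checking against the current API reference.

```python
# Intervals (in ms) documented for ProfilingIntervalInMilliseconds.
# Treat this set as an assumption and verify it against the API reference.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def build_profiler_config(s3_output_path, interval_ms=100):
    """Return a ProfilerConfig dict for a CreateTrainingJob request.

    Hypothetical helper: it only assembles and validates the dictionary,
    it does not call AWS.
    """
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(
            f"interval_ms must be one of {ALLOWED_INTERVALS_MS}, got {interval_ms}"
        )
    return {
        "S3OutputPath": s3_output_path,
        "ProfilingIntervalInMilliseconds": interval_ms,
    }


# Example: sample system metrics every 100 milliseconds.
# The bucket name below is a placeholder.
config = build_profiler_config("s3://my-bucket/profiler-output", interval_ms=100)
print(config)
```

The SageMaker Python SDK exposes the same setting through its own configuration objects, so in practice we would rarely build this dictionary by hand.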
Profiling is very easy to use, and in fact, it's on by default! You may have noticed a line such as this one in your training log:
2021-06-14 08:45:30 Starting - Launching requested ML instancesProfilerReport-1623660327: InProgress
This tells us that SageMaker is automatically running a profiling job in parallel with our training job. Its role is to collect data points that we can then display in SageMaker Studio to visualize metrics and understand potential performance issues.
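The same status can also be read programmatically. The sketch below assumes a dictionary shaped like the response of boto3's describe_training_job call, which reports profiler rules under the ProfilerRuleEvaluationStatuses key. The helper name summarize_profiler_rules and the sample data are hypothetical: the sample is written by hand to mirror the log line above, and is not a real API response.

```python
def summarize_profiler_rules(description):
    """Map each profiler rule name to its evaluation status.

    `description` is expected to look like the dictionary returned by
    boto3's describe_training_job; a missing key yields an empty dict.
    """
    statuses = description.get("ProfilerRuleEvaluationStatuses", [])
    return {
        status["RuleConfigurationName"]: status["RuleEvaluationStatus"]
        for status in statuses
    }


# Hand-written sample mirroring the log line, not a real API response.
sample = {
    "TrainingJobName": "my-training-job",  # placeholder name
    "ProfilerRuleEvaluationStatuses": [
        {
            "RuleConfigurationName": "ProfilerReport-1623660327",
            "RuleEvaluationStatus": "InProgress",
        }
    ],
}

summary = summarize_profiler_rules(sample)
print(summary)
```

With a real job, we would pass in the result of boto3.client("sagemaker").describe_training_job(TrainingJobName=...) instead of the sample dictionary.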