Troubleshooting accelerator performance
Before we can analyze our GPU performance, we need to understand how to debug and analyze performance on our training platform in general. SageMaker has some really nice solutions for this. First, all of your logs are sent to Amazon CloudWatch, another AWS service that helps you monitor your jobs. Each node in your cluster gets its own dedicated log stream, and you can read that stream to see your overall training environment, how SageMaker runs your job, what status your job is in, and all of the logs your script emits. Everything your script writes to standard output, including print statements, is automatically captured and stored in CloudWatch. The first step in debugging your code is to look at the logs and figure out what actually went wrong.
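To make this concrete, here is a minimal sketch of reading those per-node log streams with boto3 instead of the console. It assumes the job's streams live in the `/aws/sagemaker/TrainingJobs` log group and that each stream name starts with the training job name followed by a slash; check both against your own account before relying on them. The helper names (`list_job_streams`, `read_stream`, `print_job_logs`) are our own for illustration and are not part of any SDK.

```python
# Assumption: SageMaker training jobs write to this CloudWatch log group.
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def list_job_streams(logs_client, job_name):
    """Return the log stream names for one training job (one stream per node)."""
    streams = []
    # Assumption: stream names look like "<job-name>/<node-specific-suffix>".
    kwargs = {"logGroupName": LOG_GROUP, "logStreamNamePrefix": job_name + "/"}
    while True:
        page = logs_client.describe_log_streams(**kwargs)
        streams.extend(s["logStreamName"] for s in page.get("logStreams", []))
        token = page.get("nextToken")
        if not token:
            return streams
        kwargs["nextToken"] = token  # keep paging until every stream is listed


def read_stream(logs_client, stream_name, limit=100):
    """Return up to `limit` messages from the start of one node's log stream."""
    response = logs_client.get_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=stream_name,
        startFromHead=True,  # oldest events first, so startup errors show up early
        limit=limit,
    )
    return [event["message"] for event in response["events"]]


def print_job_logs(job_name):
    """Print the first log lines of every node in the job (needs AWS credentials)."""
    import boto3  # imported here so the helpers above work with any client object

    logs_client = boto3.client("logs")
    for stream_name in list_job_streams(logs_client, job_name):
        print(f"=== {stream_name} ===")
        for message in read_stream(logs_client, stream_name):
            print(message)
```

Calling `print_job_logs("my-training-job")`, with a hypothetical job name, prints the opening lines of each node's stream, which is usually where import errors and bad paths appear. The helpers take the client as an argument so you can point them at a different region or session without changing them.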
Once you know what’s wrong in your script, you’ll probably want to quickly fix it and get it back online, right? That’s why we introduced managed warm pools on SageMaker...