Monitoring model training and compute resources with SageMaker Debugger
Training ML models using sagemaker.estimator.Estimator and related classes, such as sagemaker.pytorch.estimator.PyTorch and sagemaker.tensorflow.estimator.TensorFlow, gives us the flexibility and scalability we need when developing in SageMaker Studio. However, because the training jobs run on remote compute resources, debugging and monitoring them from a SageMaker Studio notebook is rather different from doing so on a local machine or a single EC2 instance. Being an IDE for ML, SageMaker Studio provides a comprehensive view of the managed training jobs through SageMaker Debugger. SageMaker Debugger helps developers monitor compute resource utilization, detect modeling-related issues, profile deep learning operations, and identify bottlenecks while their training jobs are running.
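To make "modeling-related issues" more concrete, the following minimal sketch flags a training loss that has stopped improving. It is only an illustration of the kind of condition a monitoring tool might watch for, not how SageMaker Debugger is implemented. The function name loss_stalled, its parameters patience and min_delta, and the sample loss values are all hypothetical.

```python
def loss_stalled(losses, patience=3, min_delta=1e-3):
    """Return True when the most recent `patience` loss values fail to
    improve on the best earlier loss by at least `min_delta`.

    Hypothetical helper for illustration only; the names and default
    thresholds are assumptions, not part of any SageMaker API.
    """
    if len(losses) <= patience:
        # Not enough history to compare recent values against earlier ones.
        return False
    best_before = min(losses[:-patience])   # best loss before the recent window
    recent_best = min(losses[-patience:])   # best loss inside the recent window
    return best_before - recent_best < min_delta


# Made-up loss curves: one keeps improving, one plateaus at 0.6.
healthy = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
stalled = [1.0, 0.8, 0.6, 0.6, 0.6, 0.6]

print(loss_stalled(healthy))
print(loss_stalled(stalled))
```

With a managed training job, a check like this would have to run against metrics collected from the remote instances rather than against a list held in local memory, which is part of what makes monitoring remote jobs different from monitoring a local one.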
SageMaker Debugger supports TensorFlow, PyTorch, MXNet, and XGBoost. By default, SageMaker Debugger is enabled in every SageMaker...