Identifying issues with SageMaker Debugger
Amazon SageMaker Debugger is one of the more powerful capabilities of Amazon SageMaker for managing our ML experiments. With SageMaker Debugger, we can use Debugger rules to automatically detect issues in training jobs and to profile those jobs. Eliminating the issues and bottlenecks that surface this way helps shorten training time and reduce costs. SageMaker Debugger can also monitor the hardware resource usage of training jobs, which lets us catch resource-related problems early and optimize resource usage. SageMaker Debugger supports ML frameworks and algorithms such as XGBoost, PyTorch, TensorFlow, and MXNet.
There are several built-in Debugger rules to choose from. These include (but are not limited to) the VanishingGradient, PoorWeightInitialization, ExplodingTensor, DeadRelu, and LossNotDecreasing...
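To build some intuition for what rules such as LossNotDecreasing and VanishingGradient look for, here is a minimal, framework-free sketch in plain Python. It is only an illustration of the idea and not how SageMaker Debugger implements its built-in rules. The function names, the `window` and `min_rel_drop` parameters, and the `threshold` default are all hypothetical and chosen for this example.

```python
from typing import Sequence


def loss_not_decreasing(
    losses: Sequence[float],
    window: int = 3,             # hypothetical: number of recent steps to inspect
    min_rel_drop: float = 0.01,  # hypothetical: required relative improvement
) -> bool:
    """Return True if the loss has stalled over the last `window` steps.

    The loss recorded just before the window is the baseline. The check
    fires when the best loss inside the window improves on that baseline
    by less than `min_rel_drop` (relative to the baseline).
    """
    if len(losses) <= window:
        return False  # not enough history to judge
    baseline = losses[-window - 1]
    if baseline == 0:
        return False  # loss is already zero, nothing left to improve
    best_recent = min(losses[-window:])
    return (baseline - best_recent) / abs(baseline) < min_rel_drop


def vanishing_gradient(
    grad_abs_means: Sequence[float],
    threshold: float = 1e-7,     # hypothetical cutoff for "too small"
) -> bool:
    """Return True if any layer's mean absolute gradient is below `threshold`."""
    return any(g < threshold for g in grad_abs_means)


if __name__ == "__main__":
    healthy = [1.0, 0.8, 0.6, 0.5]
    stalled = [1.0, 0.5, 0.5, 0.5, 0.5]
    print(loss_not_decreasing(healthy))
    print(loss_not_decreasing(stalled))
    print(vanishing_gradient([1e-3, 1e-9]))
```

In an actual training job, we would not write these checks ourselves. With the SageMaker Python SDK, built-in rules are typically attached to an estimator through its `rules` parameter, and Debugger evaluates them against the tensors saved during training.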