Exploring models with SageMaker Debugger
SageMaker Debugger lets you configure debugging rules for your training job. These rules inspect the job's internal state and check for specific unwanted conditions that may develop during training. SageMaker Debugger includes a long list of built-in rules (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html), and you can add your own rules, written in Python.
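To make the idea of a rule more concrete, here is a framework-free sketch of the kind of check a rule performs: it walks through the loss values recorded at each step and flags the first step where the loss has stopped improving. This is only an illustration of the logic, not the SageMaker Debugger API; the function name and its patience and min_delta parameters are hypothetical.

```python
def loss_not_decreasing(losses, patience=3, min_delta=0.0):
    """Return the first step at which the loss has failed to improve
    for `patience` consecutive steps, or None if that never happens.

    Hypothetical helper for illustration only; it is not part of SageMaker.
    """
    best = float("inf")   # lowest loss seen so far
    stale = 0             # consecutive steps without improvement
    for step, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None


# Example: the loss plateaus after step 2, so the check fires at step 5.
recorded = [1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.6]
flagged_step = loss_not_decreasing(recorded, patience=3)
```

A real rule would apply the same kind of logic to tensors saved by the training job rather than to a Python list.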
In addition, you can save and inspect the model state (gradients, weights, and so on) as well as the training state (metrics, optimizer parameters, and so on). At each training step, the tensors storing these values can be saved to an S3 bucket in near real time, making it possible to visualize them while the model is training.
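As a rough sketch of how saved tensors could be read back for plotting, the function below uses the open source smdebug library. It assumes that smdebug is installed, that the job saved a collection named losses, and that tensors_path points to the location where the job wrote its tensors; the function name load_saved_losses is hypothetical.

```python
def load_saved_losses(tensors_path):
    """Return {tensor_name: [(step, value), ...]} for the 'losses' collection.

    Hypothetical helper; `tensors_path` is assumed to be the S3 (or local)
    location where the training job saved its tensors.
    """
    # Imported lazily so this sketch can be loaded without smdebug installed.
    from smdebug.trials import create_trial

    trial = create_trial(tensors_path)
    history = {}
    for name in trial.tensor_names(collection="losses"):
        tensor = trial.tensor(name)
        # Only steps that have already been written are listed, so this can
        # be called repeatedly while the job is still running.
        history[name] = [(step, tensor.value(step)) for step in tensor.steps()]
    return history
```

The returned lists of (step, value) pairs can then be handed to the plotting library of your choice.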
You can choose which tensor collections to save, how often to save them, and so on. The available collections depend on the framework you use. You can find more information at https://github.com/awslabs...
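The following sketch shows one way these settings might be expressed with the SageMaker Python SDK. It assumes the sagemaker package is installed; the function name build_debugger_settings, the two rules, the two collections, and the save intervals are illustrative choices rather than recommendations.

```python
def build_debugger_settings(s3_output_path):
    """Return (rules, hook_config) to pass to a SageMaker estimator.

    Hypothetical helper; `s3_output_path` is an S3 URI of your choosing.
    """
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from sagemaker.debugger import (
        CollectionConfig,
        DebuggerHookConfig,
        Rule,
        rule_configs,
    )

    # Two built-in rules, picked only as examples.
    rules = [
        Rule.sagemaker(rule_configs.overfit()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ]

    # Which tensor collections to save, and how often (in training steps).
    hook_config = DebuggerHookConfig(
        s3_output_path=s3_output_path,
        collection_configs=[
            CollectionConfig(name="losses", parameters={"save_interval": "10"}),
            CollectionConfig(name="gradients", parameters={"save_interval": "50"}),
        ],
    )
    return rules, hook_config
```

The two returned objects would then be passed to an estimator through its rules and debugger_hook_config parameters.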