Chapter 7: Profile Training Jobs with Amazon SageMaker Debugger
Training machine learning (ML) models involves experimenting with multiple algorithms and their hyperparameters, typically while crunching through large volumes of data. Training a model that yields optimal results is both a time- and compute-intensive task. Reducing training time improves productivity and lowers overall training costs.
Distributed training, as we discussed in Chapter 6, Training and Tuning at Scale, goes a long way toward improving training times by using a scalable compute cluster. However, monitoring the training infrastructure to identify and debug resource bottlenecks is not trivial. Once a training job has been launched, the process becomes opaque, and you have little visibility into how the model is training. Equally non-trivial is monitoring jobs in real time so that sub-optimal ones can be detected and stopped early, avoiding wasted training time and resources.
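To make this concrete before we dive in, here is a minimal sketch of how profiling and early stopping can be attached to a training job with the SageMaker Python SDK. The training script path, IAM role, instance type, and S3 input are placeholders, and the specific rule and intervals shown are illustrative choices, not the chapter's final example:

```python
from sagemaker.debugger import (
    ProfilerConfig,
    FrameworkProfile,
    Rule,
    rule_configs,
)
from sagemaker.pytorch import PyTorch

# Collect system metrics (CPU, GPU, network, I/O) every 500 ms, plus
# framework-level metrics for a short window of training steps.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10),
)

# A built-in rule that watches the training loss and stops the job early
# when the loss stalls, so compute is not wasted on a sub-optimal run.
stop_when_stalled = Rule.sagemaker(
    rule_configs.loss_not_decreasing(),
    actions=rule_configs.ActionList(rule_configs.StopTraining()),
)

estimator = PyTorch(
    entry_point="scripts/train.py",  # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    profiler_config=profiler_config,
    rules=[stop_when_stalled],
)
estimator.fit("s3://my-bucket/training-data/")  # placeholder S3 input
```

With this configuration, the job emits resource and framework metrics while it runs, and the rule evaluates the loss in the background, terminating the job if no progress is being made.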
Amazon...