Distributing training jobs
Distributed training lets you scale training jobs by running them on a cluster of CPU or GPU instances. It addresses two different problems: datasets that are too large, and models that are too large.
Understanding data parallelism and model parallelism
Some datasets are too large to train on in a reasonable amount of time with a single CPU or GPU. Using a technique called data parallelism, we can distribute the data across the training cluster. The full model is still loaded on each CPU or GPU, but each one receives only an equal share of the dataset, not the whole dataset. In theory, this should speed up training linearly with the number of CPUs or GPUs involved; as you can guess, reality is often different.
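As an illustration, here is a minimal sketch of data parallelism, assuming PyTorch and its DistributedDataParallel wrapper (the framework choice, the toy model, and the synthetic dataset are all assumptions, not something this section prescribes). Each process holds a full replica of the model, reads only its own shard of the data through DistributedSampler, and gradients are averaged across processes during the backward pass.

# A minimal sketch of data parallelism with PyTorch DistributedDataParallel.
# The model and dataset are placeholders; launch with one process per GPU, e.g.:
#   torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # The full model is replicated on every GPU...
    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ...but each process only ever sees its own shard of the dataset.
    dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()       # gradients are averaged across all processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()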
Believe it or not, some state-of-the-art deep learning models are too large to fit on a single GPU. Using a technique called model parallelism, we can split the model and distribute its layers across a cluster of GPUs. During training, batches then flow from GPU to GPU as they pass through successive groups of layers.
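To make the idea concrete, here is a minimal sketch of model parallelism, again assuming PyTorch; TwoStageNet is a hypothetical two-stage example on two GPUs, not a real library class. The first group of layers is placed on one GPU and the second group on another, so each batch has to move from device to device as it crosses the split point.

# A minimal sketch of model parallelism: the layers of a single model are
# split across two GPUs, and activations move from one device to the next.
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The first group of layers lives on GPU 0...
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ...and the second group lives on GPU 1.
        self.stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        # The batch flows across GPUs as it crosses the split point.
        x = self.stage1(x.to("cuda:0"))
        x = self.stage2(x.to("cuda:1"))
        return x

model = TwoStageNet()
x = torch.randn(64, 1024)
y = torch.randint(0, 10, (64,)).to("cuda:1")  # labels live where the output is
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()  # autograd handles the cross-device backward pass

Note that a naive split like this keeps only one GPU busy at a time; real-world setups usually add pipelining so that different GPUs work on different micro-batches simultaneously.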