Hyperparameter tuning

In this section, we will focus on the hyperparameters that are closely related to data parallel training: the global batch size, learning rate adjustment, and the model synchronization scheme.

Let's discuss them one by one.

Notes on Hyperparameters

While some of these hyperparameters also exist in standard single-node training, in data parallel training they may have new search dimensions and new correlations.

Global batch size

The global batch size refers to how many training samples are loaded into all the GPUs simultaneously for training. The counterpart of this concept in single-node training is the batch size, or mini-batch size.

Selecting the proper global batch size is different from selecting a single node's batch size. In single-node training, we always set the batch size to the maximum number of samples that can fit into the accelerator's memory without causing out-of-memory (OOM) errors. In data parallel training, given N GPUs, we cannot simply set the global batch size to N*Max(single_node), where Max(single_node) refers to the maximum batch size that fits on a single GPU.

In data parallel training, the global batch size is the first hyperparameter we need to search for or fine-tune. If the global batch size is too large, the model may not converge; if it is too small, we simply waste distributed computational resources.
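
As a minimal sketch of how the global batch size maps onto per-GPU batch sizes in PyTorch, each of the N processes loads global_batch_size / N samples per step, while DistributedSampler keeps the per-process shards disjoint. The dataset, the batch size value, and the variable names here are illustrative assumptions, and the process group is assumed to be initialized already (as shown later in this section):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this is the real training set
dataset = TensorDataset(torch.randn(10_000, 32),
                        torch.randint(0, 10, (10_000,)))

world_size = dist.get_world_size()       # N, the number of data parallel processes
global_batch_size = 1024                 # the hyperparameter we search/fine-tune
per_gpu_batch_size = global_batch_size // world_size

# Each process reads a disjoint shard and loads per_gpu_batch_size samples,
# so one training step consumes global_batch_size samples in total
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

Searching over global_batch_size then amounts to trading off convergence (too large) against hardware utilization (too small), as described above.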

Learning rate adjustment

Since we use a much larger global batch size than in single-node training, we also need to adjust the learning rate accordingly.

Rule of Thumb Regarding Learning Rate Adjustment

The rule-of-thumb policy for determining the learning rate in data parallel training is to multiply the single-node learning rate by N if we use N GPUs to do the data parallel training together.

Recent research literature also suggests that, for large-batch data parallel training, we should have a warmup stage at the very beginning of training. This warmup policy suggests starting data parallel training with a relatively small learning rate and gradually increasing it over the first several epochs, until it reaches a predefined peak learning rate, after which we stop increasing it.
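
A minimal sketch that combines the linear scaling rule with such a warmup, assuming PyTorch, an SGD optimizer, a single-node learning rate of 0.1, N = 8 GPUs, and a 5-epoch warmup (all of these concrete values are illustrative assumptions, not recommendations from the book):

import torch

model = torch.nn.Linear(32, 10)             # placeholder model
N = 8                                       # number of GPUs in the data parallel job
single_node_lr = 0.1                        # learning rate tuned on a single node
peak_lr = single_node_lr * N                # rule-of-thumb linear scaling
warmup_epochs = 5

optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr)

# Ramp the learning rate linearly up to peak_lr over the first warmup_epochs
# epochs, then hold it at the peak value
def warmup_factor(epoch):
    return min(1.0, (epoch + 1) / warmup_epochs)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

In the training loop, scheduler.step() is called once per epoch; after the warmup period, this schedule can be combined with whatever decay policy the rest of training uses.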

Model synchronization schemes

Now that we have chosen our global batch size and adjusted the learning rate accordingly, the next thing we need to do is select an appropriate model synchronization scheme. We need this because we must initialize a group of processes to run our data parallel training job in a distributed manner, where each process is responsible for the model synchronization of one machine or one GPU.

Let's take PyTorch as an example. Here, you need to initialize your process group as follows:

import torch.distributed

# Initialize the default process group for the data parallel training job
torch.distributed.init_process_group(backend='nccl',     # model synchronization backend
                                     init_method='...',  # rendezvous method/address
                                     world_size=N,       # total number of processes
                                     timeout=M)          # timeout for collective operations

Here, the first parameter we need to choose, backend='nccl', is the model synchronization backend. Right now, deep learning platforms such as PyTorch mainly support three different communication backends: NCCL, Gloo, and MPI.

The main differences among these three communication backends are as follows:

  • NCCL:
    • GPU only
    • No support for one-to-all communication primitives such as Scatter
    • No support for all-to-one communication primitives such as Gather
  • Gloo:
    • Mainly supports CPU; partial support for GPU
    • For CPU, it supports most communication primitives
    • For GPU, it only supports the most commonly used communication primitives, such as Broadcast and All-Reduce
    • No support for all-to-all communication
  • MPI:
    • CPU only
    • Supports special hardware communication, such as IP over InfiniBand

Among these three, here are some high-level suggestions for selecting a communication scheme (a small selection sketch follows this list):

  • For GPU clusters, use NCCL.
  • For CPU clusters, use Gloo first. If that doesn't work, try MPI.
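
A minimal sketch of these suggestions in code, assuming one process per device and a launcher such as torchrun that sets the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables (the automatic NCCL-to-Gloo fallback shown here is our assumption, not a requirement from the book):

import os
import torch
import torch.distributed as dist

# Prefer NCCL on GPU clusters; fall back to Gloo on CPU-only clusters
backend = 'nccl' if torch.cuda.is_available() else 'gloo'

dist.init_process_group(backend=backend,
                        init_method='env://',   # read the rendezvous address from the environment
                        world_size=int(os.environ['WORLD_SIZE']),
                        rank=int(os.environ['RANK']))

A script containing this snippet would typically be launched with something like torchrun --nproc_per_node=4 train.py, where train.py is a hypothetical script name.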

With that, we have discussed the three main communication schemes we can use in data parallel training jobs. Since the nodes we use for model training are typically GPUs, we usually set NCCL as our default communication backend.
