Hyperparameter tuning

In this section, we will focus on the hyperparameters that are closely related to data parallel training: the global batch size, learning rate adjustment, and the model synchronization scheme.

Let's discuss them one by one.

Notes on Hyperparameters

While some of these hyperparameters also exist in standard single-node training, in data parallel training they may have new search dimensions and new correlations.

Global batch size

The global batch size refers to how many training samples are loaded into all the GPUs simultaneously for training. The counterpart of this concept in single-node training is the batch size, or mini-batch size.

Selecting the proper global batch size is different from selecting a single node's batch size. In single-node training, we always set the batch size to the maximum number of samples that can fit into the accelerator's memory without causing out-of-memory (OOM) errors. In data parallel training, given N GPUs, we cannot simply set the global batch size to N*Max(single_node), where Max(single_node) refers to the maximum batch size that fits on a single GPU.

In data parallel training, the global batch size is the first hyperparameter we need to search for or fine-tune. If the global batch size is too large, the model may not converge; if it is too small, we simply waste distributed computational resources.
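
As a minimal sketch of how the global batch size maps onto per-GPU batch sizes in PyTorch, each of the N processes loads global_batch_size / N samples per step, while DistributedSampler keeps the per-process shards disjoint. The dataset, the batch size value, and the variable names here are illustrative assumptions, and the process group is assumed to be initialized already (as shown later in this section):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this is the real training set
dataset = TensorDataset(torch.randn(10_000, 32),
                        torch.randint(0, 10, (10_000,)))

world_size = dist.get_world_size()       # N, the number of data parallel processes
global_batch_size = 1024                 # the hyperparameter we search/fine-tune
per_gpu_batch_size = global_batch_size // world_size

# Each process reads a disjoint shard and loads per_gpu_batch_size samples,
# so one training step consumes global_batch_size samples in total
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

Searching over global_batch_size then amounts to trading off convergence (too large) against hardware utilization (too small), as described above.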

Learning rate adjustment

Since we use a much larger global batch size than in single-node training, we also need to adjust the learning rate accordingly.

Rule of Thumb Regarding Learning Rate Adjustment

The rule-of-thumb policy for determining the learning rate in data parallel training is to multiply the single-node learning rate by N if we use N GPUs to do the data parallel training together.

Recent research literature also suggests that, for large-batch data parallel training, we should have a warmup stage at the very beginning of training. This warmup policy suggests starting data parallel training with a relatively small learning rate and gradually increasing it over the first several epochs, until it reaches a predefined peak learning rate, after which we stop increasing it.
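
A minimal sketch that combines the linear scaling rule with such a warmup, assuming PyTorch, an SGD optimizer, a single-node learning rate of 0.1, N = 8 GPUs, and a 5-epoch warmup (all of these concrete values are illustrative assumptions, not recommendations from the book):

import torch

model = torch.nn.Linear(32, 10)             # placeholder model
N = 8                                       # number of GPUs in the data parallel job
single_node_lr = 0.1                        # learning rate tuned on a single node
peak_lr = single_node_lr * N                # rule-of-thumb linear scaling
warmup_epochs = 5

optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr)

# Ramp the learning rate linearly up to peak_lr over the first warmup_epochs
# epochs, then hold it at the peak value
def warmup_factor(epoch):
    return min(1.0, (epoch + 1) / warmup_epochs)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

In the training loop, scheduler.step() is called once per epoch; after the warmup period, this schedule can be combined with whatever decay policy the rest of training uses.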

Model synchronization schemes

Now that we have chosen our global batch size and adjusted the learning rate accordingly, the next thing we need to do is select an appropriate model synchronization scheme. We need this because we must initialize a group of processes to run our data parallel training job in a distributed manner, where each process is responsible for the model synchronization of one machine or one GPU.

Let's take PyTorch as an example. Here, you need to initialize your process group as follows:

import torch.distributed

# Initialize the default process group for the data parallel training job
torch.distributed.init_process_group(backend='nccl',     # model synchronization backend
                                     init_method='...',  # rendezvous method/address
                                     world_size=N,       # total number of processes
                                     timeout=M)          # timeout for collective operations

Here, the first parameter we need to choose, backend='nccl', is the model synchronization backend. Right now, deep learning platforms such as PyTorch mainly support three different communication backends: NCCL, Gloo, and MPI.

The main differences among these three communication backends are as follows:

  • NCCL:
    • GPU only
    • No support for one-to-all communication primitives such as Scatter
    • No support for all-to-one communication primitives such as Gather
  • Gloo:
    • Mainly supports CPU; partial support for GPU
    • For CPU, it supports most communication primitives
    • For GPU, it only supports the most commonly used communication primitives, such as Broadcast and All-Reduce
    • No support for all-to-all communication
  • MPI:
    • CPU only
    • Supports special hardware communication, such as IP over InfiniBand

Among these three, here are some high-level suggestions for selecting a communication scheme (a small selection sketch follows this list):

  • For GPU clusters, use NCCL.
  • For CPU clusters, use Gloo first. If that doesn't work, try MPI.
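
A minimal sketch of these suggestions in code, assuming one process per device and a launcher such as torchrun that sets the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables (the automatic NCCL-to-Gloo fallback shown here is our assumption, not a requirement from the book):

import os
import torch
import torch.distributed as dist

# Prefer NCCL on GPU clusters; fall back to Gloo on CPU-only clusters
backend = 'nccl' if torch.cuda.is_available() else 'gloo'

dist.init_process_group(backend=backend,
                        init_method='env://',   # read the rendezvous address from the environment
                        world_size=int(os.environ['WORLD_SIZE']),
                        rank=int(os.environ['RANK']))

A script containing this snippet would typically be launched with something like torchrun --nproc_per_node=4 train.py, where train.py is a hypothetical script name.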

With that, we have discussed the three main communication schemes we can use in data parallel training jobs. Since the nodes we use for model training are typically GPUs, we usually set NCCL as our default communication backend.
