Issues with the parameter server
In recent years, fewer and fewer machine learning practitioners have been using the parameter server paradigm for their data parallel training jobs. The reasons for this decline in the popularity of the parameter server architecture are twofold.
Given N nodes, it is unclear what the best ratio is between parameter servers and workers.
As we've mentioned previously, in the parameter server architecture, we have two roles:
- Parameter servers:
  - Do no training themselves, so they contribute zero training throughput.
  - Adding more parameter servers increases the aggregate communication bandwidth and reduces model synchronization latency.
- Workers:
  - Adding more workers increases training throughput.
  - More workers also means more data to transfer, which increases model synchronization overhead.
We need to balance training throughput and communication latency. We will discuss this trade-off in the following two cases.
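To make this trade-off concrete, here is a minimal sketch of a hypothetical cost model. The function names, constants (model size, link bandwidth, compute time per step), and the linear communication model are all assumptions made purely for illustration, not taken from any real system or from the parameter server implementations discussed elsewhere.

```python
# A minimal, hypothetical cost model for splitting N nodes between parameter
# servers and workers. Constants and the linear communication model are
# illustrative assumptions, not measurements from a real system.

def per_step_time(num_nodes: int, num_ps: int,
                  model_size_gb: float = 1.0,
                  link_bw_gb_per_s: float = 10.0,
                  compute_time_s: float = 0.5) -> float:
    """Rough wall-clock time of one training step.

    Assumes the model is sharded evenly across parameter servers and every
    worker pushes/pulls its gradients to/from each server, so synchronization
    time grows with (model size / number of servers) * number of workers.
    """
    num_workers = num_nodes - num_ps
    if num_ps < 1 or num_workers < 1:
        raise ValueError("need at least one parameter server and one worker")
    shard_gb = model_size_gb / num_ps                       # shard held per server
    sync_time = shard_gb * num_workers / link_bw_gb_per_s   # simplified push + pull
    return compute_time_s + sync_time


def samples_per_second(num_nodes: int, num_ps: int,
                       batch_per_worker: int = 32, **kwargs) -> float:
    """Training throughput: only workers contribute training samples."""
    num_workers = num_nodes - num_ps
    return num_workers * batch_per_worker / per_step_time(num_nodes, num_ps, **kwargs)


if __name__ == "__main__":
    N = 16
    for ps in range(1, N):
        print(f"{ps:2d} servers / {N - ps:2d} workers -> "
              f"{samples_per_second(N, ps):7.1f} samples/s")
```

With these made-up constants, throughput first rises and then falls as more of the 16 nodes become parameter servers: too few servers and synchronization dominates each step, too many and there are not enough workers left to generate training throughput. This is exactly the balancing problem described above.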
Case 1 – more parameter servers
If we assign more nodes as parameter servers, we have less data to communicate since we have fewer...