Distributed training with PyTorch
In all previous exercises in this book, we have implicitly assumed that model training happens on one machine, within a single Python process. In this section, we will revisit the exercise from Chapter 1, Overview of Deep Learning Using PyTorch, and transform the model training routine from regular to distributed training. In the process, we will explore the tools PyTorch offers for distributing the training process, thereby making it both faster and more hardware-efficient.
Training the MNIST model in a regular fashion
The handwritten digit classification model that we built in the first chapter was in the form of a Jupyter notebook. Here, we will first put that notebook code together as a single Python script file. The full code can be found on GitHub [1]. In the following steps, we recap the different parts of the model training code:
- In the Python script, we first import the relevant libraries:
import torch...
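The import list is truncated above; as a rough sketch, the typical imports for such an MNIST training script (assuming the convolutional model and torchvision-based data loading from the Chapter 1 exercise) might look like this:
import torch
import torch.nn as nn                     # layers for the convolutional model
import torch.nn.functional as F           # activation and loss functions
import torch.optim as optim               # optimizer (for example, Adadelta or SGD)
from torchvision import datasets, transforms  # MNIST dataset and preprocessing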