Distributed training with PyTorch
In the previous exercises in this book, we have implicitly assumed that model training happens on a single machine, within a single Python process on that machine. In this section, we will revisit the exercise from Chapter 1, Overview of Deep Learning Using PyTorch – the handwritten digit classification model – and transform the model training routine from regular training to distributed training. While doing so, we will explore the tools PyTorch offers for distributing the training process, thereby making it both faster and more hardware-efficient.
First, let's look at how the MNIST model can be trained without distributed training. We will then contrast this with a distributed PyTorch training pipeline.
Training the MNIST model in a regular fashion
The handwritten digits classification model that we built in Chapter 1, Overview of Deep Learning Using PyTorch, was in the form of a Jupyter notebook. Here, we will put that...
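To make the starting point concrete, the following is a minimal sketch of what such a script-based, single-process MNIST training routine might look like. The ConvNet architecture, hyperparameters, and function names here are illustrative assumptions, not the exact code from Chapter 1; only the standard torch and torchvision APIs are relied upon.

```python
# mnist_regular.py - illustrative single-process MNIST training sketch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class ConvNet(nn.Module):
    # A small convolutional network for 28x28 grayscale digit images
    # (hypothetical architecture for illustration).
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 14x14 -> 7x7
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def train(model, device, loader, optimizer, epoch):
    # One pass over the training data, entirely in this single process.
    model.train()
    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f"epoch {epoch} batch {batch_idx} loss {loss.item():.4f}")


def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
    train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
    model = ConvNet().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(1, 3):
        train(model, device, train_loader, optimizer, epoch)


if __name__ == "__main__":
    main()
```

Everything here runs in one Python process on one device; in the distributed version that follows, the same training loop is what gets replicated across multiple processes.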