Large-scale distributed model training
Think back to the discussion of AI/ML and cloud computing in Chapter 1, where I described scaling our models to larger sizes: we started with small models we could train on a laptop, progressed to larger models trained on a powerful, high-end server, and eventually reached a scale at which a single computer (even the most powerful server on the market) couldn’t handle either the size of the model or the dataset it is trained on. In this section, we’ll look in more detail at what it means to train such large-scale models.
We’ve covered the model training process in detail throughout this book, but because these concepts are important to the discussion of large-scale distributed training, I’ll briefly summarize the process here as a refresher. For this discussion, I will focus on supervised training of neural networks.
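Before turning to the summary, it may help to see the idea in miniature. The sketch below is a generic supervised training loop (gradient descent on a linear model, using plain numpy); the data, model, and hyperparameters are illustrative assumptions, not an example from this book, but the loop structure, forward pass, gradient computation, and parameter update, is the same skeleton that large-scale distributed training parallelizes.

```python
import numpy as np

# Minimal supervised training loop: fit a linear model with gradient
# descent on a synthetic regression task. Illustrative sketch only.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                  # input features
true_w = np.array([2.0, -1.0, 0.5])            # "ground truth" parameters
y = X @ true_w + 0.01 * rng.normal(size=256)   # labels with small noise

w = np.zeros(3)        # model parameters, initialized to zero
lr = 0.1               # learning rate
for step in range(200):
    pred = X @ w                       # forward pass: model predictions
    grad = X.T @ (pred - y) / len(y)   # gradient of mean squared error
    w -= lr * grad                     # gradient-descent parameter update

print(np.round(w, 2))  # recovered parameters, close to true_w
```

At scale, each piece of this loop is what gets distributed: the batch `X` may be sharded across workers, and the gradient `grad` aggregated across them, which is the subject of the sections that follow.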
The supervised training...