Chapter 10: Scaling and Managing Training
So far, we have been on an exciting journey through the realm of Deep Learning (DL). We have learned how to recognize images, how to create new images and generate new text, and how to train models without fully labeled datasets. It is an open secret that achieving good results with a DL model requires massive compute power, often in the form of a Graphics Processing Unit (GPU). We have come a long way since the early days of DL, when data scientists had to manually distribute training across each node and GPU. PyTorch Lightning abstracts away most of the complexity of managing the underlying hardware and moving training onto GPUs.
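To make that abstraction concrete, here is a minimal sketch of pointing a Lightning `Trainer` at GPU hardware with no manual device management. The `TinyModel` class and the synthetic dataset are hypothetical placeholders, not part of any earlier chapter:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    """A hypothetical minimal LightningModule for illustration."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Synthetic data so the example is self-contained.
dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=32)

# Lightning handles device placement; no manual .to("cuda") calls needed.
trainer = pl.Trainer(
    accelerator="gpu",  # or "cpu" if no GPU is available
    devices=1,
    max_epochs=1,
)
trainer.fit(TinyModel(), loader)
```

Note that the model code itself never mentions a device; switching from CPU to one or more GPUs is purely a matter of `Trainer` arguments.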
In earlier chapters, we moved training onto the GPU by brute force. However, that approach is not practical when you are dealing with a massive training effort over large-scale data. In this chapter, we will take a more nuanced view of the challenges of training a model at scale and managing...