Distributed Training of Machine Learning Models
When it comes to Machine Learning (ML) model training, the primary goal for a data scientist or ML practitioner is to train the best possible model on the relevant data for the business use case at hand. Close behind that goal is the need to do so as quickly and cost-effectively as possible. So, how do we speed up model training? Moreover, the data or the model itself can sometimes be too large to fit into the memory of a single GPU. So how do we prevent out-of-memory (OOM) errors?
The simplest answer to this question is to throw more compute resources, in other words, more CPUs and GPUs, at the problem. Using larger compute hardware in this way is commonly referred to as a scale-up strategy. However, only a finite number of CPUs and GPUs can be squeezed into a single server. So, sometimes a scale-out strategy is required, whereby we add more servers into the mix, essentially distributing the training workload across a cluster of machines.
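To make the scale-out idea concrete, the sketch below shows one common form of it: data-parallel training, where each GPU (on one server or many) holds a full copy of the model, processes its own shard of the data, and gradients are averaged across all workers after every backward pass. This is a minimal illustrative example, not code from this chapter; it assumes PyTorch with CUDA GPUs and the NCCL backend, launched via torchrun.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model standing in for a real network (hypothetical sizes).
    model = nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # One synthetic training step: each process works on its own batch;
    # DDP all-reduces (averages) gradients across workers during backward.
    inputs = torch.randn(32, 128, device=local_rank)
    targets = torch.randint(0, 10, (32,), device=local_rank)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()   # gradients are synchronized across all replicas here
    optimizer.step()  # every replica applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single server with four GPUs this would be launched as `torchrun --nproc_per_node=4 train.py` (scale-up); adding `--nnodes` and a rendezvous endpoint spreads the same script across multiple servers (scale-out), with no change to the training code itself.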