Combining model and data parallel
As you may have suspected, and as scaling laws show empirically, large models are only effective when paired with large datasets. If you train an extremely large model on a small or moderately sized dataset, you are very likely to overfit. The model may eventually learn to reproduce the core examples you’ve provided, but it is unlikely to handle new challenges well.
Surprisingly, the reverse is not necessarily true: a large dataset does not always call for a large model. As a general rule of thumb, it helps to grow the model size along with the dataset size. In computer vision, however, models rarely exceed the memory of a single GPU. The majority of vision customers I work with, from autonomous vehicles to manufacturing, financial services to health care, tend to use models that fit comfortably on a single GPU. In these cases, data parallel alone is a strong candidate to improve...