When none of the other techniques work, one last option is model distillation. The general idea is to train a small model to reproduce the output of a bigger model: instead of training the small model on the raw labels (which we could do, since we have the data), we train it on the predictions the bigger model produces for that data.
Let's see an example: we trained a very large network to predict an animal's breed from a picture. Its output is as follows:
Because our model is too large to run on mobile, we decided to train a smaller one. Rather than training it on the labels we already have, we will distill the knowledge of the larger network by using its outputs as the smaller model's training targets.
For the first picture, instead of training the new model with the hard target [1, 0, 0], we use the larger network's output, [0.9, 0.7, 0.1], as the target. This new target is called a soft target.
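Here is a minimal sketch of what one training step could look like, assuming PyTorch, a pretrained `teacher`, a smaller `student` that both output class logits, and a batch of `images`; all names are illustrative, not a specific library API. The student is trained to match the teacher's output probabilities rather than the original hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, optimizer):
    """One training step where the student learns the teacher's soft targets."""
    teacher.eval()
    with torch.no_grad():
        # Soft targets: the larger network's predicted probabilities.
        soft_targets = F.softmax(teacher(images), dim=1)

    # Train the student to match the soft targets instead of the hard labels.
    student_log_probs = F.log_softmax(student(images), dim=1)
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, distillation setups often also soften the teacher's outputs with a temperature and blend the distillation loss with a standard cross-entropy loss on the true labels, but the core idea is the one shown above: the small model learns from the big model's outputs.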