Knowledge distillation – obtaining a smaller network by mimicking the prediction
The idea of knowledge distillation was first introduced in 2015 by Hinton et al. in their publication titled Distilling the Knowledge in a Neural Network. In classification problems, a Softmax activation is often used as the last operation of the network to represent the confidence for each class as a probability. Since the class with the highest probability is taken as the final prediction, the probabilities of the other classes have traditionally been considered unimportant. However, the authors argue that these probabilities still carry meaningful information about how the model interprets the input. For example, if two classes consistently receive similar probabilities across many samples, they likely share many characteristics that make distinguishing between them difficult. Such information becomes even more fruitful when the network is deep because it can extract more information from the...
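To make the idea concrete, the following is a minimal sketch of the distillation loss proposed by Hinton et al., written here in PyTorch as an assumed framework; the function name, temperature, and weighting factor are illustrative choices, not values fixed by the paper or by this chapter.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend the usual hard-label cross-entropy with a term that pushes the
    student's softened probabilities toward the teacher's (a sketch)."""
    # Soften both distributions with the same temperature so that the small
    # probabilities assigned to the non-target classes become visible.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor keeps
    # its gradient magnitude comparable to the hard-label term.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

During training, the teacher's logits are computed with gradients disabled and only the student's parameters are updated, so the smaller network learns to reproduce the richer class-similarity structure encoded in the teacher's soft predictions.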