Knowledge distillation is a model compression technique in which a small model is trained to reproduce the behavior of a large pre-trained model. It is also referred to as teacher-student learning, where the large pre-trained model is the teacher and the small model is the student. Let's understand how knowledge distillation works with an example.
Suppose we have pre-trained a large model to predict the next word in a sentence. We call this large pre-trained model the teacher network. When we feed in a sentence, the network returns a probability distribution over the vocabulary, that is, the probability of each word being the next word, as shown in the following figure. For simplicity, we'll assume the vocabulary contains only five words:
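To make the idea of a probability distribution over the vocabulary concrete, here is a minimal sketch in plain Python. It assumes, as is typical for such models, that the teacher produces one raw score (logit) per word and that a softmax turns those scores into probabilities. The five words in `vocab` and the values in `teacher_logits` are made up purely for illustration; they are not taken from any trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical five-word vocabulary and made-up teacher scores (illustrative only).
vocab = ["homework", "book", "assignment", "cake", "car"]
teacher_logits = [4.0, 2.5, 1.5, -1.0, -2.0]

# One probability per word; together they sum to 1.
teacher_probs = softmax(teacher_logits)

# The predicted next word is the one with the highest probability.
predicted_word = vocab[teacher_probs.index(max(teacher_probs))]

for word, p in zip(vocab, teacher_probs):
    print(f"{word:>10s}: {p:.3f}")
print("predicted next word:", predicted_word)
```

Every word receives a nonzero probability, even the unlikely ones, and the word with the largest score is selected as the prediction.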
From the preceding figure, we can observe the probability distribution returned by the network. This...