TinyBERT is another interesting variant of BERT that also uses knowledge distillation. With DistilBERT, we learned how to transfer knowledge from the output layer of the teacher BERT to the student BERT. But apart from this, can we also transfer knowledge from the other layers of the teacher BERT? Yes, we can!
In TinyBERT, apart from transferring knowledge from the output layer (the prediction layer) of the teacher to the student, we also transfer knowledge from the embedding and encoder layers.
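To make the idea concrete, here is a minimal sketch of the loss terms such layer-wise distillation involves, written with PyTorch. All dimensions, the projection matrix `W`, and the temperature `T` are illustrative assumptions, not the exact published setup; the point is only that each layer contributes its own distillation loss.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration (not the exact published config):
# teacher hidden size 768, student hidden size 312.
batch, seq_len = 2, 8
d_teacher, d_student = 768, 312

# Hidden-state distillation: a learnable projection W maps the student's
# hidden states into the teacher's space so the two can be compared with MSE.
W = torch.nn.Linear(d_student, d_teacher, bias=False)
teacher_hidden = torch.randn(batch, seq_len, d_teacher)
student_hidden = torch.randn(batch, seq_len, d_student)
hidden_loss = F.mse_loss(W(student_hidden), teacher_hidden)

# Attention distillation: MSE between the teacher's and student's
# attention matrices (one seq_len x seq_len matrix per head).
teacher_attn = torch.randn(batch, 12, seq_len, seq_len)
student_attn = torch.randn(batch, 12, seq_len, seq_len)
attn_loss = F.mse_loss(student_attn, teacher_attn)

# Prediction-layer distillation: cross-entropy between the student's logits
# and the teacher's softened output distribution (temperature T).
vocab, T = 30522, 1.0
teacher_logits = torch.randn(batch, vocab)
student_logits = torch.randn(batch, vocab)
pred_loss = F.cross_entropy(student_logits / T,
                            F.softmax(teacher_logits / T, dim=-1))

# The student is trained on the sum of the per-layer losses.
total_loss = hidden_loss + attn_loss + pred_loss
```

Embedding-layer distillation works analogously to the hidden-state term: the student's embeddings are projected into the teacher's embedding space and compared with MSE.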
Let's understand this with an example. Suppose we have a teacher BERT with N encoder layers; for simplicity, only one encoder layer is shown in the following figure. The figure depicts the pre-trained teacher BERT model: we feed in a masked sentence and it returns the logits of every word in our vocabulary being the masked word.