We started off the chapter by understanding what knowledge distillation is and how it works. We learned that knowledge distillation is a model compression technique in which a small model is trained to reproduce the behavior of a large pre-trained model. It is also referred to as teacher-student learning, where the large pre-trained model is the teacher and the small model is the student.
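To make the idea concrete, the following is a minimal sketch of a distillation loss in PyTorch, assuming we already have the teacher's and student's logits for a batch; the temperature value and loss weighting shown here are illustrative choices, not the exact settings used in the chapter:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both output distributions with the temperature, then
    # match the student to the teacher with KL divergence (soft loss).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_predictions, soft_targets,
                         reduction='batchmean') * (temperature ** 2)

    # The student is also trained on the ground-truth labels as usual (hard loss).
    hard_loss = F.cross_entropy(student_logits, labels)

    # The final loss is a weighted sum of the soft (distillation) and hard losses.
    return alpha * soft_loss + (1 - alpha) * hard_loss
```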
Next, we learned about DistilBERT, where we take a large pre-trained BERT model as the teacher and transfer its knowledge to a small BERT model (the student) through knowledge distillation.
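For reference, the distilled model is available off the shelf in the Hugging Face transformers library; a quick sketch of loading the pre-trained DistilBERT model and tokenizer and obtaining representations looks like this:

```python
from transformers import DistilBertModel, DistilBertTokenizer

# Load the pre-trained DistilBERT model and its tokenizer.
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Encode a sample sentence and obtain the hidden-state representations.
inputs = tokenizer("Paris is a beautiful city", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```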
Following DistilBERT, we learned how TinyBERT works. In TinyBERT, apart from transferring knowledge from the output (prediction) layer of the teacher to the student, we also transfer knowledge from other layers, such as the embedding layer and the transformer layers.
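As a rough sketch of what these layer-wise objectives look like, the snippet below computes TinyBERT-style losses given the teacher's and student's intermediate outputs; the projection matrices (used because the student's hidden size is smaller than the teacher's) and the layer mapping are assumptions for illustration, not the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

mse = nn.MSELoss()

def embedding_layer_loss(student_emb, teacher_emb, W_e):
    # Project the student's embeddings to the teacher's hidden size, then match with MSE.
    return mse(student_emb @ W_e, teacher_emb)

def transformer_layer_loss(student_hidden, teacher_hidden,
                           student_attention, teacher_attention, W_h):
    # Hidden-state loss: project the student's hidden states and match the teacher's.
    hidden_loss = mse(student_hidden @ W_h, teacher_hidden)
    # Attention loss: match the student's attention matrices to the teacher's.
    attention_loss = mse(student_attention, teacher_attention)
    return hidden_loss + attention_loss

def prediction_layer_loss(student_logits, teacher_logits, temperature=1.0):
    # Soft cross-entropy between the teacher's and student's softened distributions.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```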
Finally, at the end of the chapter, we learned how to transfer task-specific knowledge from BERT to a simple neural network. In the next chapter, we will learn how to fine-tune the...