In this section, let's look at an interesting paper, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, by researchers at the University of Waterloo. In this paper, the researchers explain how to perform knowledge distillation and transfer task-specific knowledge from BERT to a simple neural network. Let's get into the details and understand how exactly this works.
Teacher-student architecture
To understand exactly how we transfer task-specific knowledge from BERT to a simple neural network, let's first take a look at the teacher BERT and the student network in detail.
The teacher BERT
We use pre-trained BERT as the teacher; specifically, we use the pre-trained BERT-large model. Note that, here, we are transferring task-specific knowledge from the teacher to the student. So, first, we take the pre-trained BERT-large model, fine-tune it for a specific task, and then use the fine-tuned model as the teacher.
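The following is a minimal sketch of this step, assuming the Hugging Face transformers library; the task here is binary sentiment classification, and the sentences and labels are hypothetical placeholders for a real task-specific dataset:

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained BERT-large model with a classification head on top
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
teacher = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)

# Hypothetical task-specific data (binary sentiment classification)
sentences = ["I loved the movie", "The film was a waste of time"]
labels = torch.tensor([1, 0])

# Fine-tune the pre-trained model on the task-specific data
optimizer = AdamW(teacher.parameters(), lr=2e-5)
teacher.train()
for epoch in range(3):
    inputs = tokenizer(sentences, padding=True, truncation=True,
                       return_tensors="pt")
    outputs = teacher(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The fine-tuned model now acts as the teacher; its logits will later
# serve as the soft targets for training the student network
teacher.eval()
```

In practice, the fine-tuning would be done on a full labeled dataset with proper batching, but the idea is the same: the teacher is not the raw pre-trained BERT-large but the version fine-tuned on the downstream task.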
Suppose...