The pre-trained BERT model has a large number of parameters and a high inference time, which makes it hard to deploy on edge devices such as mobile phones. To solve this issue, we use DistilBERT, which was introduced by researchers at Hugging Face. DistilBERT is a smaller, faster, cheaper, and lighter version of BERT.
As its name suggests, DistilBERT is trained with knowledge distillation. The core idea is that we take a large pre-trained BERT model and transfer its knowledge to a small BERT model through knowledge distillation. The large pre-trained BERT is called the teacher BERT, and the small BERT is called the student BERT.
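To make the idea of transferring knowledge concrete, the following is a minimal sketch of a soft-target distillation loss in PyTorch, not the exact DistilBERT training code: the student is trained to match the teacher's temperature-softened output distribution. The function name, temperature value, and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: push the student's output distribution toward the
    teacher's temperature-softened distribution using KL divergence."""
    # Soften both distributions with the temperature T
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction='batchmean') * (temperature ** 2)

# Illustrative usage with random logits over BERT's 30,522-token vocabulary;
# in real training, the teacher's logits would be detached from the graph
teacher_logits = torch.randn(8, 30522)
student_logits = torch.randn(8, 30522)
loss = distillation_loss(student_logits, teacher_logits)
print(loss.item())
```

In practice, this soft-target loss is combined with the usual hard-label training loss, so the student learns both from the ground-truth labels and from the teacher's richer output distribution.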
Since the small BERT (the student BERT) acquires its knowledge from the large pre-trained BERT (the teacher BERT) through distillation, we call our small BERT DistilBERT. DistilBERT is about 60% faster at inference and 40% smaller in size than BERT-base. Now that we have a basic idea of DistilBERT, let's get...
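As a quick sanity check of the size claim, the sketch below loads the standard Hugging Face checkpoints (bert-base-uncased and distilbert-base-uncased) and compares their parameter counts; the exact numbers printed will depend on the checkpoint versions.

```python
from transformers import BertModel, DistilBertModel

# Load BERT-base (the teacher) and DistilBERT (the student)
bert = BertModel.from_pretrained('bert-base-uncased')
distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')

def num_parameters(model):
    # Total number of trainable and non-trainable parameters
    return sum(p.numel() for p in model.parameters())

print(f"BERT-base parameters:  {num_parameters(bert):,}")        # roughly 110M
print(f"DistilBERT parameters: {num_parameters(distilbert):,}")  # roughly 66M
```

The student has roughly 60% of the teacher's parameters, which is where the "40% smaller" figure comes from.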