In this section, we will learn about A Lite version of BERT, also known as ALBERT. One of the challenges with BERT is that it consists of millions of parameters: BERT-base alone has 110 million parameters, which makes it hard to train and slow at inference time. Increasing the model size improves results, but it is constrained by the available computational resources. To combat this, ALBERT was introduced. ALBERT is a lite version of BERT with far fewer parameters. It uses the following two techniques to reduce the number of parameters:
- Cross-layer parameter sharing
- Factorized embedding layer parameterization
By using the preceding two techniques, we can reduce the training time and inference time of the BERT model. First, let's understand how these two techniques work in detail, and then we will see how ALBERT is pre-trained.
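As a quick sanity check on the parameter savings, we can compare the parameter counts of the pre-trained checkpoints directly. The following is a minimal sketch assuming the Hugging Face transformers library is installed (it is not required for the rest of this section, and the checkpoint names are just common public ones):

```python
from transformers import AlbertModel, BertModel

# Download the pre-trained checkpoints (checkpoint names are assumptions;
# any BERT-base / ALBERT-base pair would do).
bert = BertModel.from_pretrained('bert-base-uncased')
albert = AlbertModel.from_pretrained('albert-base-v2')

def count_params(model):
    # Total number of scalar parameters in the model
    return sum(p.numel() for p in model.parameters())

print(f'BERT-base:   {count_params(bert):,} parameters')    # ~110 million
print(f'ALBERT-base: {count_params(albert):,} parameters')  # ~12 million
```

The roughly 9x reduction comes almost entirely from the two techniques we are about to discuss.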
Cross-layer parameter sharing
Cross-layer parameter sharing is an interesting method for reducing the number of parameters. BERT consists of a stack of encoder layers (12 in BERT-base), and normally each layer learns its own set of parameters. With cross-layer parameter sharing, instead of learning separate parameters for every encoder layer, we learn the parameters of only the first encoder layer and share them across all the other layers, so the depth of the network no longer multiplies the parameter count.
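To make this concrete, here is a minimal PyTorch sketch (PyTorch is an assumption here; the idea is framework-agnostic) contrasting a standard encoder, where every depth has its own weights, with a shared-layer encoder that applies one layer's weights at every depth:

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Toy encoder that reuses ONE layer's weights at every depth
    (cross-layer parameter sharing)."""
    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same parameters applied at every depth
        return x

class UnsharedEncoder(nn.Module):
    """Standard BERT-style encoder: each depth has its own parameters."""
    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def count_params(model):
    return sum(p.numel() for p in model.parameters())

x = torch.randn(1, 8, 768)               # (batch, sequence, hidden)
print(SharedLayerEncoder()(x).shape)      # torch.Size([1, 8, 768])
print(count_params(SharedLayerEncoder())) # parameters of one layer only
print(count_params(UnsharedEncoder()))    # ~12x as many
```

Note that the shared encoder's parameter count stays constant no matter how many layers deep we make it; only the amount of computation grows with depth.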