Introduction to efficient, light, and fast transformers
Transformer-based models have achieved state-of-the-art results on many NLP problems, but at the cost of quadratic memory and computational complexity. The main issues arising from this complexity can be highlighted as follows:
- The models cannot efficiently process long sequences because the self-attention mechanism scales quadratically with the sequence length (see the sketch after this list).
- An experimental setup on a typical GPU with 16 GB of memory can handle sequences of up to 512 tokens for training and inference; longer inputs, however, quickly run into memory problems.
- NLP models keep growing in size, from the 110 million parameters of BERT-Base to the 175 billion parameters of GPT-3 and the 540 billion parameters of PaLM. This growth raises further concerns about computational and memory complexity.
- We also need to care about cost, production, reproducibility, and sustainability. Hence, we need faster and lighter transformers, especially for deployment on edge devices.
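The quadratic term comes from the attention score matrix itself: for a sequence of length n, each head materializes an n × n matrix. The following minimal sketch (assuming PyTorch and BERT-Base-like dimensions, which are illustrative choices rather than a fixed recipe) shows how the memory needed just for these scores grows as the sequence length doubles:

```python
import torch

def attention_score_memory(seq_len, d_model=768, n_heads=12, batch_size=1):
    """Return the memory (in MB) taken by the attention score matrices alone."""
    head_dim = d_model // n_heads
    q = torch.randn(batch_size, n_heads, seq_len, head_dim)
    k = torch.randn(batch_size, n_heads, seq_len, head_dim)
    # Each head materializes a (seq_len x seq_len) score matrix.
    scores = q @ k.transpose(-2, -1)   # shape: (batch, heads, seq_len, seq_len)
    return scores.numel() * scores.element_size() / 1024**2

for n in (512, 1024, 2048, 4096):
    print(f"seq_len={n:5d} -> attention scores ~{attention_score_memory(n):7.1f} MB")
# Doubling the sequence length roughly quadruples the memory for the scores,
# before accounting for activations, gradients, and optimizer state.
```

This quadratic growth in a single intermediate tensor is why standard transformers struggle beyond a few thousand tokens, and it motivates the efficient attention variants discussed in the rest of this chapter.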