Optimization
Optimizing a transformer means making the model lightweight, responsive, and energy-efficient. Let’s look at the most common ideas adopted to optimize a model.
Quantization
The key idea behind quantization is to approximate the weights of a network with lower numerical precision. The idea is very simple, but it works quite well in practice. If you are interested in knowing more, we recommend the paper A Survey of Quantization Methods for Efficient Neural Network Inference, by Amir Gholami et al., https://arxiv.org/pdf/2103.13630.pdf.
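To make the idea concrete, here is a minimal sketch of symmetric linear quantization to int8 using NumPy; the function names and the single per-tensor scale are illustrative assumptions, not a production scheme:

```python
import numpy as np

def quantize_int8(weights):
    # Map float32 weights into the int8 range [-127, 127]
    # using a single symmetric scale for the whole tensor.
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each weight is off by at most half a quantization step.
max_error = np.max(np.abs(w - w_hat))
```

The storage drops from 32 bits to 8 bits per weight, at the cost of a small, bounded rounding error. Real frameworks refine this with per-channel scales, zero points for asymmetric ranges, and quantization-aware training.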
Weight pruning
The key idea behind weight pruning is to remove some connections in the network. Magnitude-based weight pruning gradually zeroes out the smallest model weights during training to increase model sparsity. This simple technique has benefits both in terms of model size and in cost of serving. Sparse...
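A single pruning step can be sketched as follows; this is an illustrative NumPy snippet (the function name and the quantile-based threshold are assumptions), whereas in practice frameworks apply such a mask repeatedly over a training schedule:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    # Zero out the fraction `sparsity` of weights with the
    # smallest absolute values, keeping the rest unchanged.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
pruned, mask = magnitude_prune(w, sparsity=0.75)
# About 75% of the entries are now exactly zero, and the
# surviving entries keep their original values.
```

The resulting sparse tensor compresses well and, with suitable sparse kernels or hardware support, can also speed up inference.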