Understanding the quantization concept
Quantization is a technique that reduces a model's size and thereby improves its efficiency. It does so by representing the model's parameters (and, in some schemes, its activations) with lower-precision numeric types. This technique is helpful when building models for mobile or edge deployment, where compute resources or power supply are constrained. Since our aim is to make the model run as efficiently as possible, we accept that the model becomes smaller and therefore less precise than the original. In other words, we transform the model into a lighter version of itself, and the transformed model is an approximation of the original.
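To make the approximation concrete, consider the affine mapping commonly used to quantize floating-point values to 8-bit integers. The following is a minimal NumPy sketch; the `quantize` and `dequantize` helpers and the chosen `scale` and `zero_point` values are illustrative, not from any particular library. Round-tripping a value through the 8-bit representation recovers only an approximation of the original:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Map float values to int8 with an affine transform."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.13, -1.27, 3.71], dtype=np.float32)
q = quantize(x, scale=0.05, zero_point=0)
x_approx = dequantize(q, scale=0.05, zero_point=0)
# x_approx is [0.15, -1.25, 3.7]: close to x, but not exact.
```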
Quantization may be applied to a model that has already been trained; this is known as post-training quantization. Within this type of quantization, there are three approaches, illustrated by the converter sketch after this list:
- Reduced float quantization: Convert `float 32 bits` ops to `float 16` ops.
- Hybrid quantization: Convert weights to `8 bits`, while keeping biases and activations as `32 bits` ops.
- Integer quantization: Convert all ops, including weights and activations, to `8 bits` integer ops.
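As a concrete illustration of these three approaches, here is a minimal sketch assuming the TensorFlow Lite post-training quantization API (the section's wording suggests it, but naming the toolkit is our assumption). The `saved_model_dir` path and the `representative_data_gen` calibration function are hypothetical placeholders you would supply:

```python
import tensorflow as tf

# 1. Reduced float quantization: float32 ops become float16 ops.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
float16_model = converter.convert()

# 2. Hybrid (dynamic range) quantization: weights stored as 8-bit
#    integers, while biases and activations stay in float32.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
hybrid_model = converter.convert()

# 3. Full integer quantization: all ops run in 8-bit integers. A
#    representative dataset (representative_data_gen, hypothetical) lets
#    the converter calibrate the ranges of activations.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
int8_model = converter.convert()
```

Each step down in precision trades a little accuracy for a smaller, faster model; full integer quantization shrinks the model the most but is the only approach that requires calibration data.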