Quantization – doing more with less
Quantization is a model optimization technique that reduces the numerical precision of a model's parameters, converting them from higher-precision formats, such as 32-bit floating point, to lower-precision formats, such as 8-bit integers. Its main goals are to shrink the model's size and to speed up inference, the process of making predictions with the model.
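To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization using NumPy. The function names and the single per-tensor scale are illustrative assumptions; production libraries typically use per-channel scales, zero points, and calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto the signed 8-bit range [-127, 127]."""
    # One scale for the whole tensor (a simplification for illustration).
    scale = max(np.abs(weights).max() / 127.0, 1e-12)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values for computation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
print(np.abs(weights - approx).max())  # small round-trip error
```

Each int8 weight occupies one byte instead of four, at the cost of a small, bounded rounding error per value.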
When quantizing an LLM, several key benefits and considerations come into play; we discuss them next.
Model size reduction
Model size reduction via quantization is an essential technique for adapting LLMs to environments with limited storage and memory. The process involves several key aspects:
- Bit precision: Traditional LLMs often use 32-bit floating-point numbers to represent the weights in their neural networks. Quantization reduces these to lower-precision formats, such as 16-bit, 8-bit, or even fewer bits. The reduction in bit precision shrinks the storage needed per parameter proportionally: moving from 32-bit to 8-bit cuts the memory footprint to roughly a quarter, as the sketch after this list illustrates.
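A rough back-of-the-envelope calculation of the storage savings, assuming a hypothetical 7-billion-parameter model and ignoring per-tensor metadata such as scales and zero points:

```python
# Approximate weight storage at different bit widths for 7B parameters.
params = 7_000_000_000
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>8}: {gib:6.1f} GiB")
# float32 ~26.1 GiB, float16 ~13.0 GiB, int8 ~6.5 GiB, int4 ~3.3 GiB
```

Savings of this magnitude are what make it feasible to fit a large model onto a single consumer GPU or an edge device.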