Easier quantization using bitsandbytes
Although quantization yields smaller models at the cost of some accuracy, it is also important to use GPU-friendly functions to get the most out of it. One very useful library that implements custom NVIDIA CUDA functions for this specific use case (8-bit quantization) is bitsandbytes.
Hugging Face’s Transformers library also integrates this functionality for easier use. All you need to do is first change your runtime to GPU and then install bitsandbytes and accelerate:
pip install bitsandbytes
pip install accelerate
And then you can easily load any model you want using 8-bit quantization:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    load_in_8bit=True
)
As you can see, using quantization with Transformers is very easy; simply setting load_in_8bit to True will do the job.
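To illustrate that the quantized model behaves like any other Transformers model, here is a minimal sketch that loads the matching tokenizer and generates a short completion. The prompt text, the device_map='auto' setting, and the generation parameters are illustrative assumptions, not taken from the original.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the 8-bit quantized model (assumed to be the same checkpoint as above)
tokenizer = AutoTokenizer.from_pretrained('decapoda-research/llama-7b-hf')
model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    load_in_8bit=True,
    device_map='auto'  # lets accelerate place the quantized weights on the GPU
)

# Run a short generation to confirm the quantized model works end to end
inputs = tokenizer('Quantization lets large models', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))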
This is...