Optimization solution 4 – enabling sequential CPU offload
As we discussed in Chapter 5, a pipeline is composed of several sub-models:
- A text embedding model that encodes the input prompt into embeddings
- An image latent encoder/decoder (VAE) that encodes the input guidance image and decodes the latent space back into pixel images
- A UNet that loops through the inference denoising steps
- A safety checker model that screens the generated content
The idea of sequential CPU offload is to move each sub-model to CPU RAM as soon as it finishes its task and becomes idle.
Here is an example of how it works step by step:
- Load the CLIP text model to the GPU VRAM and encode the input prompt to embeddings.
- Offload the CLIP text model to CPU RAM.
- Load the VAE model (the image-to-latent-space encoder and decoder) to the GPU VRAM and encode the start image if the current task is an image-to-image pipeline.
- Offload the VAE to the CPU RAM.
- Load the UNet to loop through the denoising steps...
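The load-run-offload cycle above can be sketched with plain PyTorch. This is a minimal illustration, not the library's implementation: the three `nn.Linear` modules are toy stand-ins for the text encoder, UNet, and VAE decoder, and `run_offloaded` is a hypothetical helper.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins for the pipeline's sub-models (illustrative sizes).
text_encoder = nn.Linear(8, 16)
unet = nn.Linear(16, 16)
vae_decoder = nn.Linear(16, 8)

def run_offloaded(module, x):
    # Load the sub-model onto the GPU only for its own step...
    module.to(device)
    out = module(x.to(device))
    # ...then offload it back to CPU RAM once it is idle.
    module.to("cpu")
    return out.cpu()

# Encode the prompt, run a few "denoising" steps, decode to pixels.
emb = run_offloaded(text_encoder, torch.randn(1, 8))
latents = emb
for _ in range(3):
    latents = run_offloaded(unet, latents)
image = run_offloaded(vae_decoder, latents)
```

In practice you do not write this loop yourself: a diffusers pipeline enables the same behavior with a single call, `pipe.enable_sequential_cpu_offload()` (which requires the accelerate package), trading inference speed for a much smaller VRAM footprint.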