What’s new in SDXL?
SDXL is still a latent diffusion model and keeps the same overall architecture as Stable Diffusion v1.5. According to the original SDXL paper [2], SDXL scales up every component, making each one wider and larger. The SDXL backbone UNet is three times larger, the SDXL base model uses two text encoders instead of one, and a separate diffusion-based refinement model is included. The overall architecture is shown in Figure 16.1:
Figure 16.1: SDXL architecture
Note that the refiner is optional; we can decide whether or not to use it. Next, let’s drill down into each component, one by one.
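To make the component list concrete, here is a minimal sketch that loads the base model (and, optionally, the refiner) and counts the parameters of each part. It assumes the Hugging Face diffusers package and the publicly hosted checkpoints stabilityai/stable-diffusion-xl-base-1.0 and stabilityai/stable-diffusion-xl-refiner-1.0; the helper names count_parameters, summarize, and load_sdxl are hypothetical and exist only for this illustration.

```python
from typing import Dict

# Attribute names under which diffusers' StableDiffusionXLPipeline exposes its parts.
SDXL_COMPONENTS = ("vae", "text_encoder", "text_encoder_2", "unet")


def count_parameters(module) -> int:
    """Total number of weights in a torch.nn.Module-like object."""
    return sum(p.numel() for p in module.parameters())


def summarize(pipe) -> Dict[str, int]:
    """Map each component present on the pipeline to its parameter count."""
    sizes = {}
    for name in SDXL_COMPONENTS:
        module = getattr(pipe, name, None)
        if module is not None:  # a pipeline may leave a component slot empty
            sizes[name] = count_parameters(module)
    return sizes


def load_sdxl(use_refiner: bool = False):
    """Load the base pipeline and, optionally, the refiner.

    Assumption: the checkpoint IDs below are reachable; calling this
    function downloads several gigabytes of weights.
    """
    import torch
    from diffusers import (
        StableDiffusionXLImg2ImgPipeline,
        StableDiffusionXLPipeline,
    )

    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
    )
    refiner = None
    if use_refiner:  # the refiner is optional, so it is loaded only on request
        refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-refiner-1.0",
            torch_dtype=torch.float16,
        )
    return base, refiner
```

Calling load_sdxl() and passing the returned base pipeline to summarize gives one entry for the VAE, one for each of the two text encoders, and one for the UNet, which matches the parts described above. Nothing is downloaded until load_sdxl is actually called.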
The VAE of SDXL
A VAE is a pair of encoder and decoder neural networks. A VAE encoder encodes an image into a latent space, and its paired decoder can decode a latent image to a pixel image. Many articles on the web tell us that a VAE is a technique used to improve the quality of images; however, this is not the whole...