Understanding Stable Diffusion
So far, we have learned how diffusion models work. Stable Diffusion improves upon the UNet2D model by first leveraging a VAE to encode an image into a lower-dimensional representation and then training the diffusion model on this down-scaled (latent) space rather than on raw pixels. Once training is done, we use the VAE decoder to convert the denoised latent back into a high-resolution image. This way, training is faster, as the model learns features from the compact latent space rather than from the pixel values.
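To make this concrete, the following is a minimal sketch of one training step in latent space, assuming the Hugging Face diffusers library; the checkpoint ID and the dummy batch and text embeddings are assumptions for illustration only:

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

model_id = "stabilityai/stable-diffusion-2-1"  # assumed checkpoint
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

images = torch.randn(1, 3, 768, 768)   # stand-in for a real training batch
text_emb = torch.randn(1, 77, 1024)    # stand-in for CLIP text embeddings

# Encode the image into the latent space and apply the VAE scaling factor
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

# Add noise at a random timestep, as in standard diffusion training
noise = torch.randn_like(latents)
t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = scheduler.add_noise(latents, noise, t)

# The U-Net predicts the added noise from the noisy latent and text embeddings
noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
loss = F.mse_loss(noise_pred, noise)

Note that the loss is computed entirely on the small latent tensors, which is what makes training in latent space so much cheaper than training on full-resolution images.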
The architecture of Stable Diffusion is as follows:
Figure 16.17: Stable Diffusion overview
The VAE encoder is the encoder half of a variational auto-encoder: it takes an input image of shape 768x768 and compresses it to a 96x96 latent (a factor-of-8 reduction in each spatial dimension, with four feature channels rather than three RGB channels). The VAE decoder takes a 96x96 latent and upscales it back to a 768x768 image.
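A quick way to verify these shapes is to pass a dummy tensor through the pre-trained VAE; this sketch assumes diffusers and the same checkpoint ID as before:

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-1",
                                    subfolder="vae")

image = torch.randn(1, 3, 768, 768)         # dummy 768x768 RGB image
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample()
    print(latent.shape)                     # torch.Size([1, 4, 96, 96])
    recon = vae.decode(latent).sample
    print(recon.shape)                      # torch.Size([1, 3, 768, 768])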
The architecture of the pre-trained Stable Diffusion U-Net model is as follows:
Figure 16.18: Pre-trained Stable Diffusion U-Net model architecture
In the preceding diagram, the noisy input represents the latent obtained from the VAE encoder after noise has been added to it at a given timestep. The text prompt represents the conditioning text: it is converted into token embeddings by a CLIP text encoder, and these embeddings steer the U-Net's denoising through cross-attention layers.
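As a sketch of this conditioning path, assuming diffusers, transformers, and the same checkpoint ID as above, a prompt can be encoded and passed to the U-Net as follows:

import torch
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import UNet2DConditionModel

model_id = "stabilityai/stable-diffusion-2-1"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

tokens = tokenizer("a photo of a cat", padding="max_length",
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]   # shape: [1, 77, 1024]

noisy_latent = torch.randn(1, 4, 96, 96)           # stand-in for a noised latent
timestep = torch.tensor([10])
with torch.no_grad():
    noise_pred = unet(noisy_latent, timestep,
                      encoder_hidden_states=text_emb).sample
print(noise_pred.shape)                            # torch.Size([1, 4, 96, 96])

The predicted noise has the same shape as the noisy latent, which is what allows the denoising step to be repeated over many timesteps before the VAE decoder finally produces the output image.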