Generating images with stable diffusion
In this section, we’ll introduce stable diffusion (SD, presented in High-Resolution Image Synthesis with Latent Diffusion Models, https://arxiv.org/abs/2112.10752; implementation at https://github.com/Stability-AI/stablediffusion). SD is a generative model that can synthesize images conditioned on text prompts or other types of input (in this section, we’ll focus on the text-to-image scenario). To understand how it works, let’s start with the following figure:
Figure 9.5 – Stable diffusion model and training. Inspired by https://arxiv.org/abs/2112.10752
SD combines an autoencoder (AE; the Pixel space section of Figure 9.5), a denoising diffusion probabilistic model (DDPM, or simply DM; the Latent distribution space section of Figure 9.5 and Chapter 5), and transformers (the Conditioning section of Figure 9.5). Before we dive into each of these components, let’s outline their role in the training and inference pipelines...
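To make the pipeline concrete before we break it down, here is a minimal text-to-image sketch using the Hugging Face diffusers library. This is an assumption on our part rather than the chapter's own code, and the model ID and prompt are purely illustrative:

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained SD pipeline (the model ID is illustrative;
# the weights are downloaded on first use)
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Text-to-image: the prompt conditions the latent denoising process
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")

# The three components of Figure 9.5 are exposed as pipeline attributes
print(type(pipe.vae).__name__)           # AutoencoderKL: pixel-space AE
print(type(pipe.unet).__name__)          # UNet2DConditionModel: latent-space DM
print(type(pipe.text_encoder).__name__)  # CLIPTextModel: conditioning transformer

Note how the pipeline maps directly onto Figure 9.5: pipe.vae implements the pixel-space AE, pipe.unet is the denoising model operating in latent space, and pipe.text_encoder is the transformer that produces the conditioning embeddings from the prompt.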