Summary
In this chapter, we moved on from the original diffusion model, DDPM, and explained what Stable Diffusion is and why it is faster and produces better results than DDPM.
As the title of the paper that introduced Stable Diffusion, High-Resolution Image Synthesis with Latent Diffusion Models [6], suggests, the key feature that differentiates Stable Diffusion from its predecessor is that it works in “latent” space rather than pixel space. This chapter explained what latent space is and how Stable Diffusion training and inference work internally.
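To make the benefit of latent space concrete, the following minimal sketch compares the number of values in a pixel-space image with the number in its latent. It assumes the Stable Diffusion v1.x configuration, in which the VAE downsamples each spatial dimension by a factor of 8 and produces 4 latent channels; the helper names latent_shape and num_values are hypothetical and used only for illustration.

```python
def latent_shape(height, width, latent_channels=4, vae_scale_factor=8):
    """Return the (channels, height, width) shape of the latent for an image.

    The defaults (4 channels, factor 8) are assumptions that match
    Stable Diffusion v1.x; other models may use different values.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be divisible by the VAE scale factor")
    return (latent_channels, height // vae_scale_factor, width // vae_scale_factor)


def num_values(shape):
    """Count the number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)          # RGB image in pixel space
lat_shape = latent_shape(512, 512)   # the same image in latent space
ratio = num_values(pixel_shape) / num_values(lat_shape)

print(lat_shape, ratio)  # (4, 64, 64) 48.0
```

Under these assumptions, the denoising network processes 48 times fewer values per step than a pixel-space model would at the same resolution, which is a large part of why inference is faster.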
For a comprehensive understanding, we built each component step by step: encoding the initial image into latent data; converting input prompts to token IDs and embedding them as text embeddings with the CLIP text model; using the Stable Diffusion scheduler to sample the detailed steps for inference; creating the initial noise latent; concatenating the initial noise latent with the initial image latent; and putting all the components together to build a custom text-to-image Stable...