Stable Diffusion in latent space
Instead of processing diffusion in pixel space, Stable Diffusion uses latent space to represent an image. What is latent space? In short, latent space is the vector representation of an object. To use an analogy, before you go on a blind date, a matchmaker could provide you with your counterpart’s height, weight, age, hobbies, and so on in the form of a vector:
[height, weight, age, hobbies,...]
You can take this vector as the latent space of your blind date counterpart. A real person’s true property dimension is almost unlimited (you could write a biography for one). The latent space can be used to represent a real person with only a limited number of features, such as height, weight, and age.
In the case of the Stable Diffusion training stage, a trained encoder model, usually denoted as ℇ (E), is used to encode an input image in a latent vector representation. After the reverse diffusion process, the latent space is decoded...