Implementing a text-to-image Stable Diffusion inference pipeline
So far, we have the text encoder, the image VAE, and the denoising UNet initialized and loaded into CUDA VRAM. The following steps chain them together to form the simplest working Stable Diffusion text-to-image pipeline:
- Initialize a latent noise: In Figure 5.2, the starting point of inference is a randomly initialized Gaussian latent noise tensor. We can create one with the following code:
# prepare the initial noise latents
import torch

shape = torch.Size([1, 4, 64, 64])
device = "cuda"
noise_tensor = torch.randn(
    shape,
    generator=None,
    dtype=torch.float16,
).to(device)
During the training stage, an initial noise sigma is used to help prevent the diffusion process from becoming stuck in local minima. When the diffusion process starts, it is very likely to be in a state that is very close to a local...
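The sigma scaling mentioned above can be sketched as follows. This is a minimal CPU sketch, not the pipeline's actual code: the `init_noise_sigma` value here is an assumed illustrative constant (real schedulers, such as those in the diffusers library, expose their own value as a property), and the scaling is simply an element-wise multiply that brings the fresh Gaussian latent up to the noise magnitude the model saw during training:

```python
import torch

# Assumed illustrative sigma; a real scheduler object would provide
# its own init_noise_sigma value.
init_noise_sigma = 14.6

# Same latent shape as in the snippet above, kept on CPU in float32
# for this sketch.
latents = torch.randn(1, 4, 64, 64)

# Scale the latent so its magnitude matches the training noise level.
latents = latents * init_noise_sigma

print(tuple(latents.shape))  # (1, 4, 64, 64)
```

Note that the scaling changes only the magnitude of the latent values, not the tensor's shape, so the rest of the pipeline can consume it unchanged.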