Understanding text-to-image generation using diffusion
Recall Figure 10.8, where we demonstrated the training process of the UNet model for generating images using diffusion. We trained the UNet model to predict the noise present in an input noisy image. To enable text-to-image generation, we need to add text as an additional input to this UNet model, as shown in Figure 10.18 (in contrast to Figure 10.8):
Figure 10.18: UNet trained on both an input (noisy) image as well as text to predict the noise within the noisy image
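Conceptually, the training objective is the same as in Figure 10.8; the only difference is that the noise-prediction network now also receives a representation of the text. The following is a minimal sketch of one such training step in PyTorch, assuming a hypothetical cond_unet module that accepts a noisy image, a timestep, and a text embedding (text_emb); it is an illustration of the idea rather than the exact code used in this chapter:

import torch
import torch.nn.functional as F

def training_step(cond_unet, x0, text_emb, num_timesteps, alphas_cumprod):
    """x0: clean images (B, C, H, W); text_emb: text embeddings (B, L, D)."""
    B = x0.shape[0]
    # Sample a random timestep and the noise to be added
    t = torch.randint(0, num_timesteps, (B,), device=x0.device)
    noise = torch.randn_like(x0)
    # Create the noisy image x_t from x0 using the cumulative product of alphas
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # The UNet now predicts the noise conditioned on the text embedding as well
    pred_noise = cond_unet(x_t, t, text_emb)
    # Same noise-prediction (MSE) objective as in Figure 10.8
    return F.mse_loss(pred_noise, noise)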
Such a UNet model is called a conditional UNet model [11], or a text-conditional UNet model to be precise, as this model generates an image conditioned on the input text. So, how do we train such a model?
There are two parts to the answer. First, we need to encode the input text into an embedding vector that can be ingested by the UNet model. Then, we need to modify the UNet model slightly to accommodate the extra incoming data (besides the noisy image).
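For the first part, a pretrained text encoder is typically used to turn the prompt into a sequence of embedding vectors; Stable Diffusion, for example, uses a CLIP text encoder. The snippet below is a sketch of this step using the Hugging Face transformers library; the specific checkpoint (openai/clip-vit-base-patch32) is only an illustrative choice and not necessarily the one used later in this book:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Load a pretrained CLIP text encoder (illustrative checkpoint)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

prompt = ["a cat wearing a spacesuit"]
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state has shape (batch, sequence_length, hidden_dim);
    # this per-token embedding sequence is what the conditional UNet ingests
    text_emb = text_encoder(**tokens).last_hidden_state

The second part of the answer concerns how the UNet consumes these embeddings, which is commonly done through cross-attention between the image feature maps and the text tokens.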