DALL-E 2 and DALL-E 3
DALL-E, like CLIP, is a multimodal model, but the two handle their inputs differently: CLIP processes text-image pairs, whereas DALL-E 1 takes a single input stream of 1,280 tokens, 256 for the text and 1,024 for the image.
DALL-E was named after Salvador Dalí and Pixar's WALL-E. To use DALL-E, you enter a text prompt, and the model produces an image. But first, DALL-E must learn how to generate images from text.
This transformer generates images from text descriptions using a dataset of text-image pairs.
We will go through the basic architecture of DALL-E to see how the model works.
The basic architecture of DALL-E
Unlike CLIP, DALL-E concatenates up to 256 BPE-encoded text tokens with 32×32 = 1,024 image tokens, as shown in Figure 16.11:
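The single-stream input described above can be sketched in a few lines of Python. This is a minimal illustration with dummy token IDs, not DALL-E's actual implementation: the real model uses BPE-encoded text tokens and image tokens drawn from a discrete VAE codebook, and the function and variable names here are hypothetical.

```python
MAX_TEXT_TOKENS = 256           # up to 256 BPE-encoded text tokens
IMAGE_GRID = 32                 # the image is encoded as a 32x32 grid
IMAGE_TOKENS = IMAGE_GRID ** 2  # 32 x 32 = 1,024 image tokens

def build_input_stream(text_tokens, image_tokens, pad_id=0):
    """Concatenate text and image tokens into one 1,280-token stream."""
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("at most 256 text tokens are allowed")
    if len(image_tokens) != IMAGE_TOKENS:
        raise ValueError("expected exactly 1,024 image tokens")
    # Pad the text up to 256 tokens, then append the 1,024 image tokens.
    padded_text = text_tokens + [pad_id] * (MAX_TEXT_TOKENS - len(text_tokens))
    return padded_text + image_tokens

text = [101, 2023, 4937, 102]      # hypothetical BPE IDs for a short caption
image = list(range(IMAGE_TOKENS))  # hypothetical image codebook indices
stream = build_input_stream(text, image)
print(len(stream))  # 1280
```

The key point the sketch captures is that text and image tokens live in one sequence, so a single transformer can model them jointly.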
Figure 16.11: DALL-E concatenates text and image input
Figure 16.11 shows that, this time, our cat image is concatenated with...