Text-guided image generation
Text-guided image generation is an interesting category of generative AI. In 2021, researchers at OpenAI released a paper called Learning Transferable Visual Models From Natural Language Supervision (https://arxiv.org/abs/2103.00020), though I prefer the summary title they posted on their blog: CLIP: Connecting Text and Images. CLIP was mentioned in Chapter 8, Applying the Lessons of Deepfakes, but we’ll talk about it some more here.
CLIP
CLIP is actually a pair of neural network encoders: one is trained on images while the other is trained on text. So far, this isn’t very unusual. The real trick comes from how the two are linked. Both encoders are passed data drawn from the same example; the image encoder gets the image, the text encoder gets the image’s description, and the encodings they produce are compared to each other. This training methodology effectively trains two separate models to create the same output when given two different views of the same concept: a picture of a thing, and a sentence describing it.
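To make the idea concrete, here is a minimal sketch of that comparison step in NumPy. The "encoders" are just random linear projections standing in for real networks, and the batch data is made up; what matters is the shape of the computation. Each encoder maps its input into a shared embedding space, the embeddings are L2-normalized, and a similarity matrix is scored with a symmetric cross-entropy so that each image is pushed toward its own caption (the diagonal) and away from everyone else's (a CLIP-style contrastive loss, simplified and without the learned temperature):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy 'encoder': a linear projection followed by L2 normalization."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Hypothetical batch of 4 image/caption pairs: 64-dim 'image' features and
# 32-dim 'text' features, both projected into a shared 16-dim space.
images = rng.normal(size=(4, 64))
texts = rng.normal(size=(4, 32))
W_img = rng.normal(size=(64, 16))
W_txt = rng.normal(size=(32, 16))

img_emb = encode(images, W_img)
txt_emb = encode(texts, W_txt)

# Cosine-similarity matrix: entry (i, j) compares image i with caption j.
# The matching pairs sit on the diagonal.
logits = img_emb @ txt_emb.T

def diagonal_cross_entropy(logits, axis):
    """Cross-entropy where the 'correct' class for row/column i is i itself."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Symmetric loss: score images against captions, and captions against images.
loss = 0.5 * (diagonal_cross_entropy(logits, axis=1)
              + diagonal_cross_entropy(logits, axis=0))
print(logits.shape, float(loss))
```

In real training, gradients from this loss update both encoders at once, which is what gradually pulls the two models toward producing the same embedding for an image and its description.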