Generating images from text descriptions can be treated as a Conditional GAN (CGAN) process in which the embedding vector of the description sentence serves as the additional label information. Luckily for us, we already know how to use CGAN models to generate convincing images. Now, we need to figure out how to generate large images with a CGAN.
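To make the idea concrete, here is a minimal sketch (not the exact architecture we will build later) of a generator conditioned on a sentence embedding: the embedding is simply concatenated with the noise vector before being upsampled into an image. The dimensions (noise_dim=100, embed_dim=128, a 64x64 output) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    def __init__(self, noise_dim=100, embed_dim=128, base_channels=64):
        super().__init__()
        in_dim = noise_dim + embed_dim
        self.net = nn.Sequential(
            # project the concatenated (noise, text embedding) vector to a 4x4 feature map
            nn.ConvTranspose2d(in_dim, base_channels * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base_channels * 8), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 4), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 2), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1, bias=False),
            nn.Tanh(),  # 64x64 RGB image in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # condition the generator by concatenating the text embedding with the noise
        x = torch.cat([noise, text_embedding], dim=1)
        return self.net(x.unsqueeze(-1).unsqueeze(-1))


# usage: a batch of 4 images conditioned on 4 sentence embeddings
g = TextConditionedGenerator()
z = torch.randn(4, 100)
emb = torch.randn(4, 128)   # stands in for real sentence embeddings
fake = g(z, emb)            # shape: (4, 3, 64, 64)
```

The only difference from the CGAN generators we have seen before is that the condition is a continuous sentence embedding rather than a one-hot class label.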
Do you remember how we used two generators and two discriminators to fill in the missing regions of images (image inpainting) in Chapter 7, Image Restoration with GANs? It's also possible to stack two CGANs together so that we can obtain high-quality, high-resolution images. This is exactly what StackGAN does.
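The following is a simplified sketch (with assumed layer sizes, not StackGAN's exact design) of how a second-stage generator can refine a low-resolution conditional sample: it re-encodes the 64x64 first-stage image, fuses the text embedding with the encoded features, and upsamples to a larger 128x128 output.

```python
import torch
import torch.nn as nn

class StageTwoGenerator(nn.Module):
    def __init__(self, embed_dim=128, base_channels=64):
        super().__init__()
        # encode the 64x64 Stage-I image down to a 16x16 feature map
        self.encode = nn.Sequential(
            nn.Conv2d(3, base_channels, 4, 2, 1), nn.ReLU(True),                  # 64 -> 32
            nn.Conv2d(base_channels, base_channels * 2, 4, 2, 1), nn.ReLU(True),  # 32 -> 16
        )
        # decode the fused (image features + text embedding) back up to 128x128
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(base_channels * 2 + embed_dim, base_channels * 2, 4, 2, 1),
            nn.ReLU(True),                                                         # 16 -> 32
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1),
            nn.ReLU(True),                                                         # 32 -> 64
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1),
            nn.Tanh(),                                                             # 64 -> 128
        )

    def forward(self, low_res_image, text_embedding):
        features = self.encode(low_res_image)                      # (N, C, 16, 16)
        # replicate the text embedding over the spatial grid so it can be
        # concatenated with the image features channel-wise
        emb = text_embedding.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, 16, 16)
        return self.decode(torch.cat([features, emb], dim=1))      # (N, 3, 128, 128)
```

Note how the text embedding is injected again in the second stage, so the refinement network can correct details that the first stage missed while staying faithful to the description. We will look at how StackGAN itself organizes these two stages next.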