Text-to-image synthesis is the task of generating an image that satisfies the specifications described in a text sentence. It can be interpreted as a translation problem in which the source and target domains differ.
In this approach, text-to-image synthesis is tackled by solving two sub-problems: first, learning a text representation that encodes the visual specifications described in the text; and second, learning a model that uses that representation to synthesize images satisfying those specifications.
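To make the two-part structure concrete, the sketch below shows one plausible way the pieces could fit together, assuming a recurrent text encoder and a simple noise-plus-embedding conditional generator; the module names, layer sizes, and image resolution are illustrative assumptions, not the specific model described here.

```python
# Illustrative sketch (assumed architecture): a text encoder maps a sentence to an
# embedding, and a conditional generator turns (noise, text embedding) into an image.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a tokenized sentence into a fixed-size visual-specification vector."""
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)               # (batch, hidden_dim)

class ConditionalGenerator(nn.Module):
    """Maps (noise, text embedding) to a 64x64 RGB image."""
    def __init__(self, noise_dim=100, text_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 64 * 64 * 3),
            nn.Tanh(),                         # pixel values in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        z = torch.cat([noise, text_embedding], dim=1)
        return self.net(z).view(-1, 3, 64, 64)

# Usage: encode a (placeholder) tokenized caption, then condition generation on it.
encoder, generator = TextEncoder(), ConditionalGenerator()
tokens = torch.randint(0, 5000, (1, 12))       # hypothetical token ids for a caption
image = generator(torch.randn(1, 100), encoder(tokens))
print(image.shape)                             # torch.Size([1, 3, 64, 64])
```

In this framing, the encoder addresses the first sub-problem (capturing visual specifications) and the generator addresses the second (synthesizing images conditioned on that representation).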
For example, consider this text description: "the petals on this flower are white with a yellow center."
Although broad and leaving many aspects of the target flower unspecified, this description provides a few hard specifications about the flower...