What are text-to-image models?

Text-to-image models are a powerful type of generative AI that creates realistic images from textual descriptions. They have diverse use cases in creative industries and design for generating advertisements, product prototypes, fashion images, and visual effects. The main applications are:

  • Text-conditioned image generation: Creating original images from text prompts like “a painting of a cat in a field of flowers.” This is used for art, design, prototyping, and visual effects.
  • Image inpainting: Filling in missing or corrupted parts of an image based on the surrounding context. This can restore damaged images (denoising, dehazing, and deblurring) or edit out unwanted elements.
  • Image-to-image translation: Converting input images to a different style or domain specified through text, like “make this photo look like a Monet painting.”
  • Image recognition: Large foundation models can also be used to recognize images, covering tasks such as scene classification and object detection, for example, detecting faces.

Models like Midjourney, DALL-E 2, and Stable Diffusion provide creative and realistic images derived from textual input or other images. These models work by training deep neural networks on large datasets of image-text pairs. The key technique is the diffusion model, which starts with random noise and gradually refines it into an image through repeated denoising steps.

Popular models like Stable Diffusion and DALL-E 2 use a text encoder to map input text into an embedding space. This text embedding is fed into a series of conditional diffusion models, which denoise and refine a latent image in successive stages. The final model output is a high-resolution image aligned with the textual description.

Two main classes of models are used: Generative Adversarial Networks (GANs) and diffusion models. GAN models like StyleGAN or GANPaint Studio can produce highly realistic images, but training is unstable and computationally expensive. They consist of two networks that are pitted against each other in a game-like setting – the generator, which generates new images from text embeddings and noise, and the discriminator, which estimates the probability of the new data being real. As these two networks compete, GANs get better at their task, generating realistic images and other types of data.
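
To make the adversarial setup concrete, here is a minimal, hypothetical PyTorch sketch of a single GAN training step. The layer sizes, latent_dim, and the fully connected networks are illustrative assumptions, not the architecture of StyleGAN or GANPaint Studio:

import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28  # illustrative sizes, not a real model

# Generator: maps random noise (optionally combined with a text embedding) to an image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)

# Discriminator: estimates the probability that an image is real
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_images = torch.rand(16, image_dim) * 2 - 1  # stand-in batch, scaled to the Tanh range
noise = torch.randn(16, latent_dim)
fake_images = generator(noise)

# Discriminator step: push real images towards the label 1 and fakes towards 0
d_loss = bce(discriminator(real_images), torch.ones(16, 1)) + \
         bce(discriminator(fake_images.detach()), torch.zeros(16, 1))
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Generator step: try to make the discriminator label the fakes as real
g_loss = bce(discriminator(fake_images), torch.ones(16, 1))
g_opt.zero_grad(); g_loss.backward(); g_opt.step()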

The setup for training GANs is illustrated in this diagram (taken from A Survey on Text Generation Using Generative Adversarial Networks, G. de Rosa and J. P. Papa, 2022; https://arxiv.org/pdf/2212.11119.pdf):

Figure 1.7: GAN training

Diffusion models have become popular and promising for a wide range of generative tasks, including text-to-image synthesis. These models offer advantages over previous approaches, such as GANs, by reducing computation costs and sequential error accumulation. Diffusion models operate through a process similar to diffusion in physics: in the forward diffusion process, noise is added to an image step by step until it is indistinguishable from random noise. This process is analogous to an ink drop falling into a glass of water and gradually diffusing.
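
The forward process can be sketched in a few lines of PyTorch. The linear noise schedule, number of steps, and tensor shapes below are illustrative assumptions rather than the settings of any particular model:

import torch

T = 1000  # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0): blend the clean image with Gaussian noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)    # a stand-in "clean" image
x_mid = add_noise(x0, T // 2)    # partially noised
x_end = add_noise(x0, T - 1)     # close to pure Gaussian noise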

What makes diffusion-based image generation work is the reverse diffusion process, in which the model learns to recover the original image from a noisy, meaningless input. By iteratively applying noise-removal transformations, the model produces an image that aligns with the given text input. An example of this is the Imagen text-to-image model (Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Google Research, May 2022), which incorporates frozen text embeddings from LLMs pre-trained on text-only corpora. A text encoder first maps the input text to a sequence of embeddings, and a cascade of conditional diffusion models then takes these embeddings as input and generates images at progressively higher resolutions.

The denoising process is demonstrated in this plot (source: user Benlisquare via Wikimedia Commons):

Figure 1.8: European-style castle in Japan, created using the Stable Diffusion V1-5 AI diffusion model

In the preceding figure, only some of the steps within the 40-step generation process are shown. You can follow the image generation step by step: the U-Net, driven by the Denoising Diffusion Implicit Model (DDIM) sampling method, repeatedly removes Gaussian noise, and the denoised output is then decoded into pixel space.

With diffusion models, you can get a wide variety of outcomes from only minimal changes to the model’s initial settings or – as in this case – to the numerical solvers and samplers. Although they sometimes produce striking results, this instability and inconsistency remain significant obstacles to applying these models more broadly.

Stable Diffusion was developed by the CompVis group at LMU Munich (High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al., 2022). The Stable Diffusion model significantly cuts training costs and sampling time compared to previous (pixel-based) diffusion models. The model can be run on consumer hardware equipped with a modest GPU (for example, a GeForce 40 series card). By creating high-fidelity images from text on consumer GPUs, Stable Diffusion democratizes access. Further, the model’s source code and even the weights have been released under the CreativeML OpenRAIL-M license, which permits reuse, distribution, commercialization, and adaptation.

Significantly, Stable Diffusion introduced operations in latent (lower-dimensional) space representations, which capture the essential properties of an image, in order to improve computational efficiency. A VAE provides latent space compression (called perceptual compression in the paper), while a U-Net performs iterative denoising.

Stable Diffusion generates images from text prompts through several clear steps (a code sketch follows the list):

  1. It starts by producing a random tensor (a random image) in the latent space, which serves as the noise for our initial image.
  2. A noise predictor (U-Net) takes in both the latent noisy image and the provided text prompt and predicts the noise.
  3. The model then subtracts the latent noise from the latent image.
  4. Steps 2 and 3 are repeated for a set number of sampling steps, for instance, 40 times, as shown in the plot.
  5. Finally, the decoder component of the VAE transforms the latent image back into pixel space, providing the final output image.
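
In practice, all five steps are wrapped by libraries such as Hugging Face’s diffusers. The following is a minimal sketch, assuming the diffusers and torch packages are installed and a CUDA GPU is available; the model ID and parameter values are just one common choice:

import torch
from diffusers import StableDiffusionPipeline

# Load pre-trained Stable Diffusion v1.5 weights (downloaded on first use)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Steps 1-5 (latent noise, U-Net noise prediction, iterative denoising,
# and VAE decoding) all happen inside this single call
image = pipe(
    "European-style castle in Japan, highly detailed",
    num_inference_steps=40,   # number of sampling steps (step 4)
    guidance_scale=7.5,       # how strongly to follow the text prompt
).images[0]

image.save("castle.png")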

A variational autoencoder (VAE) is a model that encodes data into a learned, smaller representation (encoding). These representations can then be used to generate new data similar to the training data (decoding). In latent diffusion, the VAE is trained first, before the diffusion model.
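
A highly simplified, hypothetical sketch of a VAE’s encode/decode round trip; the layer sizes and latent dimension are illustrative and are not those of Stable Diffusion’s actual autoencoder:

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Toy VAE: compress an input to a small latent code and reconstruct it."""
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)      # mean of the latent distribution
        self.to_logvar = nn.Linear(128, latent_dim)  # log-variance of the latent distribution
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

vae = TinyVAE()
x = torch.rand(8, 784)               # stand-in batch of flattened images
reconstruction, mu, logvar = vae(x)  # encode to the latent space, then decode back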

A U-Net is a popular type of convolutional neural network (CNN) that has a symmetric encoder-decoder structure. It is commonly used for image segmentation tasks, but in the context of Stable Diffusion, it can help to introduce and remove noise in an image. The U-Net takes a noisy image (seed) as input and processes it through a series of convolutional layers to extract features and learn semantic representations.

These convolutional layers, typically organized in a contracting path, reduce the spatial dimensions while increasing the number of channels. Once the contracting path reaches the bottleneck of the U-Net, the network expands again through a symmetric expanding path. In the expanding path, transposed convolutions (often called deconvolutions) are applied to progressively upsample the spatial dimensions while reducing the number of channels.
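
A minimal sketch of this contracting/expanding structure in PyTorch; the channel counts are illustrative, and the timestep conditioning and cross-attention blocks of the real Stable Diffusion U-Net are omitted:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: contract spatially, expand back up, with one skip connection."""
    def __init__(self, channels=4):
        super().__init__()
        # Contracting path: halve the spatial size, increase the channels
        self.down1 = nn.Sequential(nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Bottleneck
        self.mid = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Expanding path: a transposed convolution upsamples back to the input size
        self.up1 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        self.out = nn.Conv2d(64, channels, 3, padding=1)  # 64 = 32 (upsampled) + 32 (skip)

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(d1)
        m = self.mid(d2)
        u1 = self.up1(m)
        return self.out(torch.cat([u1, d1], dim=1))  # skip connection from the contracting path

unet = TinyUNet()
noisy_latent = torch.randn(1, 4, 64, 64)  # stand-in noisy latent image
predicted_noise = unet(noisy_latent)      # same shape as the input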

When training the image generation model in the latent space itself (latent diffusion model), a loss function is used to evaluate the quality of the generated images. One commonly used loss function is the Mean Squared Error (MSE) loss, which quantifies the difference between the generated image and the target image. The model is optimized to minimize this loss, encouraging it to generate images that closely resemble the desired output.
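
A hedged sketch of a single training step in this spirit: a common formulation has the noise predictor estimate the noise that was added to a latent, minimizing the MSE between the predicted and actual noise. The placeholder convolution below stands in for the U-Net, and the real training also conditions on the timestep and the text embedding:

import torch
import torch.nn as nn

# Toy stand-ins: a batch of latents, a noise schedule, and a noise-prediction network
latents = torch.randn(8, 4, 32, 32)
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
noise_predictor = nn.Conv2d(4, 4, 3, padding=1)   # placeholder for the U-Net
optimizer = torch.optim.AdamW(noise_predictor.parameters(), lr=1e-4)

# One training step
t = torch.randint(0, 1000, (1,)).item()           # random timestep
noise = torch.randn_like(latents)
a_bar = alphas_cumprod[t]
noisy_latents = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise  # forward diffusion

predicted_noise = noise_predictor(noisy_latents)
loss = nn.functional.mse_loss(predicted_noise, noise)  # MSE between predicted and true noise

optimizer.zero_grad()
loss.backward()
optimizer.step()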

This training was performed on the LAION-5B dataset, derived from Common Crawl data, comprising billions of image-text pairs from sources such as Pinterest, WordPress, Blogspot, Flickr, and DeviantArt.

The following images illustrate text-to-image generation from a text prompt with diffusion (source: Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022; https://arxiv.org/abs/2204.06125):

Figure 1.9: Image generation from text prompts

Overall, image generation models such as Stable Diffusion and Midjourney process textual prompts into generated images, leveraging the concept of forward and reverse diffusion processes and operating in a lower-dimensional latent space for efficiency. But what about the conditioning for the model in the text-to-image use case?

The conditioning process allows these models to be steered by specific inputs, such as textual prompts or additional input types like depth maps or outlines, for greater precision in creating relevant images. In the text-to-image case, the prompt is converted into embeddings by a text transformer; these embeddings are then fed to the noise predictor, steering it to produce an image that aligns with the text prompt.
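
To see what these conditioning embeddings look like, here is a small sketch using the CLIP text encoder that Stable Diffusion v1.x builds on, assuming the transformers package is installed:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# The text encoder used by Stable Diffusion v1.x
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a painting of a cat in a field of flowers"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    # One 768-dimensional embedding per token; shape: (1, 77, 768)
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # torch.Size([1, 77, 768])

It is these per-token embeddings that the noise predictor attends to at every denoising step.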

It’s beyond the scope of this book to provide a comprehensive survey of generative AI models for all modalities. However, let’s get a brief overview of what models can do in various domains.
