You're reading from Generative AI Application Integration Patterns Integrate large language models into your applications

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781835887608

Length 218 pages

Edition 1st Edition

Languages

Python

Tools

TensorFlow

Concepts

Artificial Intelligence

Authors (2):

Juan Pablo Bustos

Luis Lopez Soria

View More author details

Table of Contents (13) Chapters

Preface

1. Introduction to Generative AI Patterns FREE CHAPTER

2. Identifying Generative AI Use Cases

3. Designing Patterns for Interacting with Generative AI

4. Generative AI Batch and Real-Time Integration Patterns

5. Integration Pattern: Batch Metadata Extraction

6. Integration Pattern: Batch Summarization

7. Integration Pattern: Real-Time Intent Classification

8. Integration Pattern: Real-Time Retrieval Augmented Generation

9. Operationalizing Generative AI Integration Patterns

10. Embedding Responsible AI into Your GenAI Applications

11. Other Books You May Enjoy

12. Index

General generative AI concepts

When integrating generative AI into practical applications, it is important to have an understanding of concepts such as model architecture and training. In this section, we cover an overview of prominent concepts, including transformers, diffusion models, pre-training, and prompt engineering, that enable systems to generate impressively accurate text, images, audio, and more.

Understanding these core concepts will equip you to make informed decisions when selecting foundations for your use cases. However, putting models into production requires further architectural considerations. We will be highlighting these decision points in the rest of the chapters in the book and in practical examples.

Generative AI model architectures

Generative AI models are based on specialized neural network architectures optimized for generative tasks. The two more widely known models are transformers and diffusion models.

Transformer models are not a new concept. They were first introduced by Google in a 2017 paper called Attention Is All You Need (https://arxiv.org/pdf/1706.03762.pdf). The paper explains the Transformer neural network architecture, which is entirely based on attention mechanisms using the encoder and decoder concepts. This architecture enables models to identify relationships across an input text. By having these relationships, the model predicts the next token, leveraging its previous prediction as an input, creating this recursive loop to generate new content.

Diffusion models have drawn considerable interest as generative models due to their foundation in the physical processes of non-equilibrium thermodynamics. In physics, diffusion refers to the motion of particles from areas of high concentration to low concentration over time. Diffusion models try to mimic this concept in their training process. These models are trained through two phases: the forward diffusion process adds “noise” to the original training data, followed by a reverse conditioning process, which then learns how to remove noise in the reverse diffusion process. By learning this process, these models can produce samples by starting from pure noise and letting the reverse diffusion model clear away unnecessary “noise” and preserving the desired “generated” content.

Other types of deep learning architectures, such as Generative Adversarial Networks (GANs), allow you to generate synthetic data based on existing data. GANs are useful because they leverage two models: one to generate a synthetic output and another one that tries to predict if this output is real or fake.

Through this iterative process, we can generate data that is indistinguishable from the real data but different enough to be used to enhance our training datasets. Another example of data generation architectures is Variational Autoencoders (VAEs), which use an encoder-decoder approach to generate new data samples resembling their training datasets.

Techniques available to optimize foundational models

There are several techniques used to develop and optimize foundational models that have driven significant gains in AI capabilities, some of which are more complex than others from a technical and monetary perspective:

Pre-training refers to fully training a model on a large dataset. It allows models to learn very broad representations from billions of data points, which help the model adapt to other closely related tasks. Popular methods include contrastive self-supervised pre-training on unlabeled data and pre-training on vast supervised data like the internet.
Fine-tuning adapts a pre-trained model’s already learned feature representations to perform a specific task. This only tunes some higher-level model layers rather than training from scratch. On the other hand, adapter tuning equips models with small, lightweight adapters that can rapidly tune to new tasks without interfering with existing capabilities. These pluggable adapters give a parameter-efficient way of accumulating knowledge across multiple tasks by learning task-specific behaviors while reusing the bulk of model weights. They help mitigate forgetting previous tasks and simplify personalization. For example, models may first be pre-trained on billions of text webpages to acquire general linguistic knowledge, before being fine-tuned on more domain-specific datasets for question answering, classification, etc.
Distillation uses a “teacher” model to train a smaller “student” model to reproduce the performance of the larger pre-trained model at a lower cost and latency. Quantizing and compressing large models into efficient forms for deployments also helps optimize performance and cost.

The combination of comprehensive pre-training followed by specialized fine-tuning, adapter tuning, and portable distillation has enabled unprecedented versatility of deep learning across domains. Each approach smartly reuses and transfers available knowledge, enabling the customization and scaling of generative AI.

Techniques to augment your foundational model responses

In addition to architecture and training advances, progress in generative AI has been fueled by innovations in how these models are augmented by external data at inference time.

Prompt engineering tunes the text prompts provided to models to steer their generation quality, capabilities, and properties. Well-designed prompts guide the model to produce the desired output format, reduce ambiguity, and provide helpful contextual constraints. This allows simpler model architectures to solve complex problems by encoding human knowledge into the prompts.

Retrieval augmented generation, also known as RAG, enhances text generation through efficient retrieval of relevant knowledge from external stores. Models receive contextual pieces of information as “context” to be considered as additional input before generating its output. Grounding LLMs (large language models) refers to providing model-specific factual knowledge rather than just model parameters, enabling more accurate, knowledgeable, and specific language generation.

Together, these approaches augment basic predictive language models to become far more versatile, robust, and scalable. They reduce brittleness via tight integration of human knowledge and grounded learning rather than just statistical patterns. RAG handles the breadth and real-time retrieval of information, prompts provide depth and rules to the desired outputs, and grounding binds them to reality. We would highly encourage readers to get familiar with this topic, as it is an industry best practice to perform RAG and to ground your model to prevent it from hallucinating. A good start is the following paper: Retrieval-Augmented Generation for Large Language Models: A Survey (https://arxiv.org/pdf/2312.10997).