Generative AI Application Integration Patterns: Integrate large language models into your applications

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781835887608

Length 218 pages

Edition 1st Edition

Languages

Python

Tools

TensorFlow

Concepts

Artificial Intelligence

Authors (2):

Luis Lopez Soria

Juan Pablo Bustos

Table of Contents (13) Chapters

Preface

1. Introduction to Generative AI Patterns

2. Identifying Generative AI Use Cases

3. Designing Patterns for Interacting with Generative AI

4. Generative AI Batch and Real-Time Integration Patterns

5. Integration Pattern: Batch Metadata Extraction

6. Integration Pattern: Batch Summarization

7. Integration Pattern: Real-Time Intent Classification

8. Integration Pattern: Real-Time Retrieval Augmented Generation

9. Operationalizing Generative AI Integration Patterns

10. Embedding Responsible AI into Your GenAI Applications

11. Other Books You May Enjoy

12. Index

Inference

After constructing the prompt with the retrieved context and the user’s query, the next step is to submit the formatted prompt directly to Vertex AI’s API endpoint to be processed by Gemini 1.5 Flash. This is where the actual generation of the response takes place. In the following code snippet, the generate() function is responsible for sending the prompt to the Gemini 1.5 Flash model and obtaining the generated response:

# Submit the full prompt, including the retrieved context, to the LLM
result = generate(prompt)

The generate() function encapsulates the configuration and settings required for the generation process. It includes two main components: generation_config and safety_settings.

The generation_config dictionary specifies the parameters that control the behavior of the language model during the generation process. In this example, the following settings are provided:

generation_config = {
   "max_output_tokens": 8192,
   "temperature": 0,
   "top_p": 0.95,
}

From Google Gemini’s documentation:

max_output_tokens: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60–80 words.
temperature: The temperature is used for sampling during response generation, which occurs when top_p and top_k are applied. temperature controls the degree of randomness in token selection. Lower temperature values are good for prompts that require a less open-ended or creative response, while higher temperature values can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

In this example, temperature is set to 0, so at each step the model selects the most probable next token (greedy decoding). Higher values flatten the probability distribution and make the output more random and diverse.
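The effect of temperature can be illustrated with a small, self-contained sketch. It is a simplified model of temperature sampling, not Gemini's actual decoder: the function name sample_with_temperature and the example scores are made up for illustration, and the real service may still show slight variation at temperature 0, as the documentation notes.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=None):
    """Pick a token index from raw scores (logits) at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Divide scores by the temperature, then apply a numerically stable softmax.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(sample_with_temperature(logits, temperature=0))    # always index 0
print(sample_with_temperature(logits, temperature=1.0))  # can vary between runs
```

In this sketch, lowering the temperature toward 0 concentrates the weights on the top-scoring token, while raising it spreads them more evenly across the candidates.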

top_p: This parameter changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the top-p value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1, respectively, and the top-p value is 0.5, then the model will select either A or B as the next token by using temperature and will exclude C as a candidate.

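In this case, top_p is set to 0.95, meaning that at each step the model samples only from the smallest set of most probable tokens whose cumulative probability reaches 0.95 and ignores the remaining low-probability tail. With temperature at 0, this has little practical effect, because the most probable token is always inside that set.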
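The A, B, C example from the documentation can be reproduced with a short sketch of top-p (nucleus) filtering. The function name top_p_filter is hypothetical, and the code only illustrates the selection rule described above; it is not the model's internal implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    # Walk the tokens from most to least probable.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        # Small tolerance guards against floating-point rounding.
        if cumulative >= top_p - 1e-9:
            break
    return kept


probs = {"A": 0.3, "B": 0.2, "C": 0.1}
print(top_p_filter(probs, top_p=0.5))  # {'A': 0.3, 'B': 0.2}
```

Token C is excluded because A and B already account for the 0.5 probability mass. The model would then sample between A and B according to the temperature.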

Beyond the above, the safety_settings dictionary specifies the harm categories and corresponding thresholds for filtering potentially harmful or inappropriate content from the generated output. In this example, the following settings are provided:

safety_settings = {
   generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
   generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
   generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
   generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

These settings instruct Gemini 1.5 Flash to block content only when it is rated as having a high probability of being harmful in one of four categories: hate speech, dangerous content, sexually explicit content, or harassment. Content rated below “high” in these categories is allowed in the generated output.
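To make the threshold semantics concrete, the following sketch models how a block threshold relates to a harm probability rating. It is a plain-Python illustration rather than the Vertex AI SDK: the Probability enum, the BLOCK_FROM table, and the is_blocked function are hypothetical stand-ins, and only the threshold and probability names mirror those used by the service.

```python
from enum import IntEnum


class Probability(IntEnum):
    """Hypothetical stand-in for the harm probability levels."""
    NEGLIGIBLE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Lowest probability level that each threshold blocks (None means never block).
BLOCK_FROM = {
    "BLOCK_LOW_AND_ABOVE": Probability.LOW,
    "BLOCK_MEDIUM_AND_ABOVE": Probability.MEDIUM,
    "BLOCK_ONLY_HIGH": Probability.HIGH,
    "BLOCK_NONE": None,
}


def is_blocked(probability, threshold):
    """Return True if content at this probability level would be filtered."""
    floor = BLOCK_FROM[threshold]
    return floor is not None and probability >= floor


print(is_blocked(Probability.MEDIUM, "BLOCK_ONLY_HIGH"))  # False
print(is_blocked(Probability.HIGH, "BLOCK_ONLY_HIGH"))    # True
```

Under this model, BLOCK_ONLY_HIGH is the most permissive setting that still filters anything, which is why medium-probability content passes through in this example.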

The generate() function creates an instance of the GenerativeModel class, passing it the MODEL value, which in this example identifies Gemini 1.5 Flash. It then calls the generate_content() method on the model instance, providing the prompt, generation configuration, and safety settings. The stream=False parameter indicates that the generation should happen in non-streaming mode, meaning the entire response will be generated and returned at once:

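# Assumed import; MODEL, generation_config, and safety_settings are defined earlier.
from vertexai.generative_models import GenerativeModel

def generate(prompt):
    # Create a client for the configured model (Gemini 1.5 Flash in this example).
    model = GenerativeModel(MODEL)
    # Non-streaming call: the full response comes back as a single object.
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    return responses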

The generated response is stored in the responses variable, which is then returned by the generate() function.
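Downstream code usually needs the generated text rather than the response object itself. The helper below is a minimal sketch under two assumptions that should be checked against the SDK version in use: that the returned object exposes a text attribute, and that reading it can raise ValueError when no usable candidate exists (for example, when the output was filtered by the safety settings). The extract_text function and both stub classes are hypothetical and exist only so the example runs without calling Vertex AI.

```python
def extract_text(response, fallback=""):
    """Return response.text, or fallback when no text is available."""
    try:
        return response.text
    except (ValueError, AttributeError):
        # Assumption: a blocked or empty response surfaces as one of these errors.
        return fallback


class StubResponse:
    """Hypothetical stand-in for a successful model response."""
    text = "Generated answer"


class StubBlockedResponse:
    """Hypothetical stand-in for a response with no usable text."""
    @property
    def text(self):
        raise ValueError("no text available")


print(extract_text(StubResponse()))
print(extract_text(StubBlockedResponse(), fallback="[no answer returned]"))
```

In the RAG pipeline, the same helper could be applied to the value returned by generate(prompt), so that a filtered response degrades to a fallback message instead of raising an exception.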

By submitting the formatted prompt to Gemini 1.5 Flash through Vertex AI’s API endpoint with this generation configuration and these safety settings, the RAG pipeline obtains a response that is grounded in the retrieved context, relevant to the user’s query, and subject to the specified content filtering rules.