Inference

After constructing the prompt with the retrieved context and the user’s query, the next step is to submit the formatted prompt directly to Vertex AI’s API endpoint to be processed by Gemini 1.5 Flash. This is where the actual generation of the response takes place. In the following code snippet, the generate() function is responsible for sending the prompt to the Gemini 1.5 Flash model and obtaining the generated response:

# This is the section where we submit the full prompt and
# context to the LLM
result = generate(prompt)

The generate() function encapsulates the configuration and settings required for the generation process. It includes two main components: generation_config and safety_settings.

The generation_config dictionary specifies the parameters that control the behavior of the language model during the generation process. In this example, the following settings are provided:

generation_config = {
   "max_output_tokens": 8192,
   "temperature": 0,
   "top_p": 0.95,
}

From Google Gemini’s documentation:

  • max_output_tokens: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60–80 words.
  • temperature: The temperature is used for sampling during response generation, which occurs when top_p and top_k are applied. temperature controls the degree of randomness in token selection. Lower temperature values are good for prompts that require a less open-ended or creative response, while higher temperature values can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

A temperature of 0 means the model will choose the most likely token based on its training data, while higher values introduce more randomness and diversity in the output.

  • top_p: This parameter changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the top-p value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1, respectively, and the top-p value is 0.5, then the model will select either A or B as the next token by using temperature and will exclude C as a candidate.

In this case, it is set to 0.95, meaning that tokens are considered from most to least probable until their cumulative probability reaches 95%; the remaining low-probability tail is excluded from sampling.
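
To make the effect of these parameters concrete, the following is a small illustrative sketch contrasting a near-deterministic configuration with a more exploratory one. The alternative values are assumptions chosen for illustration, not recommendations from the chapter:

# Near-deterministic settings: suited to extraction, classification,
# or grounded RAG answers where reproducibility matters.
precise_config = {
    "max_output_tokens": 8192,
    "temperature": 0,      # always pick the highest-probability token
    "top_p": 0.95,
}

# More exploratory settings (illustrative values): useful for
# brainstorming or rewriting, at the cost of less predictable output.
creative_config = {
    "max_output_tokens": 8192,
    "temperature": 0.9,    # flatter sampling distribution
    "top_p": 0.99,         # consider a wider slice of the token mass
}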

In addition to the generation configuration, the safety_settings dictionary specifies the harm categories and the corresponding thresholds used to filter potentially harmful or inappropriate content from the generated output. In this example, the following settings are provided:

# Assumes the safety enums have been imported, for example:
# from vertexai import generative_models
safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

These settings instruct the Gemini 1.5 Flash model to block only highly harmful content related to hate speech, dangerous content, sexually explicit content, and harassment. Any content that falls below the “high” harm threshold for these categories will be allowed in the generated output.
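
When a candidate is filtered by these thresholds, the SDK reports that outcome on the response object rather than in the generated text. The following is a minimal sketch of how the safety outcome could be inspected on the response returned by generate(); it assumes the vertexai.generative_models response structure, and exact attribute names can vary between SDK versions:

# Illustrative only: check whether any candidate was blocked for safety.
result = generate(prompt)

for candidate in result.candidates:
    if candidate.finish_reason.name == "SAFETY":
        # The candidate was filtered; the ratings show which category fired
        for rating in candidate.safety_ratings:
            print(f"blocked - {rating.category.name}: {rating.probability.name}")
    else:
        print(candidate.text)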

The generate() function creates an instance of the GenerativeModel class, passing the MODEL parameter; in this example, Gemini 1.5 Flash. It then calls the generate_content() method on the model instance, providing the prompt, generation configuration, and safety settings. The stream=False parameter indicates that the generation should happen in a non-streaming mode, meaning the entire response will be generated and returned at once:

# Assumes the model class has been imported, for example:
# from vertexai.generative_models import GenerativeModel
def generate(prompt):
    model = GenerativeModel(MODEL)
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    return responses

The generated response is stored in the responses variable, which is then returned by the generate() function.
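
From the caller's side, the generated text can then be read from the returned response object. A minimal usage sketch, assuming the response exposes the text of the first candidate via its .text property as in current versions of the Vertex AI SDK:

# Submit the fully assembled prompt and read the generated answer
result = generate(prompt)
answer = result.text
print(answer)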

By submitting the formatted prompt to Vertex AI's API endpoint for Gemini 1.5 Flash with the generation configuration and safety settings described above, the RAG pipeline obtains a contextualized, relevant response to the user's query while respecting the specified sampling parameters and content-filtering rules.
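
For interactive, real-time use cases, the same call can instead be made in streaming mode so that partial output reaches the user as it is produced. The sketch below shows how that variant could look with stream=True; it is an illustrative alternative, not the approach used in this chapter:

# Illustrative streaming variant: identical settings, but stream=True
# yields partial chunks as they are produced instead of one full response.
def generate_streaming(prompt):
    model = GenerativeModel(MODEL)
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )
    for chunk in responses:
        # Each chunk carries a fragment of the generated text
        print(chunk.text, end="", flush=True)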
