Architecture
To build our intent classification system, we’ll leverage a serverless, event-driven architecture built on Google Cloud (for example: https://cloud.google.com/architecture/serverless-functions-blueprint). This approach aligns with cloud-native principles and allows for seamless integration with other cloud services.
Figure 7.1: Intent classification example architecture diagram
The architecture consists of the following key components:
- Ingestion layer: This layer is responsible for accepting incoming user inputs from various channels, such as web forms, chat interfaces, or API endpoints. We’ll use Google Cloud Functions as the entry point for our system, which can be triggered by events from services like Cloud Storage, Pub/Sub, or Cloud Run.
- AI processing layer: In this layer, we’ll integrate Google’s Gemini Pro through Vertex AI. Vertex AI provides a managed environment for deploying and scaling machine learning models, ensuring high availability and performance.
- Intent classification model: This is the core component of our system, responsible for analyzing the user input and determining the corresponding intent. We’ll leverage Google Gemini Pro’s natural language understanding capabilities for our intent classification model.
- Orchestration and routing: Based on the classified intent, we’ll need to route the user input to the appropriate downstream system or service. This could involve integrating with customer relationship management (CRM) systems, knowledge bases, or other enterprise applications. We’ll use Cloud Functions or Cloud Run to orchestrate this routing process.
- Monitoring and logging: To ensure the reliability and performance of our system, we’ll implement robust monitoring and logging mechanisms. We’ll leverage services like Cloud Logging, Cloud Monitoring, and Cloud Operations to gain visibility into our system’s behavior and quickly identify and resolve any issues.
By adopting this architecture, the intent classification system won’t just be scalable but also flexible enough to adapt to varying workloads and integration requirements. We’ll be able to handle high volumes of customer inquiries in real time and deliver swift and consistent responses that improve the overall customer experience.
The serverless nature of this architecture brings several additional benefits. It allows for automatic scaling based on demand, ensuring that we can handle sudden spikes in customer inquiries without manual intervention. This elasticity not only improves system reliability but also optimizes costs, as we only pay for the resources we actually use.
This event-driven design facilitates easy integration with other systems and services. As our customer service ecosystem evolves, we can easily add new triggers or outputs to our intent classification system.
This could include integrating with new communication channels, connecting to additional backend systems, or incorporating advanced analytics for deeper insights into customer behavior and preferences.
In the following sections, we’ll dive deeper into each component of our architecture, exploring the specific Google Cloud services we’ll use, best practices for implementation, and strategies for optimizing performance and cost-efficiency. We’ll also discuss a concrete example that will help you get started.
Entry point
For real-time interactive applications, the entry points where prompts originate need to be highly streamlined, with simplicity and ease of use in mind. These prompts often originate from unpredictable contexts, so interfaces have to feel natural across device types and usage scenarios.
In our use case, the entry point could be a web form, chat interface, or API endpoint where customers submit their inquiries. These inputs will be sent to a cloud function, which acts as the ingestion layer for our system.
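As a sketch of what the ingestion layer might do when a request arrives, a handler could validate the incoming JSON before any model call is made. The payload shape and the "message" field name below are illustrative assumptions, not part of the original example:

```python
def ingest(payload):
    """Validate a parsed JSON request body such as
    {"message": "I want to open an account"} and return the raw text.

    The 'message' field name is an illustrative assumption.
    """
    if not isinstance(payload, dict) or "message" not in payload:
        raise ValueError("payload must contain a 'message' field")
    message = str(payload["message"]).strip()
    if not message:
        raise ValueError("'message' must not be empty")
    return message
```

In an HTTP-triggered Cloud Function, the payload would typically come from the parsed request body; the same check applies to Pub/Sub-triggered entry points after decoding the event data.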
Let’s start with a sample user query:
#In this case we will simulate the input from a chat interface
message = "I want to open an account"
Prompt pre-processing
In a real-time system, every step in the prompt pre-processing workflow adds latency to the overall response time, commonly measured in milliseconds or microseconds depending on your application's SLAs. Because higher latency degrades the user experience, pre-processing should be kept as lightweight as possible.
For our intent classification use case, the prompt pre-processing may involve simple text normalization, such as removing punctuation, converting to lowercase, or handling abbreviations. Additionally, we may apply some basic filtering to remove any potentially harmful or inappropriate content before sending the prompt to the model.
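A lightweight normalization step along these lines might look as follows; the exact rules (which characters to keep, the length cap) are illustrative and should be tuned to your channels:

```python
import re

def preprocess(text, max_len=500):
    """Lightweight normalization: lowercase, collapse whitespace,
    drop unusual symbols, and cap the input length."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    text = re.sub(r"[^\w\s?'.,-]", "", text)   # keep word chars and basic punctuation
    return text[:max_len]
```

Because this runs on every request, it deliberately avoids anything heavier than a couple of regular expressions.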
Let’s dive deep into an example prompt:
#In this section we define the prompt, as the task is to perform intent
#classification we will identify the intent by exposing
#the possible values to the LLM
prompt_template = """
You are a helpful assistant for an online financial services company that allows users to check their balances, invest in certificates of deposit (CDs), and perform other financial transactions.
Your task is to identify what your customers are trying to do and return a well formed JSON object.
1. Carefully analyze the content of the message.
2. Classify what the user is trying to do within these options:
* New Account: The user is trying to sign up. Return {{"intent": "signup", "content":"null"}}
* Change Password: The user needs to reset their password. Return {{"intent":"change_password", "content":"null"}}
* Check Balance: The user needs to check their balance. Return {{"intent": "check_balance", "content":"null"}}
* Invest in CD: The user wants to invest in a certificate of deposit. Return {{"intent": "invest_cd", "content": "Extract relevant information such as investment amount and term"}}
* Withdraw Funds: The user wants to withdraw money. Return {{"intent": "withdraw_funds", "content": "Extract information like amount and withdrawal method"}}
* Transfer Funds: The user wants to transfer money between accounts. Return {{"intent": "transfer_funds", "content": "Extract information like amount, source account, and destination account"}}
* Account Information: The user wants to access or update their account information. Return {{"intent": "account_info", "content": "Identify the specific information the user needs"}}
* Lost/Stolen Card: The user wants to report a lost or stolen card. Return {{"intent": "lost_card", "content": "null"}}
* Support: The user needs help and is not sure what to do. Return {{"intent": "support", "content": "null"}}
* Other: For other queries, politely decline to answer and clarify what you can help with.
3. Only return the proper JSON result from your classification.
4. Always think step by step.
User question: {query}
JSON:
"""
The previous prompt defines the template for the intent classification task. The prompt provides context that explains that the assistant is helping users of an online financial services company perform various actions, such as signing up, checking balances, investing in CDs, withdrawing funds, and more.
Additionally, this prompt instructs the model to carefully analyze the user’s input message and classify the intent into one of the predefined categories. For each intent category, the prompt specifies the JSON object that should be returned, including any additional information that needs to be extracted from the user’s message.
For example, if the user’s intent is to invest in a CD, the assistant should return the JSON object in the following format:
{
    "intent": "invest_cd",
    "content": "Extract relevant information such as investment amount and term"
}
This means that the virtual assistant should not only identify the intent as "invest_cd" but also extract relevant information, such as the investment amount and term, from the user's message and include it in the "content" field.
The prompt also provides instructions for handling intents that do not fall into any of the predefined categories (the "Other" case).
By providing this detailed prompt template, the system can effectively guide the language model to perform the intent classification task for financial services scenarios, ensuring that the model’s responses are structured and formatted correctly.
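One detail worth noting: because the template is filled with Python's str.format, the literal braces in the JSON examples are doubled ({{ and }}) so they survive formatting, while {query} is substituted. A shortened, self-contained version of the template shows the effect:

```python
# A condensed version of the template above; doubled braces become
# literal braces after str.format(), while {query} is substituted.
template = """Classify the user's intent.
* New Account: Return {{"intent": "signup", "content": "null"}}
User question: {query}
JSON:
"""

prompt = template.format(query="I want to open an account")
print(prompt)
```

Forgetting to double the braces would raise a KeyError at format time, since str.format would treat the JSON keys as placeholders.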
Inference
At the inference stage, we’ll leverage Google’s Gemini Pro model hosted on Vertex AI. Within the cloud function triggered by the user input, we’ll invoke the Vertex AI endpoint hosting the Gemini Pro model, passing the pre-processed input as the prompt.
Gemini Pro will process the input and return the predicted intent, leveraging its natural language understanding capabilities. Since we’re using an out-of-the-box model, the underlying infrastructure and resource allocation are abstracted away, ensuring that individual requests are processed efficiently while adhering to the service’s performance and cost objectives:
import vertexai
from vertexai import generative_models
from vertexai.generative_models import GenerativeModel

# Placeholder values; replace with your own project, region, and model name
PROJECT = "your-project-id"
LOCATION = "us-central1"
MODEL = "gemini-1.0-pro"

generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0,
    "top_p": 0.95,
}
safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

def generate(prompt):
    vertexai.init(project=PROJECT, location=LOCATION)
    model = GenerativeModel(MODEL)
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    return responses

result = generate(prompt_template.format(query=message))
Result post-processing
For our intent classification use case, the post-processing step may involve formatting the predicted intent into a suitable response format, such as JSON or a human-readable string. Additionally, we may apply some basic filtering or ranking mechanisms to ensure that the most relevant and helpful responses are prioritized.
### Sometimes the model returns Markdown-formatted content; in this case we implement a function to filter it.
import json

def extract_json(text):
    """
    Extracts the JSON portion from a string containing backticks.

    Args:
        text: The string containing JSON data within backticks.

    Returns:
        A dictionary representing the extracted JSON, or None if no valid JSON is found.
    """
    start_index = text.find("```json")
    end_index = text.find("```", start_index + 7)  # +7 to skip "```json"
    if start_index != -1 and end_index != -1:
        json_string = text[start_index + 7:end_index]  # Extract the JSON string
    else:
        json_string = text
    try:
        json_data = json.loads(json_string)
        return json_data
    except json.JSONDecodeError:
        return None
The previous code snippet defines a function called extract_json that is designed to handle cases where the language model's output contains JSON data wrapped in backticks (a fenced "```json" block). This is a common practice in Markdown-friendly environments, where backticks are used to delineate code blocks or structured data.
The extract_json function takes a string text as input and attempts to extract the JSON portion from within the backticks. Here's a breakdown of how the function works:
- The function first looks for the string "```json" in the input text using the find method. This marker indicates the start of a JSON block.
- If the start marker is found, the function then looks for the closing "```" marker, searching from the end of the "```json" marker (start_index + 7). If both the start and end markers are found, the function extracts the JSON string by slicing the input text between these markers. If either marker is missing, the function assumes that the entire input text is the JSON string.
- The function then attempts to parse the extracted JSON string using the json.loads function from the json module. If the parsing is successful, the function returns the resulting JSON data as a dictionary. If the parsing fails (for example, due to invalid JSON syntax), the function returns None.
By incorporating this function into the post-processing stage, the system can handle cases where the language model's output contains JSON data wrapped in backticks. This functionality is particularly useful when working with Markdown-friendly environments or when integrating the intent classification system with other components that expect JSON-formatted data. The post-processing stage can then format the extracted JSON data into a suitable response, apply filtering or ranking mechanisms, and render the final response for display to the user.
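To make the behavior concrete, here is a quick check of the function on both kinds of input, with extract_json restated so the snippet runs standalone:

```python
import json

def extract_json(text):
    """Same logic as above, condensed for a self-contained demo."""
    start = text.find("```json")
    end = text.find("```", start + 7)
    if start != -1 and end != -1:
        text = text[start + 7:end]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# A model reply with the JSON wrapped in a fenced block
wrapped = 'Here you go:\n```json\n{"intent": "signup", "content": "null"}\n```'
parsed = extract_json(wrapped)       # a dict with the intent fields
invalid = extract_json("plain {bad json")  # None, since parsing fails
```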
The process_intent function is designed to handle the JSON data returned by the intent classification model. It takes a dictionary, intent, as input, which is expected to have an "intent" key with a value representing the predicted intent category.
def process_intent(intent):
    if intent is None:
        # extract_json returns None when the model output is not valid JSON
        return("Could not parse the model response")
    if intent["intent"] == "signup":
        #If a user is trying to sign up you could
        #redirect them to a sign-up page, for example.
        return("Sign up process")
    elif intent["intent"] == "change_password":
        #If a user is looking into changing their password,
        #you could either do it through the chatbot
        #or redirect to a password change page.
        return("Change password")
    elif intent["intent"] == "check_balance":
        #In this case you could have a function that
        #queries a database to obtain the balance
        #(provided the user is logged in).
        return("Check account balance")
    elif intent["intent"] == "invest_cd":
        #For the investment intent, this could redirect
        #to a page where investment options can be selected.
        return("Invest in a CD")
    elif intent["intent"] == "withdraw_funds":
        return("Withdraw funds")
    elif intent["intent"] == "transfer_funds":
        return("Transfer funds")
    elif intent["intent"] == "account_info":
        return("Account information")
    elif intent["intent"] == "lost_card":
        return("Report lost card")
    elif intent["intent"] == "support":
        return("Contact support")
    elif intent["intent"] == "other":
        return("Other kind of intent")
    else:
        return("If an intent was classified as something else you should investigate what is going on.")
intent = process_intent(extract_json(result.text))
The process_intent function checks the value of the "intent" key in the input dictionary. Depending on the intent category, the function performs a specific action or returns a corresponding message.
For example, if the intent is "signup", the function returns the string "Sign up process", which could be used to redirect the user to a sign-up page or initiate the sign-up process. Similarly, if the intent is "change_password", the function returns "Change password", which could trigger a password reset process or redirect the user to a password change page.
For intents like "check_balance", "invest_cd", "withdraw_funds", "transfer_funds", "account_info", "lost_card", and "support", the function returns corresponding messages that could be used to initiate the relevant processes or provide instructions to the user.
If the intent is "other", the function returns "Other kind of intent", indicating that the user's query did not match any of the predefined intent categories.
If the intent does not match any of the cases handled by the function, it returns a message suggesting that further investigation is needed to understand the intent.
Finally, the last line of code, intent = process_intent(extract_json(result.text)), combines the extract_json and process_intent functions. It first extracts the JSON data from the result.text string using extract_json. Then, it passes the extracted JSON data to the process_intent function, which processes the intent and returns an appropriate message or action.
This code snippet demonstrates how the intent classification system can be integrated with further processing steps to handle different user intents. The process_intent function can be extended or modified to include additional logic or actions based on the specific requirements of the application.
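As a design note, the if/elif chain can also be expressed as a dispatch dictionary, which keeps the intent-to-action mapping in one place and makes adding a new intent a one-line change. A sketch, with the action strings mirroring those returned by the original function:

```python
INTENT_ACTIONS = {
    "signup": "Sign up process",
    "change_password": "Change password",
    "check_balance": "Check account balance",
    "invest_cd": "Invest in a CD",
    "withdraw_funds": "Withdraw funds",
    "transfer_funds": "Transfer funds",
    "account_info": "Account information",
    "lost_card": "Report lost card",
    "support": "Contact support",
    "other": "Other kind of intent",
}

def process_intent_dispatch(intent):
    """Dictionary-based variant of process_intent; falls back to an
    investigation message for unknown intents and handles the case
    where JSON extraction failed (intent is None)."""
    if intent is None:
        return "Could not parse the model response"
    return INTENT_ACTIONS.get(
        intent.get("intent"),
        "If an intent was classified as something else you should investigate what is going on.",
    )
```

The lookup-table form also makes it straightforward to swap the string values for callables later, so each intent can trigger real downstream logic.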
Result presentation
The result presentation stage for real-time applications demands near-instantaneous updates, often delivered through server-side rendering or data-binding frameworks.
In our use case, the formatted response containing the predicted intent can be sent back to the customer through the channel from which the inquiry originated (for example, web form, chat interface, or API response). This response can then be used to route the inquiry to the appropriate downstream system or provide an automated response for common intents.
In this example, we will use a Gradio interface to render the replies in a visually appealing UI. Gradio (https://www.gradio.app/) is an open-source Python package that allows you to quickly create easy-to-use, customizable UI components for your ML model, any API, or even an arbitrary Python function using a few lines of code.
You can find more information in the Gradio documentation: https://www.gradio.app/docs
The following code provides an example that creates a Gradio interface:
import gradio as gr

def chat(message, history):
    response = generate(prompt_template.format(query=message))
    intent_action = process_intent(extract_json(response.text))
    history.append((message, intent_action))
    return "", history

with gr.Blocks() as demo:
    gr.Markdown("Fintech Assistant")
    chatbot = gr.Chatbot(show_label=False)
    message = gr.Textbox(placeholder="Enter your question")
    message.submit(chat, [message, chatbot], [message, chatbot])

demo.launch(debug=True)
The previous code illustrates the result presentation stage for the intent classification system using the Gradio library.
In our example, the chat(message, history) function is the core of the chatbot interface. It takes two arguments: message (the user's input message) and history (a list containing the previous messages and responses). Here's what the function does:
- It calls the generate function defined earlier to get the response from the intent classification model, passing the user's message as part of the prompt template.
- It then processes the model's response using the extract_json function to extract the predicted intent data.
- The extracted intent data is passed to the process_intent function to determine the appropriate action or response based on the predicted intent.
- The user's message and the generated response are appended to the history list, which keeps track of the conversation.
- The function returns an empty string (to clear the input box) and the updated history list.
The code then creates a Gradio interface using the gr.Blocks context manager. Inside the context, it does the following:
- Displays a title using the gr.Markdown component.
- Creates a gr.Chatbot component to display the conversation history.
- Creates a gr.Textbox component for the user to enter their message.
- Binds the chat function to the submit event of the Textbox component. When the user submits their message, the chat function is called with the user's message and the current history as arguments, and the Textbox and Chatbot components are updated with the returned values.
Finally, demo.launch(debug=True) launches the Gradio interface in debug mode.
The result is an interactive chatbot interface where users can enter their messages, as illustrated in Figure 7.2, and the system will process the message, predict the intent, and provide an appropriate response based on the process_intent function. The conversation history is displayed in the Chatbot component, allowing users to track the flow of the conversation.
Figure 7.2: Example Gradio interface
Logging and monitoring
Real-time systems require tight instrumentation around per-request metrics, such as latencies, errors, and resource usage.
In our architecture, we’ll leverage services like Cloud Logging (https://cloud.google.com/logging/docs/overview) and Cloud Monitoring (https://cloud.google.com/monitoring/docs/monitoring-overview) to gain visibility into the system’s behavior and quickly identify and resolve any issues. We can monitor metrics like request latency, error rates, and resource utilization, and set up alerts for anomalies or performance degradation.
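As a minimal sketch of per-request instrumentation (the field names are illustrative), a timing wrapper around the classification call can emit one structured log entry per request. In Cloud Functions, JSON written through the standard logger is captured by Cloud Logging, where each entry becomes a queryable record:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("intent-classifier")

def timed_classify(message, classify_fn):
    """Wrap a classification call with per-request latency and
    error logging; classify_fn is whatever performs the model call."""
    start = time.perf_counter()
    try:
        result = classify_fn(message)
        status = "ok"
        return result
    except Exception:
        status = "error"
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        logger.info(json.dumps({
            "event": "intent_classification",
            "status": status,
            "latency_ms": round(latency_ms, 2),
        }))
```

Metrics derived from these entries (latency percentiles, error rates) can then drive Cloud Monitoring alerts for anomalies or performance degradation.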
By following this integration pattern and leveraging Google's Gemini Pro, businesses can unlock the power of generative AI to build intelligent systems that accurately classify user intents, enhance customer experiences, and streamline operations.
Refer to the GitHub directory of this chapter for the complete code that demonstrates how all the pieces described above fit together.