Deploying LLMs with Amazon SageMaker - Part 2

  • 19 min read
  • 30 Nov 2023



Introduction

In the first part of this post, we showed how easy it is to deploy large language models (LLMs) in the cloud using a managed machine learning service called Amazon SageMaker. In just a few steps, we were able to deploy a MistralLite model to a SageMaker Inference Endpoint. If you've worked on real ML-powered projects before, you probably know that deploying a model is just the first step! There are a few more steps before we can consider our application ready for use.

If you’re looking for the link to the first part, here it is: Deploying LLMs with Amazon SageMaker - Part 1

In this post, we'll build on top of what we already have in Part 1 and prepare a demo user interface for our chatbot application. Specifically, we will tackle the following sections:

● Section I: Preparing the SageMaker Notebook Instance (discussed in Part 1)

● Section II: Deploying an LLM using the SageMaker Python SDK to a SageMaker Inference Endpoint (discussed in Part 1)

● Section III: Enabling Data Capture with SageMaker Model Monitor

● Section IV: Invoking the SageMaker inference endpoint using the boto3 client

● Section V: Preparing a Demo UI for our chatbot application

● Section VI: Cleaning Up

Without further ado, let’s begin!

Section III: Enabling Data Capture with SageMaker Model Monitor

To analyze our deployed LLM, we need to collect the requests and responses that pass through it in a central storage location. Instead of building our own solution to collect this information, we can simply use the built-in Model Monitor capability of SageMaker. All we need to do is prepare the configuration details and run the update_data_capture_config() method of the inference endpoint object, and data capture will be enabled right away! With that, let's proceed with the steps required to enable and test data capture for our SageMaker Inference endpoint:

STEP # 01: Continuing where we left off in Part 1 of this post, let’s get the bucket name of the default bucket used by our session:

s3_bucket_name = sagemaker_session.default_bucket()
s3_bucket_name

STEP # 02: In addition to this, let’s prepare and define a few prerequisites as well:

prefix = "llm-deployment"
base = f"s3://{s3_bucket_name}/{prefix}"
s3_capture_upload_path = f"{base}/model-monitor"

STEP # 03: Next, let’s define the data capture config:

from sagemaker.model_monitor import DataCaptureConfig

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_upload_path,
    kms_key_id=None,
    capture_options=["REQUEST", "RESPONSE"],
    csv_content_types=["text/csv"],
    json_content_types=["application/json"]
)

Here, we specify that we’ll be collecting 100% of the requests and responses that pass through the deployed model.

STEP # 04: Let's enable data capture so that the request and response data is saved to Amazon S3:

predictor.update_data_capture_config(
    data_capture_config=data_capture_config
)

Note that this step may take about 8-10 minutes to complete. Feel free to grab a cup of coffee or tea while waiting!
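If you want to check on the progress instead of just waiting, here's a minimal sketch (using the SageMaker boto3 client, which we'll also use in Section IV) that looks up the endpoint status. It should read "Updating" while the change is being applied and return to "InService" once data capture is enabled:

import boto3

# Optional: check whether the endpoint update has finished. The status
# switches from "Updating" back to "InService" once data capture is applied.
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(
    EndpointName=predictor.endpoint_name
)["EndpointStatus"]
print(status)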

STEP # 05: Let’s check if we are able to capture the input request and output response by performing another sample request:

result = predictor.predict(input_data)[0]["generated_text"]
print(result)

This should yield the following output:

"The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries. There is no single answer that can be definitively proven, as the meaning of life is subjective and can vary greatly from person to person.\n\nSome people believe that the meaning of life is to find happiness and fulfillment through personal growth, relationships, and experiences. Others believe that the meaning of life is to serve a greater purpose, such as through a religious or spiritual calling, or by making a positive impact on the world through their work or actions.\n\nUltimately, the meaning of life is a personal journey that each individual must discover for themselves. It may involve exploring different beliefs and perspectives, seeking out new experiences, and reflecting on what brings joy and purpose to one's life."

Note that it may take a minute or two before the .jsonl file(s) containing the request and response data appear in our S3 bucket.

STEP # 06: Let’s prepare a few more examples:

prompt_examples = [
    "What is the meaning of life?",
    "What is the color of love?",
    "How to deploy LLMs using SageMaker",
    "When do we use Bedrock and when do we use SageMaker?"
]

STEP # 07: Let's also define the perform_request() function, which wraps the relevant lines of code for performing a request to our deployed LLM:

def perform_request(prompt, predictor):
    input_data = {
        "inputs": f"<|prompter|>{prompt}</s><|assistant|>",
        "parameters": {
            "do_sample": False,
            "max_new_tokens": 2000,
            "return_full_text": False,
        }
    }
   
    response = predictor.predict(input_data)
    return response[0]["generated_text"]

STEP # 08: Let’s quickly test the perform_request() function:

perform_request(prompt_examples[0], predictor=predictor)

STEP # 09: With everything ready, let’s use the perform_request() function to perform requests using the examples we’ve prepared in an earlier step:

from time import sleep

for example in prompt_examples:
    print("Input:", example)
   
    generated = perform_request(
        prompt=example,
        predictor=predictor
    )
    print("Output:", generated)
    print("-"*20)
    sleep(1)

This should return the following:

Input: What is the meaning of life?
...
--------------------
Input: What is the color of love?
Output: The color of love is often associated with red, which is a vibrant and passionate color that is often used to represent love and romance. Red is a warm and intense color that can evoke strong emotions, making it a popular choice for representing love.

However, the color of love is not limited to red. Other colors that are often associated with love include pink, which is a softer and more feminine shade of red, and white, which is often used to represent purity and innocence.

Ultimately, the color of love is subjective and can vary depending on personal preferences and cultural associations. Some people may associate love with other colors, such as green, which is often used to represent growth and renewal, or blue, which is often used to represent trust and loyalty.
...

Note that this is just a portion of the overall output and you should get a relatively long response for each input prompt.

Section IV: Invoking the SageMaker inference endpoint using the boto3 client

While it's convenient to use the SageMaker Python SDK to invoke our inference endpoint, it's also worth knowing how to invoke our deployed model with boto3. This will allow us, for example, to invoke the inference endpoint from an AWS Lambda function.


Image 10 — Utilizing API Gateway and AWS Lambda to invoke the deployed LLM

This Lambda function would then be triggered by an event from an API Gateway resource, similar to what we have in Image 10. Note that we're not planning to complete the entire setup in this post, but having a working example of how to use boto3 to invoke the SageMaker inference endpoint should easily allow you to build a complete serverless application utilizing API Gateway and AWS Lambda.
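To make this concrete, here's a minimal sketch of what such a Lambda handler could look like. It assumes the endpoint name is passed in through an ENDPOINT_NAME environment variable and that API Gateway forwards a JSON body containing a "prompt" field; the event shape and names here are illustrative rather than part of the setup we build in this post. The boto3 calls it uses are the same ones we'll walk through in the steps below.

import json
import os

import boto3

# Reuse the client across invocations by creating it at module level.
sagemaker_runtime = boto3.client("runtime.sagemaker")

def lambda_handler(event, context):
    # Assumes an API Gateway proxy integration passing a JSON body
    # such as {"prompt": "What is the meaning of life?"}.
    prompt = json.loads(event["body"])["prompt"]
    payload = {
        "inputs": f"<|prompter|>{prompt}</s><|assistant|>",
        "parameters": {"do_sample": False, "max_new_tokens": 2000},
    }

    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],
        ContentType="application/json",
        Body=json.dumps(payload).encode(),
    )
    result = json.loads(response["Body"].read().decode())

    return {
        "statusCode": 200,
        "body": json.dumps({"generated_text": result[0]["generated_text"]}),
    }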

STEP # 01: Let’s quickly check the endpoint name of the SageMaker inference endpoint:

predictor.endpoint_name

This should return the endpoint name with a format similar to what we have below:

'MistralLite-HKGKFRXURT'

STEP # 02: Let’s prepare our boto3 client using the following lines of code:

import boto3
import json

boto3_client = boto3.client('runtime.sagemaker')

STEP # 03: Now, let's invoke the endpoint:

body = json.dumps(input_data).encode()

response = boto3_client.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType='application/json',
    Body=body
)
   
result = json.loads(response['Body'].read().decode())

STEP # 04: Let’s quickly inspect the result:

result

This should give us the following:

[{'generated_text': "The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries. There is no single answer that can be definitively proven, as the meaning of life is subjective and can vary greatly from person to person..."}]

STEP # 05: Let's extract and print the generated text from the result:

result[0]['generated_text']

This should yield the following output:

"The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries..."

STEP # 06: Now, let's define perform_request_2(), which uses the boto3 client to invoke our deployed LLM:

def perform_request_2(prompt, boto3_client, predictor):
    input_data = {
        "inputs": f"<|prompter|>{prompt}</s><|assistant|>",
        "parameters": {
            "do_sample": False,
            "max_new_tokens": 2000,
            "return_full_text": False,
        }
    }
   
    body = json.dumps(input_data).encode()

    response = boto3_client.invoke_endpoint(
        EndpointName=predictor.endpoint_name,
        ContentType='application/json',
        Body=body
    )
   
    result = json.loads(response['Body'].read().decode())

    return result[0]["generated_text"]

STEP # 07: Next, let’s run the following block of code to have our deployed LLM answer the same set of questions using the perform_request_2() function:

for example in prompt_examples:
    print("Input:", example)
   
    generated = perform_request_2(
        prompt=example,
        boto3_client=boto3_client,
        predictor=predictor
    )
    print("Output:", generated)
    print("-"*20)
    sleep(1)

This will give us the following output:

Input: What is the meaning of life?
...
--------------------
Input: What is the color of love?
Output: The color of love is often associated with red, which is a vibrant and passionate color that is often used to represent love and romance. Red is a warm and intense color that can evoke strong emotions, making it a popular choice for representing love.

However, the color of love is not limited to red. Other colors that are often associated with love include pink, which is a softer and more feminine shade of red, and white, which is often used to represent purity and innocence.

Ultimately, the color of love is subjective and can vary depending on personal preferences and cultural associations. Some people may associate love with other colors, such as green, which is often used to represent growth and renewal, or blue, which is often used to represent trust and loyalty.
...

Given that it may take a few minutes before the .jsonl files appear in our S3 bucket, let's wait for about 3-5 minutes before proceeding with the next set of steps. Feel free to grab a cup of coffee or tea while waiting!
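If you'd rather not guess at the timing, here's a small optional sketch that polls the capture prefix with the boto3 S3 client until at least one object shows up. It reuses the s3_bucket_name and prefix variables we defined earlier in this notebook:

from time import sleep

import boto3

s3_client = boto3.client("s3")
capture_prefix = f"{prefix}/model-monitor"

# Poll the capture prefix until at least one captured file is available.
while True:
    listing = s3_client.list_objects_v2(
        Bucket=s3_bucket_name,
        Prefix=capture_prefix
    )
    if listing.get("KeyCount", 0) > 0:
        print(f"Found {listing['KeyCount']} captured file(s)")
        break
    print("No captured files yet, waiting...")
    sleep(30)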

STEP # 08: Let’s run the following block of code to list the captured data files stored in our S3 bucket:

results = !aws s3 ls {s3_capture_upload_path} --recursive
results

STEP # 09: In addition to this, let's convert each entry into a full S3 path and store the results inside the processed variable:

processed = []

for result in results:
    partial = result.split()[-1]
    path = f"s3://{s3_bucket_name}/{partial}"
    processed.append(path)
   
processed

STEP # 10: Let’s create a new directory named captured_data using the mkdir command:

!mkdir -p captured_data

STEP # 11: Now, let’s download the .jsonl files from the S3 bucket to the captured_data directory in our SageMaker Notebook Instance:

for index, path in enumerate(processed):
    print(index, path)
    !aws s3 cp {path} captured_data/{index}.jsonl

STEP # 12: Let’s define the load_json_file() function which will help us load files with JSON content:

import json

def load_json_file(path):
    output = []
   
    with open(path) as f:
        output = [json.loads(line) for line in f]
       
    return output
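Before loading everything, we can optionally peek at a single record from the first downloaded file to get a feel for the structure of a captured data point:

# Optional: inspect the first record of the first downloaded .jsonl file.
sample_records = load_json_file("captured_data/0.jsonl")
print(json.dumps(sample_records[0], indent=2))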

STEP # 13: Using the load_json_file() function we defined in an earlier step, let’s load the .jsonl files and store them inside the all variable for easier viewing:

all = []

for i, _ in enumerate(processed):
    print(f">: {i}")
    new_records = load_json_file(f"captured_data/{i}.jsonl")
    all = all + new_records
   
   
all

Running this will yield the following response:


Image 11 — All captured data points inside the all variable

Feel free to analyze the nested structure stored in the all variable. In case you're interested in how this captured data can be analyzed and processed further, you may check Chapter 8, Model Monitoring and Management Solutions, of my second book, "Machine Learning Engineering on AWS".
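As a starting point, here's a rough sketch of how the captured records could be flattened into a pandas DataFrame. It assumes the usual Model Monitor capture layout, where each record nests the raw request and response payloads under captureData.endpointInput.data and captureData.endpointOutput.data; adjust the keys if your captured files look different.

import pandas as pd

rows = []
for record in all:
    capture = record.get("captureData", {})
    rows.append({
        # Raw request and response payloads as captured by Model Monitor.
        "request": capture.get("endpointInput", {}).get("data"),
        "response": capture.get("endpointOutput", {}).get("data"),
        "event_id": record.get("eventMetadata", {}).get("eventId"),
    })

df = pd.DataFrame(rows)
df.head()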

Section V: Preparing a Demo UI for our chatbot application

Years ago, we had to spend a few hours to a few days preparing a user interface for a working demo. If you have not used Gradio before, you may be surprised that it only takes a few lines of code to set everything up. In the next set of steps, we'll do just that and use the model we've deployed earlier as the backend of our demo application:

STEP # 01: Continuing where we left off in the previous part, let’s install a specific version of gradio using the following command:

!pip install gradio==3.49.0

STEP # 02: We'll also pin fastapi to a specific version:

!pip uninstall -y fastapi
!pip install fastapi==0.103.1

STEP # 03: Let’s prepare a few examples and store them in a list:

prompt_examples = [
    "What is the meaning of life?",
    "What is the color of love?",
    "How to deploy LLMs using SageMaker",
    "When do we use Bedrock and when do we use SageMaker?",
    "Try again",
    "Provide 10 alternatives",
    "Summarize the previous answer into at most 2 sentences"
]

STEP # 04: In addition to this, let’s define the parameters using the following block of code:

parameters = {
    "do_sample": False,
    "max_new_tokens": 2000,
}

STEP # 05: Next, define the process_and_respond() function, which we'll use to invoke the inference endpoint:

def process_and_respond(message, chat_history):
    processed_chat_history = ""

    if len(chat_history) > 0:
        for chat in chat_history:
            processed_chat_history += f"<|prompter|>{chat[0]}</s><|assistant|>{chat[1]}</s>"

           
    prompt = f"{processed_chat_history}<|prompter|>{message}</s><|assistant|>"
    response = predictor.predict({"inputs": prompt, "parameters": parameters})

    parsed_response = response[0]["generated_text"][len(prompt):]
    chat_history.append((message, parsed_response))
    return "", chat_history
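Before wiring this into Gradio, we can optionally run a quick smoke test of the function. Starting from an empty chat history, it should return an empty string (used later to clear the message box) and a history list with a single (prompt, reply) pair:

# Quick sanity check of process_and_respond() against the deployed endpoint.
_, history = process_and_respond("What is the meaning of life?", [])
print(history[-1][1])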

STEP # 06: Now, let’s set up and prepare the user interface we’ll use to interact with our chatbot:

import gradio as gr

with gr.Blocks(theme=gr.themes.Monochrome(spacing_size="sm")) as demo:
    with gr.Row():
        with gr.Column():
           
            message = gr.Textbox(label="Chat Message Box",
                                 placeholder="Input message here",
                                 show_label=True,
                                 lines=12)


            submit = gr.Button("Submit")
           
            examples = gr.Examples(examples=prompt_examples,
                                   inputs=message)
        with gr.Column():
            chatbot = gr.Chatbot(height=900)
   
    submit.click(process_and_respond,
                 [message, chatbot],
                 [message, chatbot],
                 queue=False)

Here, we can see the power of Gradio as we only needed a few lines of code to prepare a demo app.

STEP # 07: Now, let’s launch our demo application using the launch() method:

demo.launch(share=True, auth=("admin", "replacethis1234!"))

This will yield the following logs:

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://123456789012345.gradio.live

STEP # 08: Open the public URL in a new browser tab. This will load a login page which will require us to input the username and password before we are able to access the chatbot.


Image 12 — Login page

Specify admin and replacethis1234! in the login form to proceed.

STEP # 09: After signing in using the credentials, we’ll be able to access a chat interface similar to what we have in Image 13. Here, we can try out various types of prompts.


Image 13 — The chatbot interface

On the left side of the screen, we have a Chat Message Box where we can input and run different prompts. The current conversation is then displayed on the right side.

STEP # 10: Click the first example “What is the meaning of life?”. This will auto-populate the text area similar to what we have in Image 14:


Image 14 — Using one of the examples to populate the Chat Message Box

STEP # 11: Click the Submit button afterwards. After a few seconds, we should get the following response in the chat box:


Image 15 — Response of the deployed model

Amazing, right? Here, we just asked the AI what the meaning of life is.

STEP # 12: Click the last example “Summarize the previous answer into at most 2 sentences”. This will auto-populate the text area with the said example. Click the Submit button afterward.


Image 16 — Summarizing the previous answer into at most 2 sentences

Feel free to try other prompts. Note that we are not limited to the prompts available in the list of examples in the interface.

Important Note: Like other similar AI/ML solutions, there's the risk of hallucinations or the generation of misleading information. For this reason, it's critical that we exercise caution and validate the outputs produced by any Generative AI-powered system to ensure the accuracy of the results.

Section VI: Cleaning Up

We're not done yet! Cleaning up the resources we've created and launched is a very important step, as it ensures that we don't keep paying for resources we're no longer using.

STEP # 01: Once you’re done trying out various types of prompts, feel free to turn off and clean up the resources launched and created using the following lines of code:

demo.close()
predictor.delete_endpoint()

STEP # 02: Make sure to turn off (or delete) the SageMaker Notebook instance as well. I’ll leave this to you as an exercise!
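If you prefer doing this programmatically rather than through the console, here's a minimal sketch using boto3. The instance name below is a placeholder; replace it with the name of your own notebook instance, and run this from outside the instance itself (stopping it ends the session).

import boto3

sm_client = boto3.client("sagemaker")
notebook_name = "llm-deployment-notebook"  # placeholder; use your own instance name

# Stop the notebook instance first; it must reach the "Stopped" status
# before it can be deleted.
sm_client.stop_notebook_instance(NotebookInstanceName=notebook_name)

# Once stopped, the instance can be deleted entirely:
# sm_client.delete_notebook_instance(NotebookInstanceName=notebook_name)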

Wasn't that easy?! As you can see, deploying LLMs with Amazon SageMaker is straightforward. Given that Amazon SageMaker handles most of the heavy lifting of managing the infrastructure, we're able to focus more on the deployment of our machine learning model. We are just scratching the surface, as there is a long list of capabilities and features available in SageMaker. If you want to take things to the next level, feel free to read two of my books focusing heavily on SageMaker: "Machine Learning with Amazon SageMaker Cookbook" and "Machine Learning Engineering on AWS".

Author Bio

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of three Australian-owned companies and as the Director for Software Development and Engineering for multiple e-commerce startups. Years ago, he and his team won 1st place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management. He is the author of the books "Machine Learning with Amazon SageMaker Cookbook", "Machine Learning Engineering on AWS", and "Building and Automating Penetration Testing Labs in the Cloud". Due to his proven track record in leading digital transformation within organizations, he has been recognized as one of the prestigious Orange Boomerang: Digital Leader of the Year 2023 award winners.