
Deploying LLMs with Amazon SageMaker - Part 1

  • 13 min read
  • 29 Nov 2023



Introduction

Have you ever asked a Generative AI-powered chatbot the question: “What is the meaning of life?” In case you have not tried that yet, here’s the response I got when I asked that myself using a custom chatbot app I built with a managed machine learning (ML) service called Amazon SageMaker.

Image 01 — Asking a chatbot the meaning of life

You might be surprised to learn that I built this quick demo application myself in just a few hours! In this post, I will teach you how to deploy your own Large Language Models (LLMs) to a SageMaker Inference Endpoint (that is, a machine learning-powered server that responds to inputs) with just a few lines of code.

Image 02 — Deploying an LLM to a SageMaker Inference Endpoint

While most tutorials teach us how to use existing Application Programming Interfaces (APIs) to build chatbot applications, it’s best that we also know how to deploy LLMs on our own servers in order to guarantee data privacy and compliance. In addition, we’ll be able to manage the long-term costs of our AI-powered systems. One of the most powerful solutions available for these requirements is Amazon SageMaker, which helps us focus on the work we need to do instead of worrying about cloud infrastructure management.

We’ll divide the hands-on portion into the following sections:

●  Section I: Preparing the SageMaker Notebook Instance

●  Section II: Deploying an LLM using the SageMaker Python SDK to a SageMaker Inference Endpoint

●  Section III: Enabling Data Capture with SageMaker Model Monitor (discussed in Part 2)

●  Section IV: Invoking the SageMaker inference endpoint using the boto3 client (discussed in Part 2)

●  Section V: Preparing a Demo UI for our chatbot application (discussed in Part 2)

●  Section VI: Cleaning Up (discussed in Part 2)

 Without further ado, let’s begin!

Section I: Preparing the SageMaker Notebook Instance

Let’s start by creating a SageMaker Notebook instance. Note that while we can also do this in SageMaker Studio, running the example in a SageMaker Notebook instance should do the trick. If this is your first time launching a SageMaker Notebook instance, you can think of it as your local machine with several tools already pre-installed, where we can run our scripts.

STEP # 01: Sign in to your AWS account and navigate to the SageMaker console by typing sagemaker in the search box similar to what we have in the following image:

Image 03 — Navigating to the SageMaker console

Choose Amazon SageMaker from the list of options available as highlighted in Image 03.

STEP # 02: In the sidebar, locate and click Notebook instances under Notebook:

Image 04 — Locating Notebook instances in the sidebar

 

STEP # 03: Next, locate and click the Create notebook instance button.

STEP # 04: On the Create notebook instance page, you’ll be asked to input a few configuration parameters before we’re able to launch the notebook instance where we’ll be running our code:

Image 05 — Creating a new SageMaker Notebook instance

Specify a Notebook instance name (for example, llm-demo) and select a Notebook instance type. For best results, you may select a relatively powerful instance type such as ml.m4.xlarge, where we will run the scripts. However, you may decide to choose a smaller instance type such as ml.t3.medium (slower but less expensive). Note that we will not be deploying our LLM inside this notebook instance, as the model will be deployed in a separate inference endpoint (which will require a more powerful instance type such as an ml.g5.2xlarge).

STEP # 05: Create an IAM role by choosing Create a new role from the list of options available in the IAM role dropdown (under Permissions and encryption).

Image 06 — Creating a new IAM role

This will open the following popup window. Given that we’re just working on a demo application, the default security configuration should do the trick. Click the Create role button.

Important Note: Make sure to have a more secure configuration when dealing with production (or staging) environments. We won’t dive deep into how cloud security works in this post, so feel free to look for other resources and references to further improve the current security setup. In case you are interested in learning more about cloud security, feel free to check my 3rd book “Building and Automating Penetration Testing Labs in the Cloud”. In the 7th chapter of the book (Setting Up an IAM Privilege Escalation Lab), you’ll learn how misconfigured machine learning environments on AWS can easily be exploited with the right sequence of steps.

STEP # 06: Click the Create notebook instance button. Wait for about 5-10 minutes for the SageMaker Notebook instance to be ready.

Important Note: Given that this will launch a resource that will run until you turn it off (or delete it), make sure to complete all the steps in the 2nd part of this post and clean up the created resources accordingly.

STEP # 07: Once the instance is ready, click Open Jupyter similar to what we have in Image 07:

Image 07 — Opening the Jupyter app


This will open the Jupyter application in a browser tab. If this is your first time using this application, do not worry as detailed instructions will be provided in the succeeding steps to help you get familiar with this tool.

STEP # 08: Create a new notebook by clicking New and selecting conda_python3 from the list of options available:

Image 08 — Creating a new notebook using the conda_python3 kernel

In case you are wondering what a kernel is, it is simply an “engine” or “environment” with pre-installed libraries and prerequisites that executes the code specified in the notebook cells. You’ll see this in action in a bit.

STEP # 09: At this point, we should see the following interface where we can run various types of scripts and blocks of code:

Image 09 — New Jupyter notebook

Feel free to rename the Jupyter Notebook before proceeding to the next step. If you have not used a Jupyter Notebook before, you may run your first line of code by typing the following in the text field and pressing SHIFT + ENTER.

 print('hello')

This should print the output hello right below the text field where we placed our code.
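If you would like one more quick sanity check before moving on, you can also confirm which Python environment the conda_python3 kernel is using. This is just an optional check (the exact version and interpreter path you see will differ depending on the instance and image):

import sys

# Print the Python version and interpreter path used by the conda_python3 kernel
print(sys.version)
print(sys.executable)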

Section II: Deploying an LLM using the SageMaker Python SDK to a SageMaker Inference Endpoint

STEP # 01: With everything ready, let’s start by installing a specific version of the SageMaker Python SDK:

 !pip install sagemaker==2.192.1

Here, we’ll be using v2.192.1. Pinning the version helps ensure that you won’t encounter breaking changes even if you work through the hands-on solutions in this post at a later date.

In case you are wondering what the SageMaker Python SDK is, it is simply a software development kit (SDK) with a set of tools and APIs that help developers interact with and utilize the different features and capabilities of Amazon SageMaker.
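If you want to double-check that the pinned version was installed correctly, you can print the installed version in a new cell. This is just an optional verification step (if the output still shows an older version, restarting the kernel after the pip install should pick up the new one):

import sagemaker

# Verify that the pinned SDK version (2.192.1) is the one being imported
print(sagemaker.__version__)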

STEP # 02: Next, let’s import and prepare a few prerequisites by running the following block of code: 

import sagemaker
import time

# Initialize a SageMaker session and retrieve the region and the
# execution role attached to this notebook instance
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

STEP # 03: Let’s import HuggingFaceModel and get_huggingface_llm_image_uri as well:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

STEP # 04: Next, let’s define the generate_random_label() function which we’ll use later when naming our resources:

from string import ascii_uppercase
from random import choice

def generate_random_label():
    # Build a random 10-character label from uppercase letters
    letters = ascii_uppercase

    return ''.join(choice(letters) for _ in range(10))

This will help us avoid naming conflicts when creating and configuring our resources.
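If you are curious about what this function produces, you can call it in a separate cell. The sample output below is just an illustration; you’ll get a different random 10-character string every time:

generate_random_label()

This should return something like 'QWERTYUASD'.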

STEP # 05: Use the get_huggingface_llm_image_uri function we imported in an earlier step to retrieve the container image URI for our LLM. In addition to this, let’s define the model_name we’ll use later when deploying our LLM to a SageMaker endpoint:

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface",
  region=region,
  version="1.1.0"
)

model_name = "MistralLite-" + generate_random_label()

STEP # 06: Before we proceed with the actual deployment, let’s quickly inspect what we have in the image_uri variable:

image_uri

This will output the following variable value:

'763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'

STEP # 07: Similarly, let’s check the variable value of model_name:

 model_name

This will give us the following:

'MistralLite-HKGKFRXURT'

Note that you’ll get a different model_name value since we’re randomly generating a portion of the model name.

STEP # 08: Let’s prepare the hub model configuration as well:

hub_env = {
  'HF_MODEL_ID': 'amazon/MistralLite',   # Hugging Face Hub model to load
  'HF_TASK': 'text-generation',          # inference task for the container
  'SM_NUM_GPUS': '1',                    # number of GPUs to use for the model
  "MAX_INPUT_LENGTH": '16000',           # maximum number of input tokens
  "MAX_TOTAL_TOKENS": '16384',           # maximum input + generated tokens
  "MAX_BATCH_PREFILL_TOKENS": '16384',   # batching limits for the TGI container
  "MAX_BATCH_TOTAL_TOKENS":  '16384',
}

Here, we specify that we’ll be using the MistralLite model. If this is your first time hearing of MistralLite, it is a fine-tuned Mistral-7B-v0.1 language model that performs significantly better on several long context retrieval and question answering tasks. For more information, feel free to check: https://huggingface.co/amazon/MistralLite.

STEP # 09: Let’s initialize the HuggingFaceModel object using some of the prerequisites and variables we’ve prepared in the earlier steps:

model = HuggingFaceModel(
    name=model_name,
    env=hub_env,
    role=role,
    image_uri=image_uri
)

STEP # 10: Now, let’s proceed with the deployment of the model using the deploy() method:

predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  endpoint_name=model_name,
)

Here, we’re using an ml.g5.2xlarge instance for our inference endpoint.

Given that this step may take about 10-15 minutes to complete, feel free to grab a cup of coffee or tea while waiting!

Important Note: Given that this will launch a resource that will run until you turn it off (or delete it), make sure to complete all the steps in the 2nd part of this post and clean up the created resources accordingly.
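While waiting, you can optionally check on the endpoint’s status from a separate notebook or terminal session (the deploy() call blocks the current cell) using the boto3 SageMaker client. This is just an optional sketch of a status check; it assumes the region and model_name values from the earlier steps and that your credentials are configured:

import boto3

# Describe the endpoint to see whether it is still being created
sm_client = boto3.client("sagemaker", region_name=region)
response = sm_client.describe_endpoint(EndpointName=model_name)

# The status transitions from "Creating" to "InService" once the endpoint is ready
print(response["EndpointStatus"])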

STEP # 11: Now, let’s prepare our first input data:

question = "What is the meaning of life?"

input_data = {
  "inputs": f"<|prompter|>{question}</s><|assistant|>",   # MistralLite prompt template
  "parameters": {
    "do_sample": False,          # greedy decoding for a deterministic response
    "max_new_tokens": 2000,      # upper limit on the number of generated tokens
    "return_full_text": False,   # return only the completion, not the prompt
  }
}

STEP # 12: With the prerequisites ready, let’s have our deployed LLM process the input data we prepared in the previous step:

result = predictor.predict(input_data)[0]["generated_text"]
print(result)

This should yield the following output:

The meaning of life is a philosophical question that has been debated by thinkers and philosophers for centuries. There is no single answer that can be definitively proven, as the meaning of life is subjective and can vary greatly from person to person.
...

Looks like our SageMaker Inference endpoint (where the LLM is deployed) is working just fine!
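If you plan to try a few more prompts, you could wrap the request in a small helper function. The ask() helper below is not part of the original walkthrough; it is just a convenience sketch that reuses the same prompt template and parameters we used above:

def ask(question, max_new_tokens=2000):
    # Wrap the question in the MistralLite prompt template and invoke the endpoint
    payload = {
        "inputs": f"<|prompter|>{question}</s><|assistant|>",
        "parameters": {
            "do_sample": False,
            "max_new_tokens": max_new_tokens,
            "return_full_text": False,
        }
    }
    return predictor.predict(payload)[0]["generated_text"]

print(ask("What is Amazon SageMaker?"))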

Conclusion

That wraps up the first part of this post. At this point, you should have a good idea of how to deploy LLMs using Amazon SageMaker. However, there’s more in store for us in the second part as we’ll build on top of what we have already and enable data capture to help us collect and analyze the data (that is, the input requests and output responses) that pass through the inference endpoint. In addition to this, we’ll prepare a demo user interface utilizing the ML model we deployed in this post.

If you’re looking for the link to the second part, here it is: Deploying LLMs with Amazon SageMaker - Part 2

We are just scratching the surface, as there is a long list of capabilities and features available in SageMaker. If you want to take things to the next level, feel free to read two of my books that focus heavily on SageMaker: “Machine Learning with Amazon SageMaker Cookbook” and “Machine Learning Engineering on AWS”.

Author Bio

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of 3 Australian-owned companies, as well as Director for Software Development and Engineering for multiple e-commerce startups. Years ago, he and his team won 1st place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has been sharing his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management. He is also the author of the books "Machine Learning with Amazon SageMaker Cookbook", "Machine Learning Engineering on AWS", and "Building and Automating Penetration Testing Labs in the Cloud". Due to his proven track record in leading digital transformation within organizations, he has been recognized as one of the prestigious Orange Boomerang: Digital Leader of the Year 2023 award winners.