You're reading from LLM Engineer's Handbook Master the art of engineering large language models from concept to production

Product type Paperback

Published in Oct 2024

Publisher Packt

ISBN-13 9781836200079

Length 522 pages

Edition 1st Edition

Languages

Python

Tools

AWS

Concepts

Artificial Intelligence

Authors (3):

Maxime Labonne

Paul Iusztin

Alex Vesa

Deploying the LLM Twin’s pipelines to the cloud

This section shows you how to deploy all the LLM Twin’s pipelines to the cloud. For the whole system to work there, we must deploy the entire infrastructure. Thus, we will have to:

1. Set up an instance of MongoDB serverless.
2. Set up an instance of Qdrant serverless.
3. Deploy the ZenML pipelines, container, and artifact registry to AWS.
4. Containerize the code and push the Docker image to a container registry.

Note that the training and inference pipelines already work with AWS SageMaker. Thus, by following the preceding four steps, we ensure that our whole system is on the cloud, ready to scale and serve our imaginary clients.

What are the deployment costs?

We will stick to the free versions of the MongoDB, Qdrant, and ZenML services. As for AWS, we will mostly stick to its free tier for running the ZenML pipelines. The SageMaker training and inference components are more costly, but we won’t run them in this section. Thus, what we show you in the following sections will generate minimal AWS costs (a few dollars at most).

Understanding the infrastructure

Before diving into the step-by-step tutorial, where we will show you how to set up all the necessary components, let’s briefly review our infrastructure and how all the elements interact. This overview will make the tutorials below easier to follow.

As shown in Figure 11.5, we have a few services to set up. To keep things simple, for MongoDB and Qdrant, we will leverage their serverless freemium version. As for ZenML, we will leverage the free trial of the ZenML cloud, which will help us orchestrate all the pipelines in the cloud. How will it do that?

By leveraging the ZenML cloud, we can quickly allocate all the required AWS resources to run, scale, and store the ML pipeline. It will help us spin up, with a few clicks, the following AWS components:

An ECR service for storing Docker images
An S3 object storage for storing all our artifacts and models
SageMaker Orchestrator for orchestrating, running, and scaling all our ML pipelines

Figure 11.5: Infrastructure flow

Now that we understand what the essential resources of our infrastructure are, let’s look over the core flow of running a pipeline in the cloud that we will learn to implement, presented in Figure 11.5:

1. Build a Docker image that contains all the system dependencies, the project dependencies, and the LLM Twin application.
2. Push the Docker image to ECR, where SageMaker can access it.
3. Trigger any pipeline implemented in this book, either from the CLI on our local machine or from ZenML’s dashboard.
4. Each step of the ZenML pipeline is mapped to a SageMaker job that runs on an AWS EC2 virtual machine (VM). Based on the dependencies between the directed acyclic graph (DAG) steps, some will run in parallel and others sequentially.
5. When running a step, SageMaker pulls the Docker image pushed to ECR in step 2. Based on the pulled image, it creates a Docker container that executes the pipeline step.
6. As the job is executed, it can access the S3 artifact storage, MongoDB, and Qdrant vector DB to query or push data. The ZenML dashboard is a key tool, providing real-time updates on the pipeline’s progress and ensuring a clear view of the process.
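
The way DAG dependencies decide which steps can run in parallel and which must wait can be sketched in a few lines of Python. This is only an illustration built on the standard library's `graphlib`, not ZenML's or SageMaker's actual scheduler; the step names and their dependencies are hypothetical, and a real orchestrator may start a step as soon as its own dependencies finish rather than waiting for a whole "wave".

```python
from graphlib import TopologicalSorter

# Hypothetical step names: each key depends on the steps listed in its set.
dag = {
    "crawl_links": set(),
    "load_cleaned_documents": {"crawl_links"},
    "chunk_and_embed": {"load_cleaned_documents"},
    "generate_instruct_dataset": {"load_cleaned_documents"},
    "load_to_vector_db": {"chunk_and_embed"},
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave have all of their
    dependencies satisfied and could therefore run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step that is unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished to unblock its dependents
    return waves


waves = execution_waves(dag)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In this toy graph, the two steps that depend only on `load_cleaned_documents` land in the same wave, while everything else runs one after the other.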

Now that we know how the infrastructure works, let’s start by setting up MongoDB, Qdrant, and the ZenML cloud.

What AWS cloud region should I choose?

In our tutorials, all the services will be deployed to AWS within the Frankfurt (eu-central-1) region. You can select another region, but be consistent across all the services to ensure faster responses between components and reduce potential errors.

How should I manage changes in the services’ UIs?

Unfortunately, MongoDB, Qdrant, or other services may change their UI or naming conventions. As we can’t update this book each time that happens, please refer to their official documentation to check anything that differs from our tutorial. We apologize for the inconvenience, but it is not in our control.

Setting up MongoDB

We will show you how to create and integrate a free MongoDB cluster into our projects. To do so, these are the steps you have to follow:

1. Go to their site at https://www.mongodb.com and create an account.
2. In the left panel, go to Deployment | Database and click Build a Cluster.
3. Within the creation form, do the following:
   1. Choose an M0 Free cluster.
   2. Call your cluster twin.
   3. Choose AWS as your provider.
   4. Choose Frankfurt (eu-central-1) as your region. You can choose another region, but be careful to choose the same region for all future AWS services.
   5. Leave the rest of the attributes with their default values.
   6. In the bottom right, click the green Create Deployment button.
4. To test that your newly created MongoDB cluster works fine, we must connect to it from our local machine. We used the MongoDB VS Code extension to do so, but you can use any other tool. Thus, from their Choose a connection method setup flow, choose MongoDB for VS Code. Then, follow the steps provided on their site.
5. To connect, you must paste the DB connection URL in the VS Code extension (or another tool of your liking). It contains your username, password, and cluster URL, similar to this one: mongodb+srv://<username>:<password>@twin.vhxy1.mongodb.net. Make sure to save this URL somewhere you can copy it from later.
6. If you don’t know or want to change your password, go to Security → Quickstart in the left panel. There, you can edit your login credentials. Be sure to save them somewhere safe, as you won’t be able to access them later.
7. After verifying that your connection works, go to Security → Network Access in the left panel and click ADD IP ADDRESS. Then click ALLOW ACCESS FROM ANYWHERE and hit Confirm. For simplicity, we allow any machine from any IP to access our MongoDB cluster. This ensures that our pipelines can query or write to the DB without any additional complex networking setup. It’s not the safest option for production, but for our example, it’s perfectly fine.
8. The final step is to return to your project and open your .env file. Now, either add or replace the DATABASE_HOST variable with your MongoDB connection string. It should look something like this: DATABASE_HOST=mongodb+srv://<username>:<password>@twin.vhxy1.mongodb.net.
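
Connection strings are easy to break with a stray space or a password that contains characters such as `@` or `/`, which have to be percent-encoded inside a URI. The following is a minimal sketch of how the value for DATABASE_HOST could be assembled safely. The helper name `build_mongodb_uri` and the sample credentials are hypothetical; only the standard library's `urllib.parse.quote_plus` is used.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(username: str, password: str, cluster_host: str) -> str:
    """Assemble a mongodb+srv connection string, percent-encoding the credentials."""
    host = cluster_host.strip()
    if not host or any(ch.isspace() for ch in host):
        raise ValueError("cluster host must be non-empty and contain no whitespace")
    # quote_plus escapes reserved characters such as '@', ':' and '/'.
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical credentials; the host is the example cluster URL used above.
uri = build_mongodb_uri("twin_user", "p@ss/word", "twin.vhxy1.mongodb.net")
print(uri)  # mongodb+srv://twin_user:p%40ss%2Fword@twin.vhxy1.mongodb.net
```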

That’s it! Now, instead of reading and writing from your local MongoDB, you will do it from the cloud MongoDB cluster we just created. Let’s repeat a similar process with Qdrant.

Setting up Qdrant

We have to repeat a similar process to what we did for MongoDB. Thus, to create a Qdrant cluster and hook it to our project, follow these steps:

1. Go to Qdrant at https://cloud.qdrant.io/ and create an account.
2. In the left panel, go to Clusters and click Create.
3. Fill out the cluster creation form with the following:
   1. Choose the Free version of the cluster.
   2. Choose GCP as the cloud provider (while writing the book, it was the only one allowed for a free cluster).
   3. Choose Frankfurt as the region (or the same region as you chose for MongoDB).
   4. Name the cluster twin.
   5. Leave the rest of the attributes with their default values and click Create.
4. Access the cluster in the Data Access Control section in the left panel.
5. Click Create and choose your twin cluster to create a new access token. Copy the newly created token somewhere safe, as you won’t be able to access it anymore.
6. You can run their example from Usage Examples to test that your connection works fine.
7. Go back to the Clusters section of Qdrant and open your newly created twin cluster. You will have access to the cluster’s endpoint, which you need to configure Qdrant in your code.

You can visualize your Qdrant collections and documents by clicking Open Dashboard and entering your API Key as your password. The Qdrant cluster dashboard will now be empty, but after running the pipelines, you will see all the collections, as shown here:

Figure 11.6: Qdrant cluster dashboard example after being populated with two collections.

Finally, return to your project and open your .env file. Now, we must fill in a couple of environment variables as follows:

USE_QDRANT_CLOUD=true
QDRANT_CLOUD_URL=<the endpoint URL found at step 7>
QDRANT_APIKEY=<the access token created at step 5>
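
Before running any pipeline, it can help to check that the .env file really contains the cloud settings and that no placeholder was left behind. The sketch below is a standalone check, not the project's own settings loader: the parser is deliberately simplistic (plain KEY=VALUE lines only), and the helper names and the sample content are hypothetical. The variable names are the ones used in this section.

```python
REQUIRED_KEYS = ("DATABASE_HOST", "USE_QDRANT_CLOUD", "QDRANT_CLOUD_URL", "QDRANT_APIKEY")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values: dict) -> list:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    return [
        key for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("<")
    ]


# Hypothetical .env content with one placeholder left in and one key missing.
sample = """
DATABASE_HOST=mongodb+srv://user:secret@twin.vhxy1.mongodb.net
USE_QDRANT_CLOUD=true
QDRANT_CLOUD_URL=<the endpoint URL>
"""
settings = parse_env(sample)
print(missing_settings(settings))  # ['QDRANT_CLOUD_URL', 'QDRANT_APIKEY']
```

To check a real file, you could pass `open(".env").read()` to `parse_env` instead of the sample string.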

That’s it! Instead of reading and writing from your local Qdrant vector DB, you will do it from the cloud Qdrant cluster we just created. Just to be sure that everything works fine, run the end-to-end data pipeline with the cloud version of MongoDB and Qdrant as follows:

poetry poe run-end-to-end-data-pipeline

The last step is setting up the ZenML cloud and deploying all our infrastructure to AWS.

Setting up the ZenML cloud

Setting up the ZenML cloud and the AWS infrastructure is a multi-step process. First, we will set up a ZenML cloud account, then the AWS infrastructure through the ZenML cloud, and, finally, we will bundle our code in a Docker image to run it in AWS SageMaker.

Let’s start with setting up the ZenML cloud:

1. Go to the ZenML cloud at https://cloud.zenml.io and make an account. They provide a seven-day free trial, which is enough to run our examples.
2. Fill out their onboarding form and create an organization with a unique name and a tenant called twin. A tenant refers to a deployment of ZenML in a fully isolated environment. Wait a few minutes until your tenant server is up before proceeding to the next step.
3. If you want to, you can go through their Quickstart Guide to understand how the ZenML cloud works with a simpler example. It is not required to go through it to deploy the LLM Twin application, but we recommend it to ensure everything works fine.
4. At this point, we assume that you have gone through the Quickstart Guide. Otherwise, you might encounter issues during the next steps. To connect our project with this ZenML cloud tenant, return to the project and run the zenml connect command provided in the dashboard. It looks similar to the following example but with a different URL: zenml connect --url https://0c37a553-zenml.cloudinfra.zenml.io.
5. To ensure everything works fine, run a random pipeline from your code. Note that at this point, we are still running it locally, but instead of logging the results to the local server, we log everything to the cloud version:
   ```
   poetry poe run-digital-data-etl
   ```
6. Go to the Pipelines section in the left panel of the ZenML dashboard. If everything worked fine, you should see the pipeline you ran in Step 5 there.

Ensure that your ZenML server version matches your local ZenML version. For example, when we wrote this book, both were version 0.64.0. If they don’t match, you might encounter strange behavior, or it might not work correctly. The easiest fix is to go to your pyproject.toml file, find the zenml dependency, and update it with the version of your server. Then run poetry lock --no-update && poetry install to update your local virtual environment.
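
The version check described above can also be scripted. The sketch below reads the locally installed ZenML version through the standard library's `importlib.metadata` and compares it with a server version that you copy by hand from the dashboard. The helper names are hypothetical, and the parser assumes plain dotted numeric versions such as 0.64.0 (no pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version


def local_zenml_version():
    """Return the installed zenml version, or None if zenml is not installed."""
    try:
        return version("zenml")
    except PackageNotFoundError:
        return None


def parse_version(text: str) -> tuple:
    """Turn '0.64.0' into (0, 64, 0); assumes a plain dotted numeric version."""
    return tuple(int(part) for part in text.strip().split("."))


def versions_match(local: str, server: str) -> bool:
    """True when both version strings denote exactly the same release."""
    return parse_version(local) == parse_version(server)


# The server version is an assumption here: copy yours from the ZenML dashboard.
server_version = "0.64.0"
installed = local_zenml_version()
if installed is None:
    print("zenml is not installed in this environment")
elif versions_match(installed, server_version):
    print(f"local and server versions match: {installed}")
else:
    print(f"mismatch: local {installed} vs. server {server_version}")
```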

To ship the code to AWS, you must create a ZenML stack. A stack is a set of components, such as the underlying orchestrator, object storage, and container registry, that ZenML needs under the hood to run the pipelines. Intuitively, you can see your stack as your infrastructure. While working locally, ZenML offers a default stack that allows you to quickly develop your code and test things locally. However, by defining different stacks, you can quickly switch between different infrastructure environments, such as local and AWS runs, which we will showcase in this section.

Before starting this section, ensure you have an AWS account with admin permissions ready.

With that in mind, let’s create an AWS stack for our project. To do so, follow the next steps:

1. In the left panel, click on the Stacks section and hit the New Stack button.
2. You will have multiple options for creating a stack, but the easiest is creating one from scratch within the in-browser experience, which doesn’t require additional preparations. This is not very flexible, but it is enough to host our project. Thus, choose Create New Infrastructure → In-browser Experience.
3. Then, choose AWS as your cloud provider.
4. Choose Europe (Frankfurt) eu-central-1 as your location, or the region you used to set up MongoDB and Qdrant.
5. Name it aws-stack. It is essential to name it exactly like this so that the commands that we will use work.
6. Now ZenML will create a set of IAM roles to give permissions to all the other components to communicate with each other, an S3 bucket as your artifact storage, an ECR repository as your container registry, and SageMaker as your orchestrator.
7. Click Next.
8. Click the Deploy to AWS button. It will open a CloudFormation page on AWS. ZenML leverages CloudFormation (an infrastructure as code, or IaC, tool) to create all the AWS resources we enumerated in Step 6.
9. At the bottom, check all the boxes to acknowledge that AWS CloudFormation will create AWS resources on your behalf. Finally, click the Create stack button. Now, we must wait for a couple of minutes for AWS CloudFormation to spin up all the resources.
10. Return to the ZenML page and click the Finish button.

By leveraging ZenML, we efficiently deployed the entire AWS infrastructure for our ML pipelines. We began with a basic example, sacrificing some control. However, if you seek more control, ZenML offers the option to use Terraform (an IaC tool) to fully control your AWS resources or to connect ZenML with your current infrastructure.

Before moving to the next step, let’s have a quick recap of the AWS resources we just created:

An IAM role is an AWS identity with permissions policies that define what actions are allowed or denied for that role. It is used to grant access to AWS services without needing to share security credentials.
S3 is a scalable and secure object storage service that allows storing and retrieving files from anywhere on the web. It is commonly used for data backup, content storage, and data lakes. It’s more scalable and flexible than Google Drive.
ECR is a fully managed Docker container registry that makes storing, managing, and deploying Docker container images easy.
SageMaker is a fully managed service that allows developers and data scientists to quickly build, train, and deploy ML models.
SageMaker Orchestrator is a feature of SageMaker that helps automate the execution of ML workflows, manage dependencies between steps, and ensure the reproducibility and scalability of model training and deployment pipelines. Other similar tools are Prefect, Dagster, Metaflow, and Airflow.
CloudFormation is a service that allows you to model and set up your AWS resources so that you can spend less time managing them and more time focusing on your applications. It automates the process of provisioning AWS infrastructure using templates.

Before running the ML pipelines, the last step is to containerize the code and prepare a Docker image that packages our dependencies and code.

Containerize the code using Docker

So far, we have defined our infrastructure, MongoDB, Qdrant, and AWS, for storage and computing. The last step is to find a way to take our code and run it on top of this infrastructure. The most popular solution is Docker, a tool that allows us to create an isolated environment (a container) that contains everything we need to run our application, such as system dependencies, Python dependencies, and the code.

We defined our Docker image at the project’s root in the Dockerfile. This is the standard naming convention for Docker. Before digging into the code, if you want to build the Docker image yourself, ensure that you have Docker installed on your machine. If you don’t have it, you can install it by following the instructions provided here: https://docs.docker.com/engine/install. Now, let’s look at the content of the Dockerfile step by step.

The Dockerfile begins by specifying the base image, which is a lightweight version of Python 3.11 based on the Debian Bullseye distribution. The environment variables are then set up to configure various aspects of the container, such as the workspace directory, turning off Python bytecode generation, and configuring Python to output directly to the terminal. Additionally, the version of Poetry to be installed is specified, and a few environment variables are set to ensure that package installations are non-interactive, which is vital for automated builds.

FROM python:3.11-slim-bullseye AS release
ENV WORKSPACE_ROOT=/app/
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV POETRY_VERSION=1.8.3
ENV DEBIAN_FRONTEND=noninteractive
ENV POETRY_NO_INTERACTION=1

Next, we install Google Chrome in the container. The installation process begins by updating the package lists and installing essential tools like gnupg, wget, and curl. The Google Linux signing key is added, and the Google Chrome repository is configured. After another package list update, the stable version of Google Chrome is installed. The package lists are removed after installation to keep the image as small as possible.

RUN apt-get update -y && \
    apt-get install -y gnupg wget curl --no-install-recommends && \
    wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg && \
    echo "deb [signed-by=/usr/share/keyrings/google-linux-signing-key.gpg] https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list && \
    apt-get update -y && \
    apt-get install -y google-chrome-stable && \
    rm -rf /var/lib/apt/lists/*

Following the Chrome installation, other essential system dependencies are installed. Once these packages are installed, the package cache is cleaned up to reduce the image size further.

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends build-essential \
    gcc \
    python3-dev \
    libglib2.0-dev \
    libnss3-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

Poetry, the dependency management tool, is then installed using pip. The --no-cache-dir option prevents pip from caching packages, helping to keep the image smaller. After installation, Poetry is configured to use up to 20 parallel workers when installing packages, which can speed up the installation process.

RUN pip install --no-cache-dir "poetry==$POETRY_VERSION"
RUN poetry config installer.max-workers 20

The working directory inside the container is set to WORKSPACE_ROOT, which defaults to /app/, where the application code will reside. The pyproject.toml and poetry.lock files define the Python project’s dependencies and are copied into this directory.

WORKDIR $WORKSPACE_ROOT
COPY pyproject.toml poetry.lock $WORKSPACE_ROOT

With the dependency files in place, the project’s dependencies are installed using Poetry. The configuration turns off the creation of a virtual environment, meaning the dependencies will be installed directly into the container’s Python environment. The installation excludes development dependencies and prevents caching to minimize space usage.

Additionally, the poethepoet plugin is installed to help manage tasks within the project. Finally, any remaining Poetry cache is removed to keep the container as lean as possible.

RUN poetry config virtualenvs.create false && \
    poetry install --no-root --no-interaction --no-cache --without dev && \
    poetry self add 'poethepoet[poetry_plugin]' && \
    rm -rf ~/.cache/pypoetry/cache/ && \
    rm -rf ~/.cache/pypoetry/artifacts/

In the final step, the entire project directory from the host machine is copied into the container’s working directory. This step ensures that all the application files are available within the container.

One important trick when writing a Dockerfile is to decouple your installation steps from copying the rest of the files. This is useful because each Docker instruction produces a cached layer stacked on top of the previous ones. Thus, whenever one layer changes, Docker rebuilds that layer and every layer that comes after it, while the layers before it are reused from the cache. Because you rarely change your system and project dependencies but mostly change your code, copying your project files in the last step makes rebuilding Docker images fast by taking full advantage of the caching mechanism.

COPY . $WORKSPACE_ROOT
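
To make the caching argument concrete, here is a toy Python model of the rule "the first changed layer and everything after it gets rebuilt". It is a simplification for illustration, not Docker's real cache implementation, and the layer labels are informal descriptions of Dockerfile steps rather than actual instructions.

```python
def rebuilt_layers(layers, changed):
    """Return the layers that must be re-executed: the first changed layer
    and every layer that comes after it in the Dockerfile."""
    for index, layer in enumerate(layers):
        if layer in changed:
            return layers[index:]
    return []  # nothing changed, so everything comes from the cache


# Ordering used in this chapter: dependencies first, project code last.
deps_first = [
    "install system packages",
    "copy pyproject.toml + poetry.lock",
    "poetry install",
    "copy project code",
]

# Hypothetical alternative: the code is copied before installing dependencies.
code_first = [
    "install system packages",
    "copy project code",
    "poetry install",
]

code_change = {"copy project code"}
print(rebuilt_layers(deps_first, code_change))  # ['copy project code']
print(rebuilt_layers(code_first, code_change))  # ['copy project code', 'poetry install']
```

With the code copied last, a code change only re-runs the cheap copy step, whereas the alternative ordering would reinstall every Python dependency on each rebuild.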

This Dockerfile is designed to create a clean, consistent Python environment with all necessary dependencies. It allows the project to run smoothly in any environment that supports Docker.

The last step is to build the Docker image and push it to the ECR created by ZenML. To build the Docker image from the root of the project, run the following:

docker buildx build --platform linux/amd64 -t llmtwin -f Dockerfile .

We must build the image for a Linux platform (linux/amd64), as the Google Chrome installer we used inside the Dockerfile works only on Linux. Even if you use a macOS or Windows machine, Docker can still build and run it by virtualizing a Linux environment.

The tag of the newly created Docker image is llmtwin. We also provide this build command under a poethepoet command:

poetry poe build-docker-image

Now, let’s push the Docker image to ECR. To do so, navigate to your AWS console and then to the ECR service. From there, find the newly created ECR repository. It should be prefixed with zenml-*, as shown here:

Figure 11.7: AWS ECR example

The first step is to authenticate to ECR. For this to work, ensure that you have the AWS CLI installed and configured with your admin AWS credentials, as explained in Chapter 2:

AWS_REGION=<your_region> # e.g. AWS_REGION=eu-central-1
AWS_ECR_URL=<your_ecr_url> # the ECR URL copied from the AWS ECR dashboard
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ECR_URL}

You can get your current AWS_REGION by clicking on the toggle in the top-right corner, as seen in Figure 11.8. Also, you can copy the ECR URL to fill the AWS_ECR_URL variable from the main AWS ECR dashboard, as illustrated in Figure 11.7. After running the previous command, you should see the message Login Succeeded on the CLI.

Figure 11.8: AWS region and account details

Now we have to add another tag to the llmtwin Docker image that signals the Docker registry we want to push it to:

docker tag llmtwin ${AWS_ECR_URL}:latest

Finally, we push it to ECR by running:

docker push ${AWS_ECR_URL}:latest
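
The three commands above (log in, tag, push) differ only in the region and the repository URL, so they are easy to generate programmatically. The sketch below only builds the command strings and does not execute anything. The function name is hypothetical, the repository URL is an example value, and, unlike the commands above, the login command targets the registry host (the part of the URL before the first slash), which is an assumption of this sketch.

```python
import shlex


def ecr_push_commands(repository_url, region, local_image="llmtwin", tag="latest"):
    """Build the login, tag, and push commands for an ECR repository URL."""
    registry = repository_url.split("/", 1)[0]  # <account>.dkr.ecr.<region>.amazonaws.com
    target = f"{repository_url}:{tag}"
    return [
        f"aws ecr get-login-password --region {shlex.quote(region)} | "
        f"docker login --username AWS --password-stdin {shlex.quote(registry)}",
        f"docker tag {shlex.quote(local_image)} {shlex.quote(target)}",
        f"docker push {shlex.quote(target)}",
    ]


# Example repository URL; replace it with the one from your own ECR dashboard.
commands = ecr_push_commands(
    "992382797823.dkr.ecr.eu-central-1.amazonaws.com/zenml-rlwlcs",
    region="eu-central-1",
)
for command in commands:
    print(command)
```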

After the upload is finished, return to your AWS ECR dashboard and open your ZenML repository. The Docker image should appear, as shown here:

Figure 11.9: AWS ECR repository example after the Docker image is pushed

For every change in the code that you need to ship and test, you would have to go through all these steps, which are tedious and error-prone. The Adding LLMOps to the LLM Twin section of this chapter will teach us how to automate these steps within the CD pipeline using GitHub Actions. Still, we first wanted to go through them manually to fully understand the behind-the-scenes process and not treat it as a black box. Understanding these details is vital for debugging your CI/CD pipelines, where you must understand the error messages and how to fix them.

Now that we have built our Docker image and pushed it to AWS ECR, let’s deploy it to AWS.

Run the pipelines on AWS

We are very close to running the ML pipelines on AWS, but we have to go through a few final steps. Let’s switch from the default ZenML stack to the AWS one we created in this chapter. From the root of your project, run the following in the CLI:

zenml stack set aws-stack

Return to your AWS ECR ZenML repository and copy the image URI as shown in Figure 11.9. Then, go to the configs directory, open the configs/end_to_end_data.yaml file, and update the settings.docker.parent_image attribute with your ECR URL, as shown below:

settings:
  docker:
    parent_image: <YOUR ECR URL> #e.g., 992382797823.dkr.ecr.eu-central-1.amazonaws.com/zenml-rlwlcs:latest
    skip_build: True

We’ve configured the pipeline to always use the latest Docker image available in ECR. This means that the pipeline will automatically pick up the latest changes made to the code whenever we push a new image.

We must export all the credentials from our .env file to ZenML secrets, a feature that safely stores your credentials and makes them accessible within your pipelines:

poetry poe export-settings-to-zenml

The last step is to configure the orchestrator to run the pipelines asynchronously, so that we don’t have to wait until they are finished, which might result in timeout errors:

zenml orchestrator update aws-stack --synchronous=False

Now that ZenML knows to use the AWS stack and our custom Docker image, and has access to our credentials, we are finally done with the setup. Run the end-to-end-data-pipeline with the following command:

poetry poe run-end-to-end-data-pipeline

Now you can go to ZenML Cloud → Pipelines → end_to_end_data and open the latest run. On the ZenML dashboard, you can visualize the latest state of the pipeline, as seen in Figure 11.10. Note that this pipeline runs all the data-related pipelines in a single run.

In the Adding LLMOps to the LLM Twin section, we will explain why we compressed all the steps into a single pipeline.

Figure 11.10: ZenML example of running the end-to-end-data-pipeline

You can click on any running block and find details about the run, the code used for that specific step, and the logs for monitoring and debugging, as illustrated in Figure 11.11:

Figure 11.11: ZenML step metadata example

To run other pipelines, you have to update the settings.docker.parent_image attribute in their config file under the configs/ directory.

To find even more details about the runs, you can go to AWS SageMaker. In the left panel, click SageMaker dashboard, and on the right, in the Processing column, click on the green Running section, as shown in Figure 11.12.

This will open a list of all the processing jobs that execute your ZenML pipelines.

Figure 11.12: SageMaker dashboard

If you want to run the pipelines locally again, use the following CLI command:

poetry poe set-local-stack

If you want to disconnect from the ZenML cloud dashboard and use the local version again, run the following:

zenml disconnect

Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker

Let’s assume you’ve encountered a ResourceLimitExceeded error after running a ZenML pipeline on SageMaker using the AWS stack. In this case, you have to explicitly ask AWS to give you access to a specific type of AWS EC2 VM.

ZenML uses, by default, ml.t3.medium EC2 machines, which are part of the AWS freemium tier. However, some AWS accounts cannot access these VMs by default. To check your access, search your AWS console for Service Quotas.

Then, in the left panel, click on AWS services, search for Amazon SageMaker, and then for ml.t3.medium. In Figure 11.13, you can see our quotas for these types of machines. If yours is 0, you should request that AWS increase them to numbers similar to those shown in the Applied account-level quota value column of Figure 11.13. The whole process is free of charge and only requires a few clicks. Unfortunately, you might have to wait anywhere from a few hours to a day until AWS accepts your request.

Figure 11.13: SageMaker—ml.t3.medium expected quotas

You can find step-by-step instructions on how to solve this error and request new quotas at this link: https://repost.aws/knowledge-center/sagemaker-resource-limit-exceeded-error.

If you changed the values from your .env file and want to update the ZenML secrets with them, first run the following CLI command to delete the old secrets: