MLOps and LLMOps tooling

This section quickly presents all the MLOps and LLMOps tools we will use throughout the book and their role in building ML systems using MLOps best practices. At this point, we don’t aim to detail every MLOps component we will use to implement the LLM Twin use case, such as model registries and orchestrators, but only to give a quick idea of what they are and how we use them. As we develop the LLM Twin project throughout the book, you will see hands-on examples of how each of these tools is used. In Chapter 11, we will dive deeply into the theory of MLOps and LLMOps and connect all the dots. As both fields are highly practical, we leave the theory for the end, where it will be much easier to understand after you have gone through the LLM Twin implementation.

Also, this section is not dedicated to showing you how to set up each tool. It focuses primarily on what each tool is used for and highlights the core features used throughout this book.

Still, using Docker, you can quickly spin up the whole infrastructure on your machine. If you want to run the steps within the book yourself, you can host the application locally with these three simple steps:

  1. Have Docker 27.1.1 (or higher) installed.
  2. Fill your .env file with all the necessary credentials as explained in the repository README.
  3. Run poetry poe local-infrastructure-up to locally spin up ZenML (http://127.0.0.1:8237/) and the MongoDB and Qdrant databases.

You can read more details on how to run everything locally in the LLM-Engineers-Handbook repository README: https://github.com/PacktPublishing/LLM-Engineers-Handbook. Within the book, we will also show you how to deploy each component to the cloud.

Hugging Face: model registry

A model registry is a centralized repository that manages ML models throughout their lifecycle. It stores models along with their metadata, version history, and performance metrics, serving as a single source of truth. In MLOps, a model registry is crucial for tracking, sharing, and documenting model versions, facilitating team collaboration. Also, it is a fundamental element in the deployment process as it integrates with continuous integration and continuous deployment (CI/CD) pipelines.

We used Hugging Face as our model registry, as we can leverage its ecosystem to easily share our fine-tuned LLM Twin models with anyone who reads the book. Also, by following the Hugging Face model registry interface, we can easily integrate the models with the frameworks in the LLM ecosystem, such as Unsloth for fine-tuning and SageMaker for inference.
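To get a feel for how the Hugging Face Hub behaves as a model registry, here is a minimal, hypothetical sketch of pushing and pulling a fine-tuned model (the local path and the your-username/llm-twin repository name are placeholders, not the book’s actual artifacts):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a locally fine-tuned model and publish it to the Hub as a new version.
model = AutoModelForCausalLM.from_pretrained("./output/llm-twin-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./output/llm-twin-finetuned")
model.push_to_hub("your-username/llm-twin")
tokenizer.push_to_hub("your-username/llm-twin")

# Anyone can later pull the same versioned model straight from the registry.
model = AutoModelForCausalLM.from_pretrained("your-username/llm-twin")
tokenizer = AutoTokenizer.from_pretrained("your-username/llm-twin")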

Our fine-tuned LLMs are available on Hugging Face at:

Figure 2.1: Hugging Face model registry example

For a quick demo, we have them available on Hugging Face Spaces:

Most ML tools provide model registry features. For example, ZenML, Comet, and SageMaker, which we will present in future sections, also offer their own model registries. They are good options, but we picked Hugging Face solely because of its ecosystem, which makes models easy to share and integrate across the open-source landscape. Thus, you will usually select the model registry that integrates best with your project’s tooling and requirements.

ZenML: orchestrator, artifacts, and metadata

ZenML acts as the bridge between ML and MLOps. It offers multiple MLOps features that make your ML pipelines easier to trace, reproduce, deploy, and maintain. At its core, it is designed to create reproducible workflows in machine learning. It addresses the challenge of transitioning from exploratory research in Jupyter notebooks to a production-ready ML environment, tackling replication issues such as versioning difficulties, reproducing experiments, organizing complex ML workflows, bridging the gap between training and deployment, and tracking metadata. Thus, ZenML’s main features are orchestrating ML pipelines, storing and versioning pipeline outputs as artifacts, and attaching metadata to those artifacts for better observability.

Instead of being yet another ML platform, ZenML introduced the concept of a stack, which allows you to run ZenML on multiple infrastructure options. A stack lets you connect ZenML to different cloud services, such as:

  • An orchestrator and compute engine (for example, AWS SageMaker or Vertex AI)
  • Remote storage (for instance, AWS S3 or Google Cloud Storage buckets)
  • A container registry (for example, Docker Registry or AWS ECR)

Thus, ZenML acts as the glue that brings all your infrastructure and tools together in one place through its stack feature, allowing you to iterate quickly through your development process and easily monitor your entire ML system. The beauty of this is that ZenML doesn’t vendor-lock you into any cloud platform: it completely decouples your Python code from the infrastructure it runs on. For example, in our LLM Twin use case, we used the AWS stack:

  • SageMaker as our orchestrator and compute
  • S3 as our remote storage used to store and track artifacts
  • ECR as our container registry

However, the Python code contains no S3 or ECR particularities, as ZenML takes care of them. Thus, we can easily switch to other providers, such as Google Cloud Storage or Azure. For more details on ZenML stacks, you can start here: https://docs.zenml.io/user-guide/production-guide/understand-stacks.

We will focus only on the ZenML features used throughout the book, such as orchestrating, artifacts, and metadata. For more details on ZenML, check out their starter guide: https://docs.zenml.io/user-guide/starter-guide.

The local version of the ZenML server comes installed as a Python package. Thus, when running poetry install, it installs a ZenML debugging server that you can use locally. In Chapter 11, we will show you how to use their cloud serverless option to deploy the ML pipelines to AWS.

Orchestrator

An orchestrator is a system that automates, schedules, and coordinates all your ML pipelines. It ensures that each pipeline—such as data ingestion, preprocessing, model training, and deployment—executes in the correct order and handles dependencies efficiently. By managing these processes, an orchestrator optimizes resource utilization, handles failures gracefully, and enhances scalability, making complex ML pipelines more reliable and easier to manage.

How does ZenML work as an orchestrator? It works with pipelines and steps. A pipeline is a high-level object that contains multiple steps. A function becomes a ZenML pipeline by being decorated with @pipeline, and a step when decorated with @step. This is a standard pattern when using orchestrators: you have a high-level function, often called a pipeline, that calls multiple units/steps/tasks.

Let’s explore how we can implement a ZenML pipeline using one of the ML pipelines implemented for the LLM Twin project. In the code snippet below, we defined a ZenML pipeline that queries the database for a user based on their full name and crawls all the links provided for that user:

from zenml import pipeline
from steps.etl import crawl_links, get_or_create_user
@pipeline
def digital_data_etl(user_full_name: str, links: list[str]) -> None:
    user = get_or_create_user(user_full_name)
    crawl_links(user=user, links=links)

You can run the pipeline with the following CLI command: poetry poe run-digital-data-etl. To visualize the pipeline run, you can go to your ZenML dashboard (at http://127.0.0.1:8237/) and, on the left panel, click on the Pipelines tab and then on the digital_data_etl pipeline, as illustrated in Figure 2.2:

Figure 2.2: ZenML Pipelines dashboard

After clicking on the digital_data_etl pipeline, you can visualize all the previous and current pipeline runs, as seen in Figure 2.3. You can see which one succeeded, failed, or is still running. Also, you can see the stack used to run the pipeline, where the default stack is the one used to run your ML pipelines locally.

Figure 2.3: ZenML digital_data_etl pipeline dashboard. Example of a specific pipeline

Now, after clicking on the latest digital_data_etl pipeline run (or any other run that succeeded or is still running), we can visualize the pipeline’s steps, outputs, and insights, as illustrated in Figure 2.4. This structure is often called a directed acyclic graph (DAG). More on DAGs in Chapter 11.

Figure 2.4: ZenML digital_data_etl pipeline run dashboard (example of a specific pipeline run)

By clicking on a specific step, you can get more insights into its code and configuration. It even aggregates the logs output by that specific step to avoid switching between tools, as shown in Figure 2.5.

Figure 2.5: Example of insights from a specific step of the digital_data_etl pipeline run

Now that we understand how to define a ZenML pipeline and how to look it up in the dashboard, let’s quickly look at how to define a ZenML step. In the code snippet below, we defined the get_or_create_user() step, which works just like a normal Python function but is decorated with @step. We won’t go into the details of the logic, as we will cover the ETL logic in Chapter 3. For now, we will focus only on the ZenML functionality.

from loguru import logger
from typing_extensions import Annotated
from zenml import get_step_context, step
from llm_engineering.application import utils
from llm_engineering.domain.documents import UserDocument
@step
def get_or_create_user(user_full_name: str) -> Annotated[UserDocument, "user"]:
    logger.info(f"Getting or creating user: {user_full_name}")
    first_name, last_name = utils.split_user_full_name(user_full_name)
    user = UserDocument.get_or_create(first_name=first_name, last_name=last_name)
    return user

Within a ZenML step, you can define any Python logic your use case needs. In this simple example, we are just creating or retrieving a user, but we could replace that code with anything, starting from data collection to feature engineering and training. What is essential to notice is that to integrate ZenML with your code, you have to write modular code, where each function does just one thing. The modularity of your code makes it easy to decorate your functions with @step and then glue multiple steps together within a main function decorated with @pipeline. One design choice that will impact your application is deciding the granularity of each step, as each will run as a different unit on a different machine when deployed in the cloud.

To decouple our code from ZenML, we encapsulated all the application and domain logic into the llm_engineering Python module. We also defined the pipelines and steps folders, where we defined our ZenML logic. Within the steps module, we only used what we needed from the llm_engineering Python module (similar to how you use a Python package). In the pipelines module, we only aggregated ZenML steps to glue them into the final pipeline. Using this design, we can easily swap ZenML with another orchestrator or use our application logic in other use cases, such as a REST API. We only have to replace the ZenML code without touching the llm_engineering module where all our logic resides.
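To make the decoupling concrete, here is a hypothetical sketch (not taken from the book’s codebase) of reusing the same application logic outside ZenML, for example in a plain script or a REST API handler, without touching the llm_engineering module:

from llm_engineering.application import utils
from llm_engineering.domain.documents import UserDocument


def register_user(user_full_name: str) -> UserDocument:
    # The same logic as the ZenML step, minus the @step decorator: the
    # orchestration layer can change without the application code changing.
    first_name, last_name = utils.split_user_full_name(user_full_name)
    return UserDocument.get_or_create(first_name=first_name, last_name=last_name)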

This folder structure is reflected at the root of the LLM-Engineers-Handbook repository, as illustrated in Figure 2.6:

Figure 2.6: LLM-Engineers-Handbook repository folder structure

One last thing to consider when writing ZenML steps is that if you return a value, it should be serializable. ZenML can serialize most objects that can be reduced to primitive data types, but there are a few exceptions. For example, we used UUID types as IDs throughout the code, which aren’t natively supported by ZenML. Thus, we had to extend ZenML’s materializer to support UUIDs. We raised this issue with the ZenML team, so future ZenML versions should support UUIDs out of the box, but it was a good example of the serialization work involved in transforming function outputs into artifacts.
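To give an idea of what such an extension can look like, here is a minimal sketch of a custom materializer that stores UUIDs as plain text. It assumes ZenML’s BaseMaterializer interface with load()/save() methods and an artifact store handle; the exact class and attribute names may differ between ZenML versions and from the implementation in the book’s repository:

from uuid import UUID

from zenml.enums import ArtifactType
from zenml.materializers.base_materializer import BaseMaterializer


class UUIDMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (UUID,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: type[UUID]) -> UUID:
        # Read the UUID back from the artifact store as plain text.
        with self.artifact_store.open(f"{self.uri}/uuid.txt", "r") as f:
            return UUID(f.read().strip())

    def save(self, data: UUID) -> None:
        # Persist the UUID as plain text in the artifact store.
        with self.artifact_store.open(f"{self.uri}/uuid.txt", "w") as f:
            f.write(str(data))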

Artifacts and metadata

As mentioned in the previous section, ZenML transforms any step output into an artifact. First, let’s quickly understand what an artifact is. In MLOps, an artifact is any file (or set of files) produced during the machine learning lifecycle, such as datasets, trained models, checkpoints, or logs. Artifacts are crucial for reproducing experiments and deploying models. We can transform anything into an artifact; for example, the model registry is a particular use case of an artifact. Artifacts have these key properties: they are versioned, shareable, and carry metadata that lets you quickly understand what’s inside. For example, when wrapping your dataset in an artifact, you can add to its metadata the dataset’s size, the train-test split ratio, the types of labels, and anything else useful for understanding the dataset without actually downloading it.

Let’s circle back to our digital_data_etl pipeline example, where the crawled links are output by a step as an artifact, as seen in Figure 2.7:

Figure 2.7: ZenML artifact example using the digital_data_etl pipeline as an example

By clicking on the crawled_links artifact and navigating to the Metadata tab, we can quickly see all the domains we crawled for a particular author, the number of links we crawled for each domain, and how many were successful, as illustrated in Figure 2.8:

Figure 2.8: ZenML metadata example using the digital_data_etl pipeline as an example

A more interesting example of an artifact and its metadata is the generated dataset artifact. In Figure 2.9, we can visualize the metadata of the instruct_datasets artifact, which was automatically generated and will be used to fine-tune the LLM Twin model. More details on the instruction datasets are in Chapter 5. For now, we want to highlight that within the dataset’s metadata, we have precomputed a lot of helpful information about it, such as how many data categories it contains, its storage size, and the number of samples per training and testing split.

Figure 2.9: ZenML metadata example for the instruct_datasets artifact

The metadata is manually added to the artifact, as shown in the code snippet below. Thus, you can precompute and attach to the artifact’s metadata anything you consider helpful for dataset discovery across your business and projects:

# More imports
from zenml import ArtifactConfig, get_step_context, step


@step
def generate_intruction_dataset(
    prompts: Annotated[dict[DataCategory, list[GenerateDatasetSamplesPrompt]], "prompts"],
) -> Annotated[
    InstructTrainTestSplit,
    ArtifactConfig(
        name="instruct_datasets",
        tags=["dataset", "instruct", "cleaned"],
    ),
]:
    datasets = ...  # Generate datasets
    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="instruct_datasets",
        metadata=_get_metadata_instruct_dataset(datasets),
    )

    return datasets


def _get_metadata_instruct_dataset(datasets: InstructTrainTestSplit) -> dict[str, Any]:
    instruct_dataset_categories = list(datasets.train.keys())
    train_num_samples = {
        category: instruct_dataset.num_samples
        for category, instruct_dataset in datasets.train.items()
    }
    test_num_samples = {
        category: instruct_dataset.num_samples
        for category, instruct_dataset in datasets.test.items()
    }

    return {
        "data_categories": instruct_dataset_categories,
        "test_split_size": datasets.test_split_size,
        "train_num_samples_per_category": train_num_samples,
        "test_num_samples_per_category": test_num_samples,
    }

Also, you can easily download and access a specific version of the dataset using its Universally Unique Identifier (UUID), which you can find using the ZenML dashboard or CLI:

from zenml.client import Client
artifact = Client().get_artifact_version('8bba35c4-8ff9-4d8f-a039-08046efc9fdc')
loaded_artifact = artifact.load()

The last step in exploring ZenML is understanding how to run and configure a ZenML pipeline.

How to run and configure a ZenML pipeline

All the ZenML pipelines can be called from the run.py file, accessed at tools/run.py in our GitHub repository. Within the run.py file, we implemented a simple CLI that allows you to specify what pipeline to run. For example, to call the digital_data_etl pipeline to crawl Maxime’s content, you have to run:

python -m tools.run --run-etl --no-cache --etl-config-filename digital_data_etl_maxime_labonne.yaml

Or, to crawl Paul’s content, you can run:

python -m tools.run --run-etl --no-cache --etl-config-filename digital_data_etl_paul_iusztin.yaml

As explained when introducing Poe the Poet, all the CLI commands used to interact with the project are executed through Poe to simplify and standardize them. Thus, we encapsulated these Python calls under the following poe CLI commands:

poetry poe run-digital-data-etl-maxime
poetry poe run-digital-data-etl-paul

We only change the ETL config file name when scraping content for different people. ZenML allows us to inject specific configuration files at runtime as follows:

from datetime import datetime as dt  # assumed import for the dt alias used below

# root_dir and etl_config_filename are provided by the surrounding CLI code in tools/run.py.
config_path = root_dir / "configs" / etl_config_filename
assert config_path.exists(), f"Config file not found: {config_path}"

run_args_etl = {
    "config_path": config_path,
    "run_name": f"digital_data_etl_run_{dt.now().strftime('%Y_%m_%d_%H_%M_%S')}",
}
digital_data_etl.with_options(**run_args_etl)()

In the config file, we specify all the values that will be passed to the pipeline as parameters. For example, the configs/digital_data_etl_maxime_labonne.yaml configuration file looks as follows:

parameters:
  user_full_name: Maxime Labonne # [First Name(s)] [Last Name]
  links:
    # Personal Blog
    - https://mlabonne.github.io/blog/posts/2024-07-29_Finetune_Llama31.html
    - https://mlabonne.github.io/blog/posts/2024-07-15_The_Rise_of_Agentic_Data_Generation.html
    # Substack
    - https://maximelabonne.substack.com/p/uncensor-any-llm-with-abliteration-d30148b7d43e
    … # More links

These values map directly onto the digital_data_etl function signature, which looks like this:

@pipeline
def digital_data_etl(user_full_name: str, links: list[str]) -> str:

This approach allows us to configure each pipeline at runtime without modifying the code. We can also clearly track the inputs for all our pipelines, ensuring reproducibility. As seen in Figure 2.10, we have one or more configs for each pipeline.

Figure 2.10: ZenML pipeline configs

Other popular orchestrators similar to ZenML that we’ve personally tested and consider powerful are Airflow, Prefect, Metaflow, and Dagster. Also, if you are a heavy user of Kubernetes, you can opt for Argo Workflows or Kubeflow, both of which run on top of Kubernetes. We still consider ZenML the best trade-off between ease of use, features, and costs. Also, none of these tools offer ZenML’s stack feature, which avoids vendor-locking you into any cloud ecosystem.

In Chapter 11, we will explore in more depth how to leverage an orchestrator to implement MLOps best practices. But now that we understand ZenML, what it is helpful for, and how to use it, let’s move on to the experiment tracker.

Comet ML: experiment tracker

Training ML models is an entirely iterative and experimental process. Unlike traditional software development, it involves running multiple parallel experiments, comparing them based on predefined metrics, and deciding which one should advance to production. An experiment tracking tool allows you to log all the necessary information, such as metrics and visual representations of your model predictions, to compare all your experiments and quickly select the best model. Our LLM project is no exception.

As illustrated in Figure 2.11, we used Comet to track metrics such as training and evaluation loss or the value of the gradient norm across all our experiments.

Figure 2.11: Comet ML training metrics example

Using an experiment tracker, you can go beyond training and evaluation metrics and log your training hyperparameters to track different configurations between experiments.
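As a rough illustration of what this looks like in code (a minimal, hypothetical sketch; the project name, hyperparameters, and metric values below are placeholders, and the book’s actual training pipeline may wire Comet in differently), logging hyperparameters and metrics with Comet boils down to a few calls:

from comet_ml import Experiment

# The API key is typically read from the COMET_API_KEY environment variable.
experiment = Experiment(project_name="llm-twin-training")
experiment.log_parameters({"learning_rate": 3e-4, "lora_rank": 32, "epochs": 3})

for step in range(100):
    train_loss = 1.0 / (step + 1)  # placeholder value for illustration
    experiment.log_metric("train/loss", train_loss, step=step)

experiment.end()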

It also logs out-of-the-box system metrics such as GPU, CPU, or memory utilization to give you a clear picture of what resources you need during training and where potential bottlenecks slow down your training, as seen in Figure 2.12.

Figure 2.12: Comet ML system metrics example

You don’t have to set up Comet locally; throughout this book, we will use their online version, which is free and imposes no constraints for our use case. Also, if you want a more in-depth look at the Comet ML experiment tracker, we made the training experiments tracked with Comet ML while fine-tuning our LLM Twin models publicly available. You can access them here: https://www.comet.com/mlabonne/llm-twin-training/view/new/panels.

Other popular experiment trackers are W&B, MLflow, and Neptune. We’ve worked with all of them and can state that they all have mostly the same features, but Comet ML differentiates itself through its ease of use and intuitive interface. Let’s move on to the final piece of the MLOps puzzle: Opik for prompt monitoring.

Opik: prompt monitoring

You cannot use standard tools and techniques to log and monitor prompts. The reasons are nuanced, and we will dig into them in Chapter 11, but the short version is that standard logging tools fall short because prompts are complex, unstructured, and chained together.

When interacting with an LLM application, you chain multiple input prompts and the generated output into a trace, where one prompt depends on previous prompts.

Thus, instead of plain text logs, you need an intuitive way to group these traces into a specialized dashboard that makes debugging and monitoring traces of prompts easier.

We used Opik, an open-source tool made by Comet, as our prompt monitoring tool because it follows Comet’s philosophy of simplicity and ease of use, which is still relatively rare in the LLM landscape. Other options offering similar features are Langfuse (open source, https://langfuse.com), Galileo (not open source, rungalileo.io), and LangSmith (not open source, https://www.langchain.com/langsmith), but we found their solutions more cumbersome to use and implement. In addition to its serverless option, Opik provides a free open-source version that you have complete control over. You can read more about Opik at https://github.com/comet-ml/opik.
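To give a flavor of how prompt traces are captured (a minimal, hypothetical sketch; the function names are placeholders, and the decorator-based tracking shown here is an assumption about Opik’s SDK rather than the book’s actual instrumentation), nested calls can be grouped into a single trace:

import opik


@opik.track
def retrieve_context(query: str) -> str:
    # Placeholder retrieval step; recorded as a span inside the trace.
    return "retrieved context for: " + query


@opik.track
def generate_answer(query: str) -> str:
    # Nested tracked calls are grouped into one trace in the dashboard,
    # so you can inspect the full prompt chain end to end.
    context = retrieve_context(query)
    return f"answer based on: {context}"


print(generate_answer("What is an LLM Twin?"))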
