
Deploying LLM Models in Kubernetes with KFServing

  • 14 min read
  • 21 Aug 2023


Deploying LLM models, such as the extractive question-answering model from Hugging Face's transformers library, is a common task in NLP. In this tutorial, you will learn how to deploy an LLM model in Kubernetes via KFServing, which provides standardized model serving with features such as explainability and model management. You will set up KFServing, write a Python model server around Hugging Face's transformers library, build a Docker image, and deploy the model to a Kubernetes cluster running on Minikube.

Introduction

Deploying machine learning models to production is a critical step in turning research and development efforts into practical applications. In this tutorial, we will explore how to deploy Large Language Models (LLMs) in a Kubernetes cluster using KFServing. We will leverage the power of KFServing to simplify the model serving process, achieve scalability, and ensure seamless integration with existing infrastructure.

To illustrate the relevance of deploying LLM models, let's consider a business use case. Imagine you are building an intelligent chatbot that provides personalized responses to customer queries. By deploying an LLM model, the chatbot can generate contextual and accurate answers, enhancing the overall user experience. With KFServing, you can easily deploy and scale the LLM model, enabling real-time interactions with users.

By the end of this tutorial, you will have a solid understanding of deploying LLM models with KFServing and be ready to apply this knowledge to your own projects.

Architecture Overview

Before diving into the deployment process, let's briefly discuss the architecture. Our setup comprises a Kubernetes cluster running in Minikube, KFServing as a framework to deploy the services, and a custom LLM model server. The Kubernetes cluster provides the infrastructure for deploying and managing the model. KFServing acts as a serving layer that facilitates standardized model serving across different frameworks. Finally, the custom LLM model server hosts the pre-trained LLM model and handles inference requests.

Prerequisites and Setup

To follow along with this tutorial, ensure that you have the following prerequisites (a quick way to verify them follows the list):

  • A Kubernetes cluster: You can set up a local Kubernetes cluster using Minikube or use a cloud-based Kubernetes service like Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS).
  • Docker: Install Docker to build and containerize the custom LLM model server.
  • Python and Dependencies: Install Python and the necessary dependencies, including KFServing, Transformers, TensorFlow, and other required packages. You can find a list of dependencies in the requirements.txt file.
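
Before proceeding, you can quickly confirm that each tool is installed and on your PATH. The exact versions are not critical for this walkthrough; these commands simply verify the setup:

kubectl version --client
helm version
docker --version
minikube version
python3 --version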

Now that we have our prerequisites, let's proceed with the deployment process.

Introduction to KFServing

KFServing is designed to provide a standardized way of serving machine learning models across organizations. It offers high abstraction interfaces for common ML frameworks like TensorFlow, PyTorch, and more. By leveraging KFServing, data scientists and MLOps teams can collaborate seamlessly from model production to deployment. KFServing can be easily integrated into existing Kubernetes and Istio stacks, providing model explainability, inference graph operations, and other model management functions.
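
As a point of reference for what this standardization looks like, the following is a minimal sketch of an InferenceService manifest that serves a stock TensorFlow model without any custom code. The storageUri points at the flowers sample model from the KFServing examples; treat the exact path and field layout as assumptions to check against your KFServing/KServe version:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"

In this tutorial, however, we will package a custom model server rather than pointing KFServing at a stored model.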

Setting Up KFServing

To begin, we need to set up KFServing on a Kubernetes cluster. For this tutorial, we'll use the local quick install method on a Minikube Kubernetes cluster. The quick install method allows us to install Istio and Knative without the full Kubeflow setup, making it ideal for local development and testing.

You will need kubectl and Helm 3; we will assume they are already installed. Then, follow the Minikube install instructions to complete the setup, adjusting the memory and CPU settings for Minikube to ensure smooth functioning. Once the installation is complete, start Minikube and verify the cluster status using the following commands:

minikube start --memory=6144
minikube status

The kfserving-custom-model InferenceService requests at least 4Gi of memory, so we give Minikube a bit more than that.
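
If your machine has spare cores, you can also allocate more CPUs when starting Minikube; the value below is only an example, not a requirement of this tutorial:

minikube start --memory=6144 --cpus=4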

Building a Custom Python Model Server

Now, we'll focus on the code required to build a custom Python model server for the Hugging Face extractive question-answering model. We'll use the KFServing model class and implement the necessary methods. We will start by understanding the code that powers the custom LLM model server. The server is implemented using Python and leverages the Hugging Face transformer library.

Let’s start by creating a new Python file and naming it kf_model_server.py. Import the required libraries and define the kf_serving_model class, which inherits from kfserving.KFModel. This class will handle the model loading and prediction logic:

# Import the required libraries and modules
import kfserving
from typing import List, Dict
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf
import base64
import io
 
# Define the custom model server class
class kf_serving_model(kfserving.KFModel):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.ready = False
        self.tokenizer = None
        self.model = None
 
    def load(self):
        # Load the pre-trained model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
        self.model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
        self.ready = True
 
    def predict(self, request: Dict) -> Dict:
        inputs = request["instances"]
 
        # Perform inference on the input instances
        source_text = inputs[0]["text"]
        questions = inputs[0]["questions"]
        results = {}
 
        for question in questions:
            # Tokenize the question and source text
            encoded = self.tokenizer.encode_plus(question, source_text, add_special_tokens=True, return_tensors="tf")
            input_ids = encoded["input_ids"].numpy()[0]
            answer_start_scores, answer_end_scores = self.model(encoded)
 
            # Extract the answer from the scores
            answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
            answer_end = (tf.argmax(answer_end_scores, axis=1) + 1).numpy()[0]
            answer = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
 
            results[question] = answer
 
        return {"predictions": results}
   
 
if __name__ == "__main__":
    model = kf_serving_model("kfserving-custom-model")
    model.load()
    kfserving.KFServer(workers=1).start([model])

In the above code, we define the kf_serving_model class that inherits from kfserving.KFModel and initializes the model and tokenizer. The class encapsulates the model loading and prediction logic. The load() method loads the pre-trained model and tokenizer from the Hugging Face library. The predict() method takes the input JSON and performs inference using the model. It generates question-answer pairs and returns them in the response.
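
Before containerizing the server, it can be useful to sanity-check the prediction logic locally. The snippet below is a minimal sketch: it assumes the dependencies from requirements.txt are installed in your local environment and that the pre-trained BERT weights can be downloaded:

# Quick local test of the model server logic (run outside Kubernetes)
from kf_model_server import kf_serving_model

model = kf_serving_model("kfserving-custom-model")
model.load()  # downloads the tokenizer and model weights on first run

request = {
    "instances": [
        {
            "text": "KFServing provides a Kubernetes custom resource for serving machine learning models on arbitrary frameworks.",
            "questions": ["What does KFServing provide?"]
        }
    ]
}
print(model.predict(request))
# Expected output shape: {"predictions": {"What does KFServing provide?": "..."}}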

Before we proceed, let's discuss some best practices for deploying LLM models with KFServing:

  • Model Versioning: Maintain different versions of the LLM model to support A/B testing, rollback, and easy model management.
  • Scalability: Design the deployment to handle high traffic loads by optimizing resource allocation and leveraging horizontal scaling techniques (see the sketch after this list).
  • Monitoring and Error Handling: Implement robust logging and monitoring mechanisms to track model performance, detect anomalies, and handle errors gracefully.
  • Performance Optimization: Explore techniques like batch processing, parallelization, and caching to optimize the inference speed and resource utilization of the deployed model.
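
As an illustration of the scalability point, replica bounds and a concurrency target can be declared directly on the InferenceService, as in the sketch below. Treat the field names (minReplicas, maxReplicas) and the Knative autoscaling annotation as assumptions to verify against the KFServing/KServe version you are running:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: kfserving-custom-model
  annotations:
    # Target number of concurrent requests per replica before scaling out
    autoscaling.knative.dev/target: "5"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
    - image: <your-dockerhub-username>/kfserving-custom-model:latest
      name: kfserving-container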

Now that we have a good understanding of the code and best practices, let's proceed with the deployment process.

Deployment Steps

For the deployment, first, we need to set up the Kubernetes cluster and ensure it is running smoothly. You can use Minikube or a cloud-based Kubernetes service. Once the cluster is running, we install the KFServing CRD by cloning the KFServing repository and navigating to the cloned directory:

git clone git@github.com:kubeflow/kfserving.git
cd kfserving

Now we install the necessary dependencies using the hack/quick_install.sh script:

./hack/quick_install.sh

To deploy our custom model server, we need to package it into a Docker container image. This allows for easy distribution and deployment across different environments.

Building a Docker Image for the Model Server

Let’s create the Docker image by creating a new file named Dockerfile in the same directory as the Python file:

# Use the official lightweight Python image.
FROM python:3.7-slim
 
ENV APP_HOME /app
WORKDIR $APP_HOME
 
# Install production dependencies.
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt
 
# Copy local code to the container image
COPY kf_model_server.py ./
 
CMD ["python", "kf_model_server.py"]
 

The Dockerfile specifies the base Python image, sets the working directory, installs the dependencies from the requirements.txt file, and copies the Python code into the container. Since we will be running this locally on a CPU, we use tensorflow-cpu in the requirements.txt file:

kfserving==0.3.0
transformers==2.1.1
tensorflow-cpu==2.2.0
protobuf==3.20.0

To build the Docker image, execute the following command:

docker build -t kfserving-custom-model .

This command builds the container image using the Dockerfile and tags it with the specified name.
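
Optionally, you can smoke-test the image locally before involving any registry. This is a sketch that assumes the KFServer in this kfserving release listens on its default HTTP port 8080; note that the container downloads the BERT weights on start-up, which can take a while:

docker run --rm -p 8080:8080 kfserving-custom-model
# Once the server is up, POST a payload to
# http://localhost:8080/v1/models/kfserving-custom-model:predict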

When you build a Docker image using docker build -t kfserving-custom-model ., the image is only available in your local Docker environment. Kubernetes can't access images from your local Docker environment unless you're using a tool like Minikube or kind with a specific configuration to allow this.

To make the image available to Kubernetes, you need to push it to a Docker registry like Docker Hub, Google Container Registry (GCR), or any other registry accessible to your Kubernetes cluster.

Here are the general steps you need to follow:

Tag your image with the registry address:


If you are using Docker Hub, the command is:

docker tag kfserving-custom-model:latest <your-dockerhub-username>/kfserving-custom-model:latest

Push the image to the registry:

For Docker Hub, the command is:

docker push <your-dockerhub-username>/kfserving-custom-model:latest

Make sure to replace <your-dockerhub-username> with your actual Docker Hub username. Also, ensure that your Kubernetes cluster has the necessary credentials to pull from the registry if it's private. If it's a public Docker Hub repository, there should be no issues.
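
Alternatively, if you are only testing locally with Minikube, a common shortcut is to build the image directly against Minikube's Docker daemon so that no registry push is needed. This is a sketch; if you go this route, also set imagePullPolicy: IfNotPresent (or Never) on the container in the deployment manifest, because images tagged :latest are otherwise always pulled from a registry:

# Point the local Docker client at Minikube's Docker daemon
eval $(minikube docker-env)

# Rebuild the image so it is available inside Minikube's image store
docker build -t kfserving-custom-model:latest .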

Deploying the Custom Model Server on KFServing

Now that we have the Docker image, we can deploy the custom model server as an InferenceService on KFServing. We'll use a YAML configuration file to describe the Kubernetes model resource. Create a file named deploy_server.yaml and populate it with the following content:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: kfserving-custom-model
spec:
  predictor:
    containers:
    - image: <your-dockerhub-username>/kfserving-custom-model:latest
      name: kfserving-container
      resources:
        requests:
          memory: "4096Mi"
          cpu: "250m"
        limits:
          memory: "4096Mi"
          cpu: "500m"

The YAML file defines the model's metadata, including the name and labels. It specifies the container image to use, along with resource requirements for memory and CPU.

To deploy the model, run the following command:

kubectl apply -f deploy_server.yaml

This command creates the InferenceService resource in the Kubernetes cluster, deploying the custom model server.

Verify the deployment status:

kubectl get inferenceservices

This should show you the status of the inference service:

[Image: output of kubectl get inferenceservices showing the kfserving-custom-model service status]

We can see that the containers have downloaded the BERT model and they are now ready to start receiving inference calls.
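
If the service does not become ready, the container logs usually show whether the model download or the server start-up is the problem. The container name kfserving-container comes from the deploy_server.yaml above; the pod name carries an autogenerated suffix, so look it up first:

kubectl get pods
kubectl logs <kfserving-custom-model-pod-name> -c kfserving-container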

Making an Inference Call with the KFServing-Hosted Model

Once the model is deployed on KFServing, we can make inference calls to the locally hosted Hugging Face QA model. To do this, we'll need to set up port forwarding to expose the model's port to our local system.

Execute the following command to determine whether your Kubernetes cluster is running in an environment that supports external load balancers:

kubectl get svc istio-ingressgateway -n istio-system

Now we can set up port forwarding for testing purposes:

INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80
# start another terminal
export INGRESS_HOST=localhost
export INGRESS_PORT=8080

This command forwards port 8080 on our local system to port 80 of the Istio ingress gateway service, which routes requests to the model and enables us to access its endpoint locally.

Next, create a JSON file named kf_input.json with the following content:

{
  "instances": [
    {
      "text": "Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.",
      "questions": [
        "How many pretrained models are available in Transformers?",
        "What does Transformers provide?",
        "Transformers provides interoperability between which frameworks?"
      ]
    }
  ]
}

The JSON file contains the input text and a list of questions for the model to answer. To make an inference call, use the following curl command, setting the Host header to the hostname that KFServing assigns to the InferenceService:

curl -v -H "Host: kfserving-custom-model.default.example.com" -d @./kf_input.json http://localhost:8080/v1/models/kfserving-custom-model:predict

This command sends the JSON file as input to the predict method of our custom InferenceService, forwarding the request to the model's endpoint. It returns the following predictions:

{"predictions":
      {"How many pretrained models are available in Transformers?":
                  "over 32 +",
            "What does Transformers provide?":
                  "general - purpose architectures",
            "Transformers provides interoperability between which frameworks?":
                  "tensorflow 2 . 0 and pytorch"}
}

We can see the whole operation here:

[Image: terminal output of the full inference call and the returned predictions]

The response includes the generated question-answer pairs for each one of the specified questions.
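
If you prefer to call the endpoint from Python rather than curl, the same request can be reproduced with the requests library (installed separately; it is not part of requirements.txt). A minimal sketch, assuming the port-forward above is still active and kf_input.json is in the current directory:

# Send the same inference request from Python
import json
import requests

with open("kf_input.json") as f:
    payload = json.load(f)

response = requests.post(
    "http://localhost:8080/v1/models/kfserving-custom-model:predict",
    json=payload,
    headers={"Host": "kfserving-custom-model.default.example.com"},
)
print(response.json()["predictions"])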

Conclusion

In this tutorial, we learned how to deploy Large Language Models (LLMs) in a Kubernetes cluster using KFServing. We set up KFServing, built a custom Python model server using the Hugging Face extractive question-answering model, created a Docker image for the model server, and deployed the model as an InferenceService on KFServing. We also made inference calls to the hosted model and obtained question-answer pairs. By following this guide, you can deploy your own LLM models in Kubernetes with ease.

Deploying LLM models in Kubernetes with KFServing simplifies the process of serving ML models at scale. It enables collaboration between data scientists and MLOps teams and provides standardized model-serving capabilities. With this knowledge, you can leverage KFServing to deploy and serve your own LLM models efficiently.

Author Bio:

Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young and Globant, and currently holds a data engineer position at Ebiquity Media, helping the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, has been a founder of startups, and earned a Master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.
