Deploying large language models, such as an extractive question-answering model built with Hugging Face's transformers library, is a common need in NLP. In this tutorial you will learn how to deploy such a model to Kubernetes with KFServing, which provides standardized model serving along with features like explainability and model management. We will set up KFServing, write a custom Python model server, build a Docker image, and deploy it to a Kubernetes cluster running on Minikube.
Deploying machine learning models to production is a critical step in turning research and development efforts into practical applications. In this tutorial, we will explore how to deploy large language models (LLMs) in a Kubernetes cluster using KFServing. We will leverage KFServing to simplify the model serving process, achieve scalability, and ensure seamless integration with existing infrastructure.
To illustrate the relevance of deploying LLM models, let's consider a business use case. Imagine you are building an intelligent chatbot that provides personalized responses to customer queries. By deploying an LLM model, the chatbot can generate contextual and accurate answers, enhancing the overall user experience. With KFServing, you can easily deploy and scale the LLM model, enabling real-time interactions with users.
By the end of this tutorial, you will have a solid understanding of deploying LLM models with KFServing and be ready to apply this knowledge to your own projects.
Before diving into the deployment process, let's briefly discuss the architecture. Our setup comprises a Kubernetes cluster running in Minikube, KFServing as a framework to deploy the services, and a custom LLM model server. The Kubernetes cluster provides the infrastructure for deploying and managing the model. KFServing acts as a serving layer that facilitates standardized model serving across different frameworks. Finally, the custom LLM model server hosts the pre-trained LLM model and handles inference requests.
To follow along with this tutorial, ensure that you have the following prerequisites: Docker, kubectl, Helm 3, and Minikube installed on your machine, plus a Python environment with the packages listed in the requirements.txt file shown later in this tutorial.
Now that we have our prerequisites, let's proceed with the deployment process.
KFServing is designed to provide a standardized way of serving machine learning models across organizations. It offers high-level abstraction interfaces for common ML frameworks such as TensorFlow and PyTorch. By leveraging KFServing, data scientists and MLOps teams can collaborate seamlessly from model production to deployment. KFServing integrates easily into existing Kubernetes and Istio stacks, providing model explainability, inference graph operations, and other model management functions.
To begin, we need to set up KFServing on a Kubernetes cluster. For this tutorial, we'll use the local quick install method on a Minikube Kubernetes cluster. The quick install method allows us to install Istio and KNative without the full Kubeflow setup, making it ideal for local development and testing.
Start with the necessary dependencies, kubectl and Helm 3; we will assume they are already installed. Then, follow the Minikube install instructions to complete the setup, adjusting the memory and CPU settings for Minikube to ensure smooth operation. Once the installation is complete, start Minikube and verify the cluster status using the following commands:
minikube start --memory=6144
minikube status
The kfserving-custom-model InferenceService requests at least 4Gi of memory, so here we give Minikube a bit more than that.
Now, we'll focus on the code required to build a custom Python model server for the Hugging Face extractive question-answering model. The server is implemented in Python, leverages the Hugging Face transformers library, and subclasses the KFServing model class, implementing the necessary methods.
Let’s start by creating a new Python file and naming it kf_model_server.py
. Import the required libraries and define the kf_serving_model
class that inherits from kfserving.KFModel
. This class will handle the model loading and prediction logic:
# Import the required libraries and modules
import kfserving
from typing import List, Dict
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf
import base64
import io
# Define the custom model server class
class kf_serving_model(kfserving.KFModel):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.ready = False
        self.tokenizer = None

    def load(self):
        # Load the pre-trained model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
        self.model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
        self.ready = True

    def predict(self, request: Dict) -> Dict:
        # Perform inference on the input instances
        inputs = request["instances"]
        source_text = inputs[0]["text"]
        questions = inputs[0]["questions"]
        results = {}
        for question in questions:
            # Tokenize the question and source text
            tokenized = self.tokenizer.encode_plus(question, source_text, add_special_tokens=True, return_tensors="tf")
            input_ids = tokenized["input_ids"].numpy()[0]
            answer_start_scores, answer_end_scores = self.model(tokenized)
            # Extract the answer from the scores
            answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
            answer_end = (tf.argmax(answer_end_scores, axis=1) + 1).numpy()[0]
            answer = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
            results[question] = answer
        return {"predictions": results}


if __name__ == "__main__":
    model = kf_serving_model("kfserving-custom-model")
    model.load()
    kfserving.KFServer(workers=1).start([model])
In the above code, we define the kf_serving_model
class that inherits from kfserving.KFModel
and initializes the model and tokenizer. The class encapsulates the model loading and prediction logic. The load()
method loads the pre-trained model and tokenizer from the Hugging Face library. The predict()
method takes the input JSON and performs inference using the model. It generates question-answer pairs and returns them in the response.
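Before containerizing the server, it can be useful to smoke-test it locally. The snippet below is a minimal sketch, not part of the original tutorial, and assumes you have installed the packages from requirements.txt, started the server in another terminal with python kf_model_server.py, and that KFServer is listening on its default HTTP port of 8080:
# local_smoke_test.py -- minimal sketch for testing kf_model_server.py locally
# (assumes the server is already running on KFServer's default port 8080)
import requests

payload = {
    "instances": [
        {
            "text": "KFServing provides a standardized way of serving machine learning models on Kubernetes.",
            "questions": ["What does KFServing provide?"]
        }
    ]
}

# The v1 data plane exposes /v1/models/<model-name>:predict, where <model-name>
# matches the name passed to kf_serving_model("kfserving-custom-model").
response = requests.post(
    "http://localhost:8080/v1/models/kfserving-custom-model:predict",
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # {"predictions": {"What does KFServing provide?": "..."}}
If the call returns a predictions dictionary, the load() and predict() logic works and we can move on to packaging the server.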
Before we proceed, let's discuss some best practices for deploying LLM models with KFServing: load the model and tokenizer once at startup in the load() method rather than on every request, pin your dependency versions in requirements.txt so builds are reproducible, set memory and CPU requests and limits that match the model's footprint, and push the container image to a registry that your cluster can pull from.
Now that we have a good understanding of the code and best practices, let's proceed with the deployment process.
For the deployment, we first need a Kubernetes cluster that is up and running; you can use Minikube or a cloud-based Kubernetes service. Once the cluster is running, clone the KFServing repository and navigate to the cloned directory:
git clone git@github.com:kubeflow/kfserving.git
cd kfserving
Now we install KFServing and its dependencies using the hack/quick_install.sh
script:
./hack/quick_install.sh
To deploy our custom model server, we need to package it into a Docker container image. This allows for easy distribution and deployment across different environments.
Let’s create the Docker image by creating a new file named Dockerfile
in the same directory as the Python file:
# Use the official lightweight Python image.
FROM python:3.7-slim
ENV APP_HOME /app
WORKDIR $APP_HOME
# Install production dependencies.
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt
# Copy local code to the container image
COPY kf_model_server.py ./
CMD ["python", "kf_model_server.py"]
The Dockerfile specifies the base Python image, sets the working directory, installs the dependencies from the requirements.txt
file, and copies the Python code into the container. Here we will be running this locally on a CPU, so we will be using tensorflow-cpu
for the application:
kfserving==0.3.0
transformers==2.1.1
tensorflow-cpu==2.2.0
protobuf==3.20.0
To build the Docker image, execute the following command:
docker build -t kfserving-custom-model .
This command builds the container image using the Dockerfile and tags it with the specified name.
When you build a Docker image using docker build -t kfserving-custom-model .
, the image is only available in your local Docker environment. Kubernetes can't access images from your local Docker environment unless you're using a tool like Minikube or kind with a specific configuration to allow this.
To make the image available to Kubernetes, you need to push it to a Docker registry like Docker Hub, Google Container Registry (GCR), or any other registry accessible to your Kubernetes cluster.
Here are the general steps you need to follow:
Tag your image with the registry address:
If you are using Docker Hub, the command is:
docker tag kfserving-custom-model:latest <your-dockerhub-username>/kfserving-custom-model:latest
Push the image to the registry:
For Docker Hub, the command is:
docker push <your-dockerhub-username>/kfserving-custom-model:latest
Make sure to replace <your-dockerhub-username>
with your actual Docker Hub username. Also, ensure that your Kubernetes cluster has the necessary credentials to pull from the registry if it's private. If it's a public Docker Hub repository, there should be no issues.
Now that we have the Docker image, we can deploy the custom model server as an InferenceService
on KFServing. We'll use a YAML configuration file to describe the Kubernetes model resource. Create a file named deploy_server.yaml
and populate it with the following content:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: kfserving-custom-model
spec:
  predictor:
    containers:
      - image: <your-dockerhub-username>/kfserving-custom-model:latest
        name: kfserving-container
        resources:
          requests:
            memory: "4096Mi"
            cpu: "250m"
          limits:
            memory: "4096Mi"
            cpu: "500m"
The YAML file defines the model's metadata, including the name and labels. It specifies the container image to use, along with resource requirements for memory and CPU.
To deploy the model, run the following command:
kubectl apply -f deploy_server.yaml
This command creates the InferenceService resource in the Kubernetes cluster, deploying the custom model server.
Verify the deployment status:
kubectl get inferenceservices
This should show you the status of the inference service:
Once the containers have downloaded the BERT model, the service reports as ready and can start receiving inference calls.
Once the model is deployed on KFServing, we can make inference calls to the locally hosted Hugging Face QA model. To do this, we'll need to set up port forwarding to expose the model's port to our local system.
Execute the following command to determine whether your Kubernetes cluster is running in an environment that supports external load balancers:
kubectl get svc istio-ingressgateway -n istio-system
Now we can set up port forwarding for testing purposes:
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80
# start another terminal
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
This command forwards port 8080 on our local system to port 80 of the model's service. It enables us to access the model's endpoint locally.
Next, create a JSON file named kf_input.json
with the following content:
{
  "instances": [
    {
      "text": "Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.",
      "questions": [
        "How many pretrained models are available in Transformers?",
        "What does Transformers provide?",
        "Transformers provides interoperability between which frameworks?"
      ]
    }
  ]
}
The JSON file contains the input text and a list of questions for the model to answer. To make an inference call, use the following curl command:
curl -v -H "Host: kfserving-custom-model.default.example.com" -d @./kf_input.json http://localhost:8080/v1/models/kfserving-custom-model:predict
This command sends the JSON file as input to the predict method of our custom InferenceService, forwarding the request to the model's endpoint. It returns the following predictions:
{"predictions":
{"How many pretrained models are available in Transformers?":
"over 32 +",
"What does Transformers provide?":
"general - purpose architectures",
"Transformers provides interoperability between which frameworks?":
"tensorflow 2 . 0 and pytorch"}
}
The response includes the generated answer for each of the specified questions.
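For programmatic access, for example from the chatbot backend described at the start of this tutorial, the same call can be made from Python instead of curl. The snippet below is an illustrative sketch, not part of the original tutorial; it assumes the port-forward from the previous step is still running and that the service host follows the kfserving-custom-model.default.example.com pattern used above:
# query_qa_service.py -- minimal sketch of calling the deployed InferenceService from Python
import requests

INGRESS_URL = "http://localhost:8080/v1/models/kfserving-custom-model:predict"
SERVICE_HOST = "kfserving-custom-model.default.example.com"  # assumed default host naming

payload = {
    "instances": [
        {
            "text": "Transformers provides general-purpose architectures for Natural Language Understanding and Generation.",
            "questions": ["What does Transformers provide?"]
        }
    ]
}

# The Host header routes the request through the Istio ingress gateway to the
# right InferenceService, just like the -H flag in the curl example above.
response = requests.post(INGRESS_URL, json=payload, headers={"Host": SERVICE_HOST}, timeout=60)
response.raise_for_status()
print(response.json())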
In this tutorial, we learned how to deploy large language models (LLMs) in a Kubernetes cluster using KFServing. We set up KFServing, built a custom Python model server around the Hugging Face extractive question-answering model, created a Docker image for the model server, and deployed the model as an InferenceService on KFServing. We also made inference calls to the hosted model and obtained question-answer pairs. By following this guide, you can deploy your own LLM models in Kubernetes with ease.
Deploying LLM models in Kubernetes with KFServing simplifies the process of serving ML models at scale. It enables collaboration between data scientists and MLOps teams and provides standardized model-serving capabilities. With this knowledge, you can leverage KFServing to deploy and serve your own LLM models efficiently.
Alan Bernardo Palacio is a data scientist and engineer with extensive experience across different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst & Young and Globant, and now holds a data engineer position at Ebiquity Media, helping the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, founded startups, and later earned a Master's degree from the faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now lives and works in the Netherlands.