
Building an API for Language Model Inference using Rust and Hyper - Part 2

  • 10 min read
  • 31 Aug 2023


Introduction

In our previous exploration, we delved deep into the world of Large Language Models (LLMs) in Rust. Through the lens of the llm crate and the transformative potential of LLMs, we painted a picture of the current state of AI integrations within the Rust ecosystem. But knowledge, they say, is only as valuable as its application. Thus, we transition from understanding the 'how' of LLMs to applying this knowledge in real-world scenarios.

Welcome to the second part of our Rust LLM series. In this article, we roll up our sleeves to architect and deploy an inference server using Rust. Leveraging the fast and efficient Hyper HTTP library, our server will not just respond to incoming requests but will infer and communicate much like a conversational assistant. We'll guide you through the step-by-step process of setting up, routing, and serving inferences right from the server, all the while keeping our base anchored to the foundational insights from our last discussion.

For developers eager to witness the integration of Rust, Hyper, and LLMs, this guide promises to be a rewarding endeavor. By the end, you'll be equipped with the tools to set up a server that can converse intelligently, understand prompts, and provide insightful responses. So, as we progress from the intricacies of the llm crate to building a real-world application, join us in taking a monumental step toward making AI-powered interactions an everyday reality.

Imports and Data Structures

Let's start by looking at the import statements and data structures used in the code:

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};
use std::net::SocketAddr;
use serde::{Deserialize, Serialize};
use std::{convert::Infallible, io::Write, path::PathBuf};
  1. hyper: Hyper is a fast and efficient HTTP library for Rust; from it we use Body, Request, Response, and Server, along with make_service_fn and service_fn to build the connection service.
  2. SocketAddr: Used to specify the socket address (IP and port) the server binds to.
  3. serde: Serde is a powerful serialization/deserialization framework in Rust.
  4. Deserialize, Serialize: Serde derive traits for automatic deserialization and serialization.
  5. Infallible, Write, PathBuf: Standard library items used as the handler error type, for flushing generated tokens to stdout, and for locating the model file on disk.

Next, we have the data structures that will be used for deserializing JSON request data and serializing response data:

#[derive(Debug, Deserialize)]
struct ChatRequest {
   prompt: String,
}
#[derive(Debug, Serialize)]
struct ChatResponse {
   response: String,
}

1.    ChatRequest: A struct to represent the incoming JSON request containing a prompt field.
2.    ChatResponse: A struct to represent the JSON response containing a response field.
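
To make the mapping concrete, here is a minimal round-trip sketch (not part of the server code) showing the JSON these structs correspond to. It relies only on the serde_json dependency listed later in the article:

// A quick round-trip check for the structs above (a sketch, not part of the server).
fn json_round_trip() {
    // Deserialize an incoming request body into ChatRequest.
    let request: ChatRequest =
        serde_json::from_str(r#"{ "prompt": "Hello, Rust!" }"#).unwrap();
    println!("prompt = {}", request.prompt);
    // Serialize an outgoing ChatResponse into the JSON the client will receive.
    let response = ChatResponse {
        response: "Hi there".to_string(),
    };
    // Prints: {"response":"Hi there"}
    println!("{}", serde_json::to_string(&response).unwrap());
}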

Inference Function

The infer function is responsible for performing language model inference:

fn infer(prompt: String) -> String {
   let tokenizer_source = llm::TokenizerSource::Embedded;
   let model_architecture = llm::ModelArchitecture::Llama;
   let model_path = PathBuf::from("/path/to/model");
   let prompt = prompt.to_string();
   let now = std::time::Instant::now();
   let model = llm::load_dynamic(
       Some(model_architecture),
       &model_path,
       tokenizer_source,
       Default::default(),
       llm::load_progress_callback_stdout,
   )
   .unwrap_or_else(|err| {
       panic!("Failed to load {} model from {:?}: {}", model_architecture, model_path, err);
   });
   println!(
       "Model fully loaded! Elapsed: {}ms",
       now.elapsed().as_millis()
   );
   let mut session = model.start_session(Default::default());
   let mut generated_tokens = String::new(); // Accumulate generated tokens here
   let res = session.infer::<Infallible>(
       model.as_ref(),
       &mut rand::thread_rng(),
       &llm::InferenceRequest {
           prompt: (&prompt).into(),
           parameters: &llm::InferenceParameters::default(),
           play_back_previous_tokens: false,
           maximum_token_count: Some(140),
       },
       // OutputRequest
       &mut Default::default(),
       |r| match r {
           llm::InferenceResponse::PromptToken(t) | llm::InferenceResponse::InferredToken(t) => {
               print!("{t}");
               std::io::stdout().flush().unwrap();
               // Accumulate generated tokens
               generated_tokens.push_str(&t);
               Ok(llm::InferenceFeedback::Continue)
           }
           _ => Ok(llm::InferenceFeedback::Continue),
       },
   );
   // Return the accumulated generated tokens
   match res {
       Ok(_) => generated_tokens,
       Err(err) => format!("Error: {}", err),
   }
}
  1. The infer function takes a prompt as input and returns a string containing the generated tokens.
  2. It loads a language model with llm::load_dynamic, starts an inference session, and accumulates tokens as they are produced; the placeholder /path/to/model must point to an actual model file on disk.
  3. The res variable holds the result of the inference, and the closure passed to session.infer handles each inference response, printing and accumulating both prompt tokens and inferred tokens.
  4. The function returns the accumulated tokens, or an error message if inference fails.
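
Before wiring infer into the HTTP layer, it can be useful to smoke-test it in isolation. The sketch below assumes the infer function above and a valid model path inside it; you could temporarily call it from a plain main or a unit test:

// Temporary smoke test (a sketch, not part of the final server code).
fn smoke_test_infer() {
    // Run a single inference with a test prompt and print the result.
    let output = infer("Rust is an amazing programming language because".to_string());
    println!("Generated: {}", output);
}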

Request Handler

The chat_handler function handles incoming HTTP requests:

async fn chat_handler(req: Request<Body>) -> Result<Response<Body>, Infallible> {
   let body_bytes = hyper::body::to_bytes(req.into_body()).await.unwrap();
   let chat_request: ChatRequest = serde_json::from_slice(&body_bytes).unwrap();
   // Call the `infer` function with the received prompt
   let inference_result = infer(chat_request.prompt);
   // Prepare the response message
   let response_message = format!("Inference result: {}", inference_result);
   let chat_response = ChatResponse {
       response: response_message,
   };
   // Serialize the response and send it back
   let response = Response::new(Body::from(serde_json::to_string(&chat_response).unwrap()));
   Ok(response)
}
  1. chat_handler asynchronously handles incoming requests by deserializing the JSON payload.
  2. It calls the infer function with the received prompt and constructs a response message.
  3. The response is serialized as JSON and sent back in the HTTP response.
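
Note that the handler above calls unwrap at each step, so a malformed request body will panic the task serving that connection. If you prefer to fail gracefully, a more defensive variant could return a 400 instead; this is only a sketch, and the bad_request helper is hypothetical, introduced here for illustration:

use hyper::StatusCode;

async fn chat_handler(req: Request<Body>) -> Result<Response<Body>, Infallible> {
    // Read the body; answer with 400 instead of panicking on failure.
    let body_bytes = match hyper::body::to_bytes(req.into_body()).await {
        Ok(bytes) => bytes,
        Err(_) => return Ok(bad_request("could not read request body")),
    };
    // Reject payloads that do not match the ChatRequest shape.
    let chat_request: ChatRequest = match serde_json::from_slice(&body_bytes) {
        Ok(parsed) => parsed,
        Err(_) => return Ok(bad_request("invalid JSON payload")),
    };
    let inference_result = infer(chat_request.prompt);
    let chat_response = ChatResponse {
        response: format!("Inference result: {}", inference_result),
    };
    let body = serde_json::to_string(&chat_response).unwrap_or_else(|_| "{}".to_string());
    Ok(Response::new(Body::from(body)))
}

// Hypothetical helper used only in this sketch.
fn bad_request(message: &str) -> Response<Body> {
    Response::builder()
        .status(StatusCode::BAD_REQUEST)
        .body(Body::from(message.to_string()))
        .unwrap()
}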

Router and Not Found Handler

The router function maps incoming requests to the appropriate handlers:

async fn router(req: Request<Body>) -> Result<Response<Body>, Infallible> {
   match (req.uri().path(), req.method()) {
       ("/api/chat", &hyper::Method::POST) => chat_handler(req).await,
       _ => not_found(),
   }
}
  1. router matches incoming requests based on the path and HTTP method.
  2. If the path is "/api/chat" and the method is POST, it calls the chat_handler.
  3. If no match is found, it calls the not_found function.
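
The not_found function is called by the router but is not included in the listing above. A minimal version consistent with the router's signature (a sketch, since the original implementation is not shown) could look like this:

fn not_found() -> Result<Response<Body>, Infallible> {
    // Plain 404 for anything other than POST /api/chat.
    Ok(Response::builder()
        .status(hyper::StatusCode::NOT_FOUND)
        .body(Body::from("Not Found"))
        .unwrap())
}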

Main Function

The main function initializes the server and starts listening for incoming connections:

#[tokio::main]
async fn main() {
    // Bind to all interfaces on port 8083.
    let addr = SocketAddr::from(([0, 0, 0, 0], 8083));
    // Build a fresh `router` service for every incoming connection.
    let make_svc = make_service_fn(|_conn| {
        async { Ok::<_, Infallible>(service_fn(router)) }
    });
    let server = Server::bind(&addr).serve(make_svc);
    println!("Server listening on port 8083...");
    // Run the server until it fails or the process is stopped.
    if let Err(e) = server.await {
        eprintln!("server error: {}", e);
    }
}
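
A few notes on main: make_service_fn builds a new service for every incoming connection, service_fn(router) turns the async router function into that service, and Server::bind(&addr).serve(make_svc) runs everything on 0.0.0.0:8083. If you want the port to be configurable, one small variation (a sketch, not part of the original code) resolves the bind address from an environment variable:

use std::env;

// Hypothetical helper: read the port from the PORT environment variable,
// falling back to 8083 when it is unset or invalid.
fn server_addr() -> SocketAddr {
    let port = env::var("PORT")
        .ok()
        .and_then(|value| value.parse::<u16>().ok())
        .unwrap_or(8083);
    SocketAddr::from(([0, 0, 0, 0], port))
}

In main, you would then write let addr = server_addr(); in place of the hard-coded address.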

Building the Server

In this section, we'll walk through the steps to build and run the server that performs language model inference using Rust and the Hyper framework. We'll also demonstrate how to make a POST request to the server using Postman.

1.     Install Rust: If you haven't already, you need to install Rust on your machine. You can download Rust from the official website: https://www.rust-lang.org/tools/install

2.     Create a New Rust Project: Create a new directory for your project and navigate to it in the terminal. Run the following command to create a new Rust project:

       cargo new language_model_server

This command will create a new directory named language_model_server containing the basic structure of a Rust project.
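
The generated layout looks roughly like this:

language_model_server/
├── Cargo.toml
└── src/
    └── main.rs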

3.     Add Dependencies: Open the Cargo.toml file in the language_model_server directory and add the required dependencies for Hyper and the other libraries. Your Cargo.toml file should look something like this:

[package]
name = "language_model_server"
version = "0.1.0"
edition = "2018"

[dependencies]
hyper = { version = "0.13" }
tokio = { version = "0.2", features = ["macros", "rt-threaded"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
llm = { git = "https://github.com/rustformers/llm.git" }
rand = "0.8.5"

Note that the code in this article is written against the hyper 0.13 and tokio 0.2 APIs; if you upgrade to newer major versions, some of these APIs and feature names have changed, so the code and feature flags may need adjustments.

4.     Replace Code: Replace the content of the src/main.rs file in your project directory with the code you've been provided in the earlier sections.

5.     Building the Server: In the terminal, navigate to your project directory and run the following command to build the server:

        cargo build --release

This will compile your code and produce an executable binary in the target/release directory.


Running the Server

1.     Running the Server: After building the server, you can run it using the following command:

        cargo run --release

Your server will start listening on port 8083.

2.     Accessing the Server: Open a web browser and navigate to http://localhost:8083. Since only POST /api/chat is routed, you should see the "Not Found" message, which confirms the server is up and running.

Making a POST Request Using Postman

1.     Install Postman: If you don't have Postman installed, you can download it from the official website: https://www.postman.com/downloads/

2.     Create a POST Request:

  • Open Postman and create a new request.
  • Set the request type to "POST".
  • Enter the URL: http://localhost:8083/api/chat
  • In the "Body" tab, select "raw" and set the content type to "JSON (application/json)".
  • Enter the following JSON request body:

{
  "prompt": "Rust is an amazing programming language because"
}

3.     Send the Request: Click the "Send" button to make the POST request to your server. 


4.     View the Response: You should receive a response from the server, indicating the inference result generated by the language model.
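
The exact text depends on the model, but the body follows the ChatResponse shape. Because the callback in infer accumulates both prompt tokens and inferred tokens, the returned string starts with the echoed prompt followed by the model's continuation, along the lines of:

{
  "response": "Inference result: Rust is an amazing programming language because ..."
}

Here the trailing ... stands in for whatever text the model generates.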


Conclusion

In the previous article, we introduced the foundational concepts, setting the stage for the hands-on application we delved into this time. In this article, our main goal was to bridge theory with practice. Using the llm crate alongside the Hyper library, we embarked on a mission to create a server capable of understanding and executing language model inference. But our work was more than just setting up a server; it was about illustrating the synergy between Rust, a language famed for its safety and concurrency features, and the vast world of AI.

What's especially encouraging is how this project can serve as a springboard for many more innovations. With the foundation laid out, there are numerous avenues to explore, from refining the server's performance to integrating more advanced features or scaling it for larger audiences.

If there's one key takeaway from our journey, it's the importance of continuous learning and experimentation. The tech landscape is ever-evolving, and the confluence of AI and programming offers a fertile ground for innovation.

As we conclude this series, our hope is that the knowledge shared acts as both a source of inspiration and a practical guide. Whether you're a seasoned developer or a curious enthusiast, the tools and techniques we've discussed can pave the way for your own unique creations. So, as you move forward, keep experimenting, iterating, and pushing the boundaries of what's possible. Here's to many more coding adventures ahead!

Author Bio

Alan Bernardo Palacio is a data scientist and engineer with broad experience across engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst & Young and Globant, and now holds a data engineer position at Ebiquity Media, helping the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucumán in 2015, founded startups, and later earned a Master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.

LinkedIn