Building an API for Language Model Inference using Rust and Hyper

Introduction

In the landscape of artificial intelligence, the capacity to bring sophisticated Large Language Models (LLMs) to commonplace applications has always been a sought-after goal. Enter LLM, a groundbreaking Rust library crafted by Rustformers, designed to make this dream a tangible reality. By focusing on the intricate synergy between the LLM library and the foundational GGML project, this toolset pushes the boundaries of what's possible, enabling AI enthusiasts to harness the sheer might of LLMs on conventional CPUs. This shift in dynamics owes much to GGML's pioneering approach to model quantization, streamlining computational requirements without sacrificing performance.

In this comprehensive guide, we'll embark on a journey that starts with understanding the essence of the llm crate and its seamless interaction with a myriad of LLMs. Delving into its intricacies, we'll illuminate how to integrate, interact, and infer using these models. And as a tantalizing glimpse into the realm of practical application, our expedition won't conclude here. In the subsequent installment, we'll rise to the challenge of crafting a web server in Rust—one that confidently runs inference directly on a CPU, making the awe-inspiring capabilities of AI not just accessible, but an integral part of our everyday digital experiences.

This is a two-part article. In this first part, we discuss basic interaction with the library; in the second, we build a server in Rust that lets us create our own web applications on top of state-of-the-art LLMs. Let's begin.

Harnessing the Power of Large Language Models

At the very core of LLM's architecture resides the GGML project, a tensor library written in C. The name combines the initials of its author, Georgi Gerganov, with "ML" for machine learning. GGML serves as the bedrock of LLM, making it practical to run large language models on CPUs. Its key technique is model quantization.

Model quantization, a pivotal process employed by GGML, reduces the numerical precision of a machine-learning model's weights. The conventional 32-bit floating-point numbers used for calculations are converted into more compact representations, such as 16-bit floats or 8-bit integers.

Quantization can be thought of as chiseling away unnecessary detail while sculpting a model. It reduces memory and compute requirements without an outsized loss in output quality: each weight is stored in fewer bits, and the small rounding error this introduces is usually tolerable. It is a trade-off between computational efficiency and model accuracy.
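To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization in Rust. It is not GGML's actual scheme, which works on blocks of weights with its own storage formats; the function names and the single shared scale factor are assumptions chosen for illustration.

```rust
/// Quantize f32 weights to i8 using one shared scale (symmetric, absmax).
/// Illustrative only: this is not the block format GGML uses.
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    // The largest magnitude in the slice determines the scale.
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero when every weight is zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Recover approximate f32 values from the quantized weights.
fn dequantize_i8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let weights = [0.5_f32, -1.0, 0.25, 0.0];
    let (quantized, scale) = quantize_i8(&weights);
    let restored = dequantize_i8(&quantized, scale);
    println!("quantized: {:?} (scale {})", quantized, scale);
    println!("restored:  {:?}", restored);
}
```

Each weight now occupies one byte instead of four, and the restored values differ from the originals by at most about half the scale factor.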

GGML's versatility shows in its range of quantization strategies, which include 4-, 5-, and 8-bit quantization. Each offers a different trade-off between efficiency and output quality. For instance, 4-bit quantization uses the least memory and compute, but it may reduce model accuracy compared with 8-bit quantization.
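The memory side of this trade-off is simple arithmetic, sketched below in Rust. The 3-billion parameter count is a hypothetical example, and the calculation ignores the per-block scale factors and metadata that real quantized files carry, so actual file sizes will be somewhat larger.

```rust
/// Approximate weight storage in bytes for `params` parameters at `bits` bits each.
/// Ignores per-block scale factors and file metadata (an assumption of this sketch).
fn weight_bytes(params: u64, bits: u64) -> u64 {
    params * bits / 8
}

fn main() {
    // Hypothetical model size used only for illustration.
    let params: u64 = 3_000_000_000;
    for bits in [32u64, 16, 8, 5, 4] {
        let gib = weight_bytes(params, bits) as f64 / (1024.0 * 1024.0 * 1024.0);
        println!("{:>2}-bit weights: {:.2} GiB", bits, gib);
    }
}
```

Going from 32-bit to 4-bit weights cuts the raw storage by a factor of eight, which is what brings models of this size within reach of ordinary machines.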

The Rustformers library supports several language model architectures, including Bloom, GPT-2, GPT-J, GPT-NeoX, Llama, and MPT. To be used with the library, models must first be converted to GGML's format. The authors have provided pre-converted models on the Hugging Face platform, which simplifies integration.

In the next sections, we will use the llm crate to run inference on LLM models like Llama. The realm of AI innovation is beckoning, and Rustformers' LLM, fortified by GGML's techniques, forms an alluring gateway into its intricacies.

Getting Started with LLM-CLI

The Rustformers group has the mission of amplifying access to the prowess of large language models (LLMs) at the forefront of AI evolution. The group focuses on harmonizing with the rapidly advancing GGML ecosystem – a C library harnessed for quantization, enabling the execution of LLMs on CPUs. The trajectory extends to supporting diverse backends, embracing GPUs, Wasm environments, and more.

For Rust developers venturing into the realm of LLMs, the key to unlocking this potential is the llm crate – the gateway to Rustformers' innovation. Through this crate, Rust developers interface with LLMs effortlessly. The "llm" project also offers a streamlined CLI for interacting with LLMs and examples showcasing its integration into Rust projects. More insights can be gained from the GitHub repository or its official documentation for released versions.

To embark on your LLM journey, start by installing the llm-cli package, which lets you run inference on a model directly from your console.

Getting started is a streamlined process:

1. Clone the repository.
2. Install the llm-cli tool from the repository.
3. Download your chosen model from Hugging Face. In our illustration, we use an Open Llama 3B model; the file downloaded below is the 16-bit (f16) variant rather than a 4-bit quantized one.
4. Run inference with the CLI tool, passing the architecture and the path of the model downloaded in the previous step.

So let’s start with it. First, let's install llm-cli using this command:

cargo install llm-cli --git https://github.com/rustformers/llm

Next, we proceed by fetching your desired model from Hugging Face:

curl -LO https://huggingface.co/rustformers/open-llama-ggml/resolve/main/open_llama_3b-f16.bin

Finally, we can initiate a dialogue with the model using a command akin to:

llm infer -a llama -m open_llama_3b-f16.bin -p "Rust is a cool programming language because"

We can see how the llm crate stands to facilitate seamless interactions with LLMs.
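The same CLI call can also be driven from a Rust program. The sketch below only assembles the arguments used in the command above (infer, -a, -m, -p) and hands them to std::process::Command; the InferArgs type and infer_argv function are hypothetical helpers, not part of the llm crate, and the program assumes the llm binary is on your PATH and the model file is in the current directory.

```rust
use std::process::Command;

/// Arguments for one `llm infer` invocation (hypothetical helper type).
struct InferArgs<'a> {
    architecture: &'a str,
    model_path: &'a str,
    prompt: &'a str,
}

/// Build the argument list matching the CLI call shown above.
fn infer_argv(args: &InferArgs) -> Vec<String> {
    vec![
        "infer".to_string(),
        "-a".to_string(),
        args.architecture.to_string(),
        "-m".to_string(),
        args.model_path.to_string(),
        "-p".to_string(),
        args.prompt.to_string(),
    ]
}

fn main() {
    let args = InferArgs {
        architecture: "llama",
        model_path: "open_llama_3b-f16.bin",
        prompt: "Rust is a cool programming language because",
    };
    let argv = infer_argv(&args);

    // Running the command needs the `llm` binary and the model file;
    // report a failure to start instead of panicking.
    match Command::new("llm").args(&argv).status() {
        Ok(status) => println!("llm exited with {}", status),
        Err(err) => eprintln!("could not start llm: {}", err),
    }
}
```

Passing the prompt as a separate argument, rather than building a shell string, avoids quoting problems when the prompt contains spaces or special characters.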

This project empowers developers with streamlined CLI tools, exemplifying the LLM integration into Rust projects. With installation and model preparation effortlessly explained, the journey toward LLM proficiency commences. As we transition to the culmination of this exploration, the power of LLMs is within reach, ready to reshape the boundaries of AI engagement.

Conclusion: The Dawn of Accessible AI with Rust and LLM

In this exploration, we've delved deep into the Rust library LLM and its potential to bring Large Language Models (LLMs) to the masses. No longer is the prowess of advanced AI models locked behind the gates of high-end GPU architectures. With the symbiotic relationship between the LLM library and the underlying GGML tensor library, we can run language models on standard CPUs. This is made possible largely by model quantization, which GGML has incorporated. By trading a small amount of accuracy for much lower memory and compute requirements, models can now run in environments that were previously deemed infeasible.

The Rustformers team's dedication to the cause shines through their comprehensive toolset. Their offerings extend from pre-converted models on Hugging Face, ensuring ease of integration, to a CLI tool that simplifies interaction with these models. For Rust developers, the horizon of AI integration has never seemed clearer or more accessible.

As we wrap up this segment, it's evident that the paradigm of AI integration is rapidly shifting. With tools like the llm crate, developers are equipped with everything they need to harness the full might of LLMs in their Rust projects. But the journey doesn't stop here. In the next part of this series, we venture beyond the basics, and into the realm of practical application. Join us as we take a leap forward, constructing a web server in Rust that leverages the llm crate.

Author Bio

Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, and Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder of startups, and later on earned a Master's degree from the faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.