
Efficient LLM Querying with LMQL

  • 14 min read
  • 12 Sep 2023


Introduction

In the world of natural language processing, Large Language Models (LLMs) have proven to be highly successful at a variety of language-based tasks, such as machine translation, text summarization, question answering, reasoning, and code generation. LLMs like ChatGPT, GPT-4, and others have demonstrated outstanding performance by predicting the next token in a sequence based on input prompts. Users interact with these models by providing language instructions or examples to perform various downstream tasks. However, to achieve optimal results or adapt LLMs for specific tasks, complex and task-specific programs must be implemented, often requiring ad-hoc interactions and deep knowledge of the model's internals.

In this article, we discuss LMQL, a framework for Language Model Programming (LMP) that lets users specify complex interactions, control flow, and constraints through a declarative, SQL-like programming language, without requiring deep knowledge of the LLM's internals. LMQL supports high-level, logical constraints, so users can express a wide range of prompting techniques concisely, reduce the ad-hoc interactions and manual work needed to steer model generation, avoid costly re-querying, and guide the text generation process according to their specific criteria. Let’s start.

Overview of Large Language Models

Language models (LMs) operate on sequences of tokens, where tokens are discrete elements that represent words or sub-words in a text. A tokenizer maps the input text to tokens, and the language model then predicts the probabilities of possible next tokens based on the input sequence. Several techniques are used to turn these predictions into output sequences, among them:

  1. Decoding Methods:
    • Greedy decoding: Selecting the token with the highest probability at each step.
    • Sampling: Randomly sampling tokens based on the predicted probabilities.
    • Full decoding: Enumerating all possible sequences and selecting the one with the highest probability (computationally expensive).
    • Beam search: Maintaining a set of candidate sequences and refining them by predicting the next token.
  2. Masked Decoding: In some cases, certain tokens can be ruled out based on a mask that indicates which tokens are viable. Decoding is then performed on the remaining set of tokens.
  3. Few-Shot Prompting: LMs can be trained on broad text-sequence prediction datasets and then provided with context in the form of examples for specific tasks. This approach allows LMs to perform downstream tasks without task-specific training.
  4. Multi-Part Prompting: LMs are used not only for simple prompt completion but also as reasoning engines integrated into larger programs. Various LM programming schemes explore compositional reasoning, such as iterated decompositions, meta prompting, tool use, and composition of multiple prompts.

It is also worth noting that, for beam search and sampling, a parameter called temperature controls the diversity of the output: higher temperatures flatten the probability distribution and produce more varied text, while lower temperatures make the output closer to greedy decoding.
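To make these ideas concrete, the following is a minimal, self-contained sketch (using NumPy and a made-up five-token vocabulary rather than a real language model) of greedy decoding, temperature-scaled sampling, and masked decoding applied to a single next-token distribution:

import numpy as np

# Toy vocabulary and raw next-token scores (logits) from a hypothetical LM step
vocab = ["the", "cat", "sat", "on", "<eos>"]
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Greedy decoding: pick the single most likely token
greedy_token = vocab[int(np.argmax(softmax(logits)))]

# Sampling with temperature: higher temperature flattens the distribution and
# increases diversity; lower temperature approaches greedy behaviour
def sample(logits, temperature=1.0):
    probs = softmax(logits / temperature)
    return vocab[np.random.choice(len(vocab), p=probs)]

# Masked decoding: rule out tokens that are not viable
# (here only "cat" and "sat" are allowed)
mask = np.array([False, True, True, False, False])
masked_logits = np.where(mask, logits, -np.inf)
masked_token = vocab[int(np.argmax(softmax(masked_logits)))]

print(greedy_token, sample(logits, temperature=1.5), masked_token)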

These techniques enable LMs to be versatile and perform a wide range of tasks without requiring task-specific training, making them powerful multi-task reasoners.

Asking the Right Questions

While LLMs can be prompted with examples or instructions, using them effectively and adapting to new models often demands a deep understanding of their internal workings, along with the use of vendor-specific libraries and implementations. Constrained decoding to limit text generation to legal words or phrases can be challenging. Many advanced prompting methods require complex interactions and control flows between the LLM and the user, leading to manual work and restricting the generality of implementations. Additionally, generating complete sequences from LLMs may require multiple calls and become computationally expensive, resulting in high usage costs per query in pay-to-use APIs. Generally, the challenges associated with creating proper prompts for LLMs are:

  1. Interaction Challenge: One challenge in LM interaction is the need for multiple manual interactions during the decoding process. For example, in meta prompting, where the language model is asked to expand the prompt and then provide an answer, the current approach requires inputting the prompt partially, invoking the LM, extracting information, and manually completing the sequence. This manual process may involve human intervention or several API calls, making joint optimization of template parameters difficult and limiting automated optimization possibilities.
  2. Constraints & Token Representation: Another issue arises when considering completions generated by LMs. Sometimes, LMs may produce long, ongoing sequences of text that do not adhere to desired constraints or output formats. Users often have specific constraints for the generated text, which may be violated by the LM. Expressing these constraints in terms of human-understandable concepts and logic is challenging, and existing methods require considerable manual implementation effort and model-level understanding of decoding procedures, tokenization, and vocabulary.
  3. Efficiency and Cost Challenge: Efficiency and performance remain significant challenges in LM usage. While efforts have been made to improve the inference step in modern LMs, they still demand high-end GPUs for reasonable performance. This makes practical usage costly, particularly when relying on hosted models running in the cloud with paid APIs. The computational and financial expenses associated with frequent LM querying can become prohibitive.

Addressing these challenges, Language Model Programming and constraints offer new optimization opportunities. By defining behavior and limiting the search space, the number of LM invocations can be reduced. In this context, the cost of validation, parsing, and mask generation becomes negligible compared to the significant cost of a single LM call.

So the question arises, how can we overcome the challenges of implementing complex interactions and constraints with LLMs while reducing computational costs and retaining or improving accuracy on downstream tasks?

Introducing LMQL

To address these challenges and enhance language model programming, a team of researchers has introduced LMQL (Language Model Query Language). LMQL is an open-source programming language and platform for LLM interaction that combines prompts, constraints, and scripting. It is designed to elevate the capabilities of LLMs like ChatGPT, GPT-4, and any future models, offering a declarative, SQL-like approach based on Python.

LMQL enables Language Model Programming (LMP), a novel paradigm that extends traditional natural language prompting by allowing lightweight scripting and output constraining. This separation of front-end and back-end interaction allows users to specify complex interactions, control flow, and constraints without needing deep knowledge of the LLM's internals. This approach abstracts away tokenization, implementation, and architecture details, making it more portable and easier to use across different LLMs.

With LMQL, users can express a wide range of prompting techniques concisely, reducing the need for ad-hoc interactions and manual work. The language supports high-level, logical constraints, enabling users to steer model generation and avoid costly re-querying and validation. By guiding the text generation process according to specific criteria, users can achieve the desired output with fewer iterations and improved efficiency.

Moreover, LMQL leverages evaluation semantics to automatically generate token masks for LM decoding based on user-specified constraints. This optimization reduces inference cost by up to 80%, resulting in significant latency reduction and lower computational expenses, particularly beneficial for pay-to-use APIs.

LMQL addresses several challenges in LM interaction and usage, namely:

  1. Overcoming Manual Interaction: LMQL simplifies the prompt and eliminates the need for manual interaction during the decoding process. It achieves this by allowing the use of variables, represented within square brackets, which store the answers obtained from the language model. These variables can be referenced later in the query, avoiding the need for manual extraction and input. By employing LMQL syntax, the interaction process becomes more automated and efficient.
  2. Constraints on Variable Parts: To address issues related to long and irrelevant outputs, LMQL introduces constraints on the variable parts of LM interaction. These constraints allow users to specify word and phrase limitations for the generated text. LMQL ensures that the decoded tokens for variables meet these constraints during the decoding process. This provides more control over the generated output and ensures that it adheres to user-defined restrictions.
  3. Generalization of Multi-Part Prompting: Language Model Programming through LMQL generalizes various multi-part prompting approaches discussed earlier. It streamlines the process of trying different values for variables by automating the selection process. Users can set constraints on variables, which are then applied to multiple inputs without any human intervention. Once developed and tested, an LMQL query can be easily applied to different inputs in an unsupervised manner, eliminating the need for manual trial and error.
  4. Efficient Execution: LMQL offers efficiency benefits over manual interaction. The constraints and scripting capabilities in LMQL are applied eagerly during decoding, reducing the number of times the LM needs to be invoked. This optimized approach results in notable time and cost savings, especially when using hosted models in cloud environments.

The LMQL syntax involves components such as the decoder, the actual query, the model to query, and the constraints. The decoder specifies the decoding procedure, which can include argmax, sample, or beam search. LMQL allows for constraints on the generated text using Python syntax, making it more user-friendly and easily understandable. Additionally, the distribution instruction allows users to augment the returned result with probability distributions, which is useful for tasks like sentiment analysis.
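To see how these components fit together, here is a hedged sketch of a sentiment-style query written in the embedded Python form used later in this article. The prompt text, variable names, and model are illustrative assumptions, and the exact placement of the distribution clause should be checked against the LMQL documentation for your version:

import lmql

@lmql.query
async def classify_review(review: str):
    '''lmql
    argmax
        """Review: {review}
        Q: What is the underlying sentiment of this review?
        A:[ANALYSIS]
        Overall, the sentiment is[CLS]"""
    from
        "openai/text-davinci-003"
    where
        STOPS_AT(ANALYSIS, "\\n")
    distribution
        CLS in [" positive", " neutral", " negative"]
    '''

Here argmax is the decoder, the triple-quoted string is the actual query with the placeholder variables ANALYSIS and CLS, the from clause names the model to query, the where clause constrains generation, and the distribution clause asks LMQL to return a probability distribution over the candidate sentiment labels.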

Using LMQL with Python

LMQL can be utilized in various ways - as a standalone language, in the Playground, or as a Python library - and it is the latter that we will demonstrate now. Integrating LMQL into Python projects allows users to streamline their code and incorporate LMQL queries seamlessly. Let's explore how to use LMQL as a Python library and understand some examples.

To begin, make sure you have LMQL and LangChain installed by running the following command:

!pip install lmql==0.0.6.6 langchain==0.0.225

You can then define and execute LMQL queries within Python using a simple approach. Decorate a Python function with the lmql.query decorator, providing the query code as a multi-line string. The decorated function will automatically be compiled into an LMQL query. The return value of the decorated function will be the result of the LMQL query.

Here's an example code snippet demonstrating this:

import lmql
import aiohttp
import os
 
os.environ['OPENAI_API_KEY'] = '<your-openai-key>'
 
@lmql.query
async def hello():
    '''lmql
    argmax
        "Hello[WHO]"
    from
        "openai/text-ada-001"
    where
        len(TOKENS(WHO)) < 10
    '''
 
print(await hello())

LMQL provides a fully asynchronous API that enables running multiple LMQL queries in parallel. By declaring functions as async with @lmql.query, you can use await to execute the queries concurrently.
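As a minimal sketch of this (reusing the hello() query defined above and assuming a notebook or other context where top-level await is available), several queries can be launched concurrently with asyncio.gather:

import asyncio

# Each call runs the same LMQL query; gather executes them concurrently
results = await asyncio.gather(hello(), hello(), hello())
print(results)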

The code below demonstrates how to look up information from Wikipedia and incorporate it into an LMQL prompt dynamically:

async def look_up(term):
    # Looks up term on Wikipedia
    url = f"https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles={term}&origin=*"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Get the first sentence on the first page
            page = (await response.json())["query"]["pages"]
            return list(page.values())[0]["extract"].split(".")[0]

@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term} ({await look_up(term)}):
        Hello[WHO]
        """
    from
        "openai/text-davinci-003"
    where
        STOPS_AT(WHO, "\\n")
    '''

print((await greet("Earth"))[0].prompt)

As an alternative to @lmql.query, you can use lmql.query(...) as a function that compiles a provided string of LMQL code into a Python function.

q = lmql.query('argmax "Hello[WHO]" from "openai/text-ada-001" where len(TOKENS(WHO)) < 10')
await q()

LMQL queries can also be integrated with langchain's Chain components, which allows for sequential prompting using multiple queries. The snippet below sets up a basic langchain chain of the kind that LMQL queries can plug into.

from langchain import LLMChain, PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (ChatPromptTemplate, HumanMessagePromptTemplate)

# Setup the chat model to be used by langchain
chat = ChatOpenAI(temperature=0.9)

# Build a prompt template asking for a company name for a given product
human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template="What is a good name for a company that makes {product}?",
        input_variables=["product"],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

# Run the chain
chain.run("colorful socks")

Lastly, by treating LMQL queries as Python functions, you can easily build pipelines by chaining functions together. Furthermore, the guaranteed output format of LMQL queries ensures ease of processing the returned values using data processing libraries like Pandas.

Here's an example of processing the output of an LMQL query with Pandas:

import pandas as pd

@lmql.query
async def generate_dogs(n: int):
    '''lmql
    sample(n=n)
        """Generate a dog with the following characteristics:
        Name:[NAME]
        Age: [AGE]
        Breed:[BREED]
        Quirky Move:[MOVE]
        """
    from
        "openai/text-davinci-003"
    where
        STOPS_BEFORE(NAME, "\\n") and STOPS_BEFORE(BREED, "\\n") and
        STOPS_BEFORE(MOVE, "\\n") and INT(AGE) and len(AGE) < 3
    '''

result = await generate_dogs(8)
df = pd.DataFrame([r.variables for r in result])
df

By employing LMQL as a Python library, users can make their code more efficient and structured, allowing for easier integration with other Python libraries and tools.

As shown above, LMQL can be used in various ways - as a standalone language, in the Playground, or as a Python library - and when integrated into Python projects, LMQL queries can be executed seamlessly alongside the rest of your code.

Conclusion

LMQL introduces an efficient and powerful approach to interact with language models, revolutionizing language model programming. By combining prompts, constraints, and scripting, LMQL offers a user-friendly interface for working with large language models, significantly improving efficiency and accuracy across diverse tasks. Its capabilities allow developers to leverage the full potential of language models without the burden of complex implementations, making language model interaction more accessible and cost-effective.

With LMQL, users can overcome challenges in LM interaction, including manual interactions, constraints on variable parts, and generalization of multi-part prompting. By automating the selection process and eager application of constraints during decoding, LMQL reduces the number of LM invocations, resulting in substantial time and cost savings. Moreover, LMQL's declarative, SQL-like approach simplifies the development process and abstracts away tokenization and implementation details, making it more portable and user-friendly.

In conclusion, LMQL represents a promising advancement in the realm of large language models and language model programming. Its efficiency, flexibility, and ease of use open up new possibilities for creating complex interactions and steering model generation without deep knowledge of the model's internals. By embracing LMQL, developers can make the most of language models, unleashing their potential across a wide range of language-based tasks with heightened efficiency and reduced computational costs.

Author Bio

Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst & Young and Globant, and now holds a data engineer position at Ebiquity Media, helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, went on to found startups, and earned a Master's degree from the faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.

LinkedIn