Your (soon-to-be) intelligent app
With LLMs, embedding models, vector databases, and model hosting, you have the key building blocks for creating intelligent applications. While the specific architecture will vary depending on your use case, a common pattern emerges:
- LLMs for reasoning and generation
- Embeddings and vector search for retrieval and memory
- Model hosting to serve these components at scale
This AI stack is integrated with traditional application components, such as backend services, APIs, frontend user interfaces, databases, and data pipelines. Additionally, intelligent applications often include components for AI-specific concerns, such as prompt management and optimization, data preparation and embedding generation, and AI safety, testing, and monitoring.
The rest of this section walks through an example architecture for a RAG-powered chatbot, showcasing how these components work together. The subsequent chapters will dive deeper into the end-to-end process of building production-grade intelligent applications.
Sample application – RAG chatbot
Consider a simple RAG-powered chatbot application that lets users converse with a set of documentation.
There are seven key components of this application:
- Chatbot UI: A website with a simple chatbot UI that communicates with the web server
- Web server: A Python Flask server to manage conversations between the user and the LLM
- Data ingestion extract, transform, load (ETL) pipeline: A Python script that ingests data from the data sources
- Embedding model: The OpenAI text-embedding-3-small model, hosted by OpenAI
- LLM: The OpenAI gpt-4-turbo model, hosted by OpenAI
- Vector store: MongoDB Atlas Vector Search
- MongoDB Atlas: A database-as-a-service for persisting conversations
Note
This simple example application does not include evaluation or observability modules.
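Before tracing the data flows, it may help to see how a server could wire these components together. The following is a minimal sketch, assuming the openai and pymongo Python packages and hypothetical database and collection names; the connection string is a placeholder:

```python
from flask import Flask
from openai import OpenAI
from pymongo import MongoClient

app = Flask(__name__)
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical database and collection names for this example
mongo = MongoClient("<your-atlas-connection-string>")
docs_collection = mongo["chatbot"]["docs"]          # document chunks + embeddings
conversations = mongo["chatbot"]["conversations"]   # persisted conversation state

EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4-turbo"
```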
In this architecture, there are two key data flows:
- Chat interaction: The user converses with the chatbot, which uses RAG to generate its responses
- Data ingestion: Bringing data from its original sources into the vector database
In the chat interaction, the chatbot UI communicates with the chatbot web server, which in turn interacts with the LLM, embedding model, and vector store. This occurs for every message that the user sends to the chatbot. Figure 2.1 shows the data flow for the chatbot application:
Figure 2.1: An example of a basic RAG chatbot conversation data flow
The data flow illustrated in Figure 2.1 can be described as follows:
1. The user sends a message to the chatbot from the web UI.
2. The web UI creates a request to the server with the user’s message.
3. The web server sends a request to the embedding model API to create a vector embedding for the user query. The embedding model API responds with the corresponding vector embedding.
4. The web server performs a vector search in the vector database using the query vector embedding. The vector store responds with the matching vector search results.
5. The server constructs a message for the LLM to respond to, consisting of a system prompt and a new message that combines the user’s original message with the content retrieved from the vector search. The LLM then generates a response.
6. The server saves the conversation state to the database.
7. The server returns the LLM-generated message to the user in a response to the original request from the web UI.
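As a concrete illustration of steps 1 to 7, here is a minimal sketch of the server’s chat endpoint. It builds on the wiring sketch above and assumes an Atlas Vector Search index named vector_index over an embedding field, with each chunk stored alongside a text field; treat it as an illustration of the flow, not a production implementation:

```python
from flask import request, jsonify

SYSTEM_PROMPT = "Answer using only the provided documentation excerpts."

@app.route("/chat", methods=["POST"])
def chat():
    user_message = request.json["message"]          # steps 1-2: message arrives

    # Step 3: embed the user query.
    query_vector = openai_client.embeddings.create(
        model=EMBEDDING_MODEL, input=user_message
    ).data[0].embedding

    # Step 4: vector search for relevant chunks (Atlas $vectorSearch stage).
    results = docs_collection.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",    # assumed index name
            "path": "embedding",        # assumed embedding field
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 4,
        }
    }])
    context = "\n\n".join(doc["text"] for doc in results)

    # Step 5: ask the LLM to answer using the retrieved context.
    completion = openai_client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{user_message}\n\nContext:\n{context}"},
        ],
    )
    answer = completion.choices[0].message.content

    # Step 6: persist the conversation turn.
    conversations.insert_one({"user": user_message, "assistant": answer})

    # Step 7: return the generated answer to the web UI.
    return jsonify({"answer": answer})
```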
A data ingestion pipeline prepares and enriches data, generates embeddings using the embedding model, and populates the vector store and traditional database. This pipeline runs as a batch job every 24 hours. Figure 2.2 shows an example of a data ingestion pipeline:
Figure 2.2: An example of a RAG chatbot data ingestion ETL pipeline
Let’s look at the data flow shown in Figure 2.2:
1. The data ingestion ETL pipeline pulls in data from various data sources.
2. The ETL pipeline cleans the data into a consistent format and breaks it into chunks.
3. The ETL pipeline calls the embedding model API to generate a vector embedding for each data chunk.
4. The ETL pipeline stores the chunks along with their vector embeddings in a vector database.
5. The vector database indexes the embeddings for use with vector search.
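Continuing the sketch under the same assumptions (the docs_collection and OpenAI client from the wiring sketch), the pipeline body might look like the following; load_source_documents is a hypothetical loader for the data sources, and a scheduler such as cron would trigger run_ingestion every 24 hours:

```python
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-size chunking with overlap; real pipelines often split
    # on semantic boundaries such as headings or paragraphs instead.
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

def run_ingestion():
    docs_collection.delete_many({})         # simple full refresh each run
    for doc in load_source_documents():     # hypothetical loader
        for piece in chunk(doc["text"]):
            # One embedding per chunk; batching inputs would reduce API calls.
            embedding = openai_client.embeddings.create(
                model=EMBEDDING_MODEL, input=piece
            ).data[0].embedding
            docs_collection.insert_one({
                "text": piece,
                "embedding": embedding,     # indexed by vector_index in Atlas
                "source": doc["source"],
            })
```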
While a simple architecture like this can be used to build compelling prototypes, transitioning from prototype to production and continuously iterating on the application requires addressing many additional considerations:
- Data ingestion strategy: Acquiring, cleaning, and preparing the data that will be ingested into the vector store or database for retrieval.
- Advanced retrieval patterns: Incorporating techniques for efficient and accurate retrieval of relevant information from the vector store or database, such as combining semantic search with traditional filtering, AI-based reranking, and query mutation (see the sketch after this list).
- Evaluation and testing: Adding modules for evaluating model outputs, testing end-to-end application flows, and monitoring for potential biases or errors.
- Scalability and performance optimization: Implementing optimizations such as caching, load balancing, and efficient resource management to handle increasing workloads and ensure consistent responsiveness.
- Security and privacy: Securing the application so that users can only interact with data they have permission to access, and ensuring that user data is handled in accordance with relevant policies, standards, and laws.
- User experience and interaction design: Incorporating new generative AI interfaces and interaction patterns, such as streaming responses, answer confidence, and source citation.
- Continuous improvement and model updates: Building processes and systems to safely and reliably update the AI models and hyperparameters in the intelligent application.
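To illustrate one of these, here is a way the retrieval step could combine semantic search with traditional filtering using Atlas Vector Search’s pre-filter, assuming the chunks carry the source metadata field from the ingestion sketch; reranking and query mutation would layer on top of a function like this:

```python
def filtered_vector_search(query_vector, source, limit=4):
    # Combines semantic search with a traditional metadata filter by
    # restricting candidates to a single documentation source. The
    # filtered field must be declared as a filter field in the Atlas
    # Vector Search index definition for this to work.
    return docs_collection.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": limit,
            "filter": {"source": source},
        }
    }])
```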
Implications of intelligent applications for software engineering
The rise of intelligent applications has significant implications for how software is made. Developing these intelligent applications requires an extension of traditional development skills. The AI engineer must possess an understanding of prompt engineering, vector search, and evaluation, as well as familiarity with the latest AI techniques and architectures. While a complete understanding of the underlying neural networks is not necessary, basic knowledge of natural language processing (NLP) is helpful.
Intelligent application development also introduces new challenges and considerations, such as data management and integration with AI components, testing and debugging of AI-driven functionality, and addressing the ethical, safety, and security implications of AI outputs. The compute-heavy nature of AI workloads also necessitates focusing on scalability and cost optimization. Developers building traditional software generally do not need to face such concerns.
To address these challenges, software development teams must adapt their processes and adopt novel approaches and best practices. This entails implementing AI governance, bridging the gap between software and ML/AI teams, and adjusting the development lifecycle for intelligent app needs.