This article is an excerpt from the book Vector Search for Practitioners with Elastic, by Bahaaldine Azarmi and Jeff Vestal, a guide to optimizing your search capabilities in Elastic by operationalizing and fine-tuning vector search, enhancing search relevance, and improving overall search performance.
Everyone knows the age-old culinary dilemma, “What can I cook with the ingredients I have?” Many people find themselves with an array of ingredients but without the inspiration or knowledge to whip up a dish. This everyday issue was the spark for our idea: CookBot.
CookBot is not just any AI. It’s conceived as an advanced culinary assistant that not only suggests recipes based on the available ingredients but also understands the nuances of user queries, adapts to individual dietary preferences and restrictions, and generates insightful culinary recommendations.
Our objective was to infuse CookBot with RAG, ELSER, and RRF technologies. These technologies are designed to enhance the semantic understanding of queries, optimize information retrieval, and generate relevant, personalized responses. By harnessing the capabilities of these advanced tools, we aimed for CookBot to be able to provide seamless, context-aware culinary assistance tailored to each user’s unique needs.
Figure: CookBot powered by Elastic
The Allrecipes.com dataset, in its raw CSV format, is a treasure trove of diverse and detailed culinary information. Thus, it is the perfect foundation to train our CookBot. It houses an extensive range of recipes, each encapsulated in a unique entry brimming with an array of information.
You can find and download the dataset here, as it will be used later in the chapter:
https://www.kaggle.com/datasets/nguyentuongquang/all-recipes
To illustrate the richness of this dataset, let’s consider a single entry:
"group","name","rating","n_rater","n_ reviewer","summary","process","ingredient"
"breakfast-and-brunch.eggs.breakfast-burritos","Ham and Cheese
Breakfast Tortillas",0,0,44,"This is great for a special brunch or even a quick and easy dinner. Other breakfast meats can be used, but the deli ham is the easiest since it is already fully cooked.","prep:
30 mins,total: 30 mins,Servings: 4,Yield: 4 servings","12 eggs + <U+2153> cup milk + 3 slices cooked ham, diced + 2 green onions, minced + salt and pepper to taste + 4 ounces Cheddar cheese, shredded
+ 4 (10 inch) flour tortillas + cup salsa"
Each entry in the dataset represents a unique recipe and encompasses various fields:
group: the category path the recipe belongs to (for example, breakfast-and-brunch.eggs.breakfast-burritos)
name: the recipe title
rating, n_rater, and n_reviewer: community feedback signals for the recipe
summary: a short description of the dish
process: preparation time, total time, servings, and yield
ingredient: the full list of ingredients
The detailed information offered by each field gives us a broad and varied information space, aiding the retriever in navigating the data and ensuring the generator can accurately respond to a diverse range of culinary queries. As we move forward, we will discuss how we indexed this dataset using Elasticsearch, the role of ELSER and RRF in effectively retrieving data, and how the GPT-4 model generates relevant, personalized responses based on the retrieved data.
To transform the Allrecipes.com data into a searchable database, we first need to parse the CSV file and subsequently create an Elasticsearch index where data will be stored and queried. Let’s walk through this process implemented as part of the Python code.
First, we need to establish a connection with our Elasticsearch instance. This connection is handled by the Elasticsearch object from the Elasticsearch Python module:
from elasticsearch import Elasticsearch
es = Elasticsearch()
In this case, we assume that our Elasticsearch instance runs locally with default settings. If it doesn’t, we will need to provide the appropriate host and port information to the Elasticsearch class.
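If the cluster runs elsewhere (for example, on Elastic Cloud), the connection can be configured explicitly. The following is a minimal sketch, assuming a recent elasticsearch Python client; the endpoint URL and API key are placeholders, not values from the book:
from elasticsearch import Elasticsearch

# Placeholder endpoint and credential -- replace with your own cluster details
es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:9243",  # hypothetical endpoint
    api_key="YOUR_API_KEY"  # hypothetical API key
)
print(es.info())  # sanity check that the cluster is reachable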
The next step is to define an index where our recipes will be stored. An index in Elasticsearch is like a database in traditional database systems. In this case, we’ll call our index recipes:
index_name = 'recipes'
Now, we need to create a mapping for our index. A mapping is like a schema in a SQL database and defines the types of each field in the documents that will be stored in the index. We will define a mapping as a Python dictionary:
mapping = {
    "mappings": {
        "properties": {
            "group": { "type": "text" },
            "name": { "type": "text" },
            "rating": { "type": "text" },
            "n_rater": { "type": "text" },
            "n_reviewer": { "type": "text" },
            "summary": {
                "type": "text",
                "analyzer": "english"
            },
            "process": { "type": "text" },
            "ingredient": { "type": "text" },
            "ml.tokens": { "type": "rank_features" }
        }
    }
}
Here, all fields are defined as text, which means they are full-text searchable. We also specify that the summary field should be analyzed using the English analyzer, which helps optimize searches in English text by taking into account things such as stemming and stop words. Finally, we define the ml.tokens field with the rank_features type; this is where ELSER will store the token set produced by expanding the terms passed to it.
Once we’ve defined our mapping, we can create the index in Elasticsearch with the following:
es.indices.create(index=index_name, body=mapping)
This sends a request to Elasticsearch to create an index with the specified name and mapping.
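As an optional safeguard not shown above, you can check whether the index already exists before creating it, which makes the script safe to re-run:
# Only create the index if it does not already exist
if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=mapping)
else:
    print(f"Index '{index_name}' already exists, skipping creation")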
With our index ready, we can now read our dataset from the CSV file. We’ll use pandas, a powerful data manipulation library in Python, to do this:
import pandas as pd
with open('recipe_dataset.csv', 'r', encoding='utf-8', errors='ignore') as file:
    df = pd.read_csv(file)
This code opens the CSV file and reads it into a pandas dataframe, a two-dimensional tabular data structure that’s perfect for manipulating structured data.
To index the data into Elasticsearch, we need to convert our dataframe into a list of dictionaries, where each dictionary corresponds to a row (i.e., a document or recipe) in the dataframe:
recipes = df.to_dict('records')
print(f"Number of documents: {len(recipes)}")
At this point, we have our dataset ready to index in Elasticsearch. However, considering the size of the dataset, it is advisable to use the bulk indexing feature for efficient data ingestion. This will be covered in the next section.
Let’s look into the step-by-step process of bulk indexing your dataset in Elasticsearch.
Before we proceed to bulk indexing, we need to set up a pipeline to preprocess the documents. Here, we will use the elser-v1-recipes pipeline, which utilizes the ELSER model for semantic indexing. The pipeline is defined as follows:
[
  {
    "inference": {
      "model_id": ".elser_model_1",
      "target_field": "ml",
      "field_map": {
        "ingredient": "text_field"
      },
      "inference_config": {
        "text_expansion": {
          "results_field": "tokens"
        }
      }
    }
  }
]
The pipeline includes an inference processor that uses the ELSER pre-trained model to perform semantic indexing. It maps the ingredient field from the recipe data to the text_field object of the ELSER model. The output (the expanded tokens from the ELSER model) is stored in the tokens field under the ml field in the document.
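The definition above is the pipeline body itself; it still needs to be registered on the cluster under the name elser-v1-recipes. One way to do that, sketched here with the 8.x Python client and assuming the ELSER model is already deployed and started, is the ingest API:
# Register the ingest pipeline used during bulk indexing.
# Assumes the ELSER model (.elser_model_1) is already downloaded and started.
es.ingest.put_pipeline(
    id="elser-v1-recipes",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_1",
                "target_field": "ml",
                "field_map": {"ingredient": "text_field"},
                "inference_config": {
                    "text_expansion": {"results_field": "tokens"}
                }
            }
        }
    ]
)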
Given the size of the Allrecipes.com dataset, it’s impractical to index each document individually. Instead, we can utilize Elasticsearch’s bulk API, which allows us to index multiple documents in a single request. First, we need to generate a list of dictionaries, where each dictionary corresponds to a bulk index operation:
bulk_index_body = []
for index, recipe in enumerate(recipes):
    document = {
        "_index": "recipes",
        "pipeline": "elser-v1-recipes",
        "_source": recipe
    }
    bulk_index_body.append(document)
In this loop, we iterate over each recipe (a dictionary) in our recipes list and then construct a new dictionary with the necessary information for the bulk index operation. This dictionary specifies the name of the index where the document will be stored (recipes), the ingest pipeline to be used to process the document (elser-v1-recipes), and the document source itself (recipe).
With our bulk_index_body array ready, we can now perform the bulk index operation:
from elasticsearch import helpers
from elasticsearch.helpers import BulkIndexError

try:
    response = helpers.bulk(es, bulk_index_body, chunk_size=500,
                            request_timeout=60*3)
    print("\nRESPONSE:", response)
except BulkIndexError as e:
    for error in e.errors:
        print(f"Document ID: {error['index']['_id']}")
        print(f"Error Type: {error['index']['error']['type']}")
        print(f"Error Reason: {error['index']['error']['reason']}")
We use the helpers.bulk() function from the Elasticsearch library, passing it our Elasticsearch connection (es), the bulk_index_body array we just created, a chunk_size value of 500 (send 500 documents per request), and a request_timeout value of 180 seconds (allow each request up to 3 minutes before timing out, since indexing can take a long time with ELSER).
The helpers.bulk() function will return a response indicating the number of operations attempted and the number of errors, if any.
If any errors occur during the bulk index operation, these will be raised as BulkIndexError. We can catch this exception and iterate over its errors attribute to get information about each individual error, including the ID of the document that caused the error, the type of error, and the reason for it.
At the end of this process, you will have successfully indexed your entire Allrecipes.com dataset in Elasticsearch, ready for it to be retrieved and processed by your RAG-enhanced CookBot.
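As a quick sanity check, which is a sketch rather than part of the book's code, you can run a text_expansion query against the ml.tokens field to confirm that the ELSER-expanded documents are retrievable; the query text here is arbitrary:
# Semantic retrieval against the ELSER token field (query text is an example)
response = es.search(
    index="recipes",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_1",
                "model_text": "easy breakfast with eggs and ham"
            }
        }
    },
    size=3
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])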
In closing, the infusion of RAG, ELSER, and RRF technologies into CookBot elevates culinary exploration. With Elasticsearch indexing and the Allrecipes.com dataset, CookBot transcends traditional kitchen boundaries, offering personalized, context-aware assistance. This journey signifies the convergence of cutting-edge AI and the rich tapestry of culinary possibilities. As CookBot orchestrates flavor symphonies, the future of cooking is redefined, promising a delightful harmony of technology and gastronomy for every user. Embrace the evolution—where CookBot's intelligence transforms mere ingredients into a canvas for culinary innovation.
Bahaaldine Azarmi, Global VP Customer Engineering at Elastic, guides companies as they leverage data architecture, distributed systems, machine learning, and generative AI. He leads the customer engineering team, focusing on cloud consumption, and is passionate about sharing knowledge to build and inspire a community skilled in AI.
Jeff Vestal has a rich background spanning over a decade in financial trading firms and extensive experience with Elasticsearch. He offers a unique blend of operational acumen, engineering skills, and machine learning expertise. As a Principal Customer Enterprise Architect, he excels at crafting innovative solutions, leveraging Elasticsearch's advanced search capabilities, machine learning features, and generative AI integrations, adeptly guiding users to transform complex data challenges into actionable insights.