Pipeline 1: Generator and Commentator

A revolution is on its way in computer vision with automated video generation and analysis. We will introduce the Generator AI agent with Sora in The AI-generated video dataset section. We will explore how OpenAI Sora was used to generate the videos for this chapter with a text-to-video diffusion transformer. The technology itself is something we have expected and experienced to some extent in professional film-making environments. However, the novelty lies in the fact that the software has become mainstream, available in a few clicks with inVideo, for example!

In The Generator and the Commentator section, we will extend the scope of the Generator to collect and process the AI-generated videos. The Generator splits the videos into frames and works with the Commentator, an OpenAI LLM, to produce comments on samples of video frames.

The Generator’s task begins by producing the AI-generated video dataset.

The AI-generated video dataset

The first AI agent in this project is a text-to-video diffusion transformer model that generates the video dataset we will implement. The videos for this chapter were specifically generated by Sora, a text-to-video AI model released by OpenAI in February 2024. You can access Sora to view public AI-generated videos and create your own at https://ai.invideo.io/. The platform also offers free videos with flexible copyright terms, which you can check at https://invideo.io/terms-and-conditions/.

Once you have gone through this chapter, you can also create your own video dataset with any source of videos, such as smartphones, video stocks, and social media.

AI-generated videos enhance the speed of creating video datasets. Teams do not have to spend time finding videos that fit their needs. They can obtain a video quickly with a prompt that can be an idea expressed in a few words. AI-generated videos represent a huge leap into the future of AI applications. Sora’s potential applies to many industries, including filmmaking, education, and marketing. Its ability to generate nuanced video content from simple text prompts opens new avenues for creative and educational outputs.

Although AI-generated videos (and, in particular, diffusion transformers) have changed the way we create world simulations, this represents a risk for jobs in many areas, such as filmmaking. The risk of deep fakes and misinformation is real. At a personal level, we must take ethical considerations into account when we implement Generative AI in a project, thus producing constructive, ethical, and realistic content.

Let’s see how a diffusion transformer can produce realistic content.

How does a diffusion transformer work?

At the core of Sora, as described by Liu et al., 2024 (see the References section), is a diffusion transformer model that operates between an encoder and a decoder. It uses user text input to guide the content generation, associating it with patches from the encoder. The model iteratively refines these noisy latent representations, enhancing their clarity and coherence. Finally, the refined data is passed to the decoder to reconstruct high-fidelity video frames. The technology involved includes vision transformers such as CLIP and LLMs such as GPT-4, as well as other components OpenAI continually includes in its vision model releases.

The encoder and decoder are integral components of the overall diffusion model, as illustrated in Figure 10.3. They both play a critical role in the workflow of the transformer diffusion model:

  • Encoder: The encoder’s primary function is to compress input data, such as images or videos, into a lower-dimensional latent space. The encoder thus transforms high-dimensional visual data into a compact representation while preserving crucial information. The lower-dimensional latent space obtained is a compressed representation of high-dimensional data, retaining essential features while reducing complexity. For example, a high-resolution image (1024x1024 pixels, 3 color channels, roughly 3.1 million values) can be compressed by an encoder into a vector of 1,000 values, a reduction of more than 3,000-fold, while capturing key details like shape and texture. This makes processing and manipulating images more efficient.
  • Decoder: The decoder reconstructs the original data from the latent representation produced by the encoder. It performs the encoder’s reverse operation, transforming the low-dimensional latent space back into high-dimensional pixel space, thus generating the final output, such as images or videos.

Figure 10.3: The encoding and decoding workflow of video diffusion models

The process of a diffusion transformer model goes through five main steps, as you can observe in the previous figure (a toy sketch of this flow follows the list):

  1. The visual encoder transforms datasets of images into a lower-dimensional latent space.
  2. The visual encoder splits the lower-dimensional latent space into patches that are like words in a sentence.
  3. The diffusion transformer associates user text input with its dictionary of patches.
  4. The diffusion transformer iteratively refines noisy image representations generated to produce coherent frames.
  5. The visual decoder reconstructs the refined latent representations into high-fidelity video frames that align with the user’s instructions.
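
To make these five steps concrete, here is a toy, purely illustrative NumPy sketch of the flow. Every function, dimension, and update rule below is an assumption chosen for readability; it is not Sora's implementation, only the encoder → patches → iterative denoising → decoder pattern described above:

import numpy as np

rng = np.random.default_rng(0)

# 1. Visual encoder: compress frames (H x W x 3) into a lower-dimensional latent space
def visual_encoder(frames, latent_dim=64):
    flat = frames.reshape(frames.shape[0], -1)                    # (n_frames, H*W*3)
    projection = rng.standard_normal((flat.shape[1], latent_dim)) * 0.01
    return flat @ projection                                      # (n_frames, latent_dim)

# 2. Split the latent space into patches, analogous to words in a sentence
def to_patches(latents, patch_size=16):
    n, d = latents.shape
    return latents.reshape(n, d // patch_size, patch_size)

# 3 and 4. "Diffusion transformer": condition noisy patches on the text prompt
# and iteratively refine them toward coherent latent frames
def denoise(noisy_patches, text_embedding, steps=10):
    x = noisy_patches
    for _ in range(steps):
        x = x - 0.1 * (x - text_embedding)   # toy update pulling patches toward the prompt
    return x

# 5. Visual decoder: project the refined latents back to pixel space
def visual_decoder(patches, frame_shape=(32, 32, 3)):
    flat = patches.reshape(patches.shape[0], -1)
    projection = rng.standard_normal((flat.shape[1], int(np.prod(frame_shape)))) * 0.01
    return (flat @ projection).reshape((-1, *frame_shape))

frames = rng.random((8, 32, 32, 3))                # stand-in for training frames
patches = to_patches(visual_encoder(frames))       # latent patches define the working shape
text_embedding = rng.standard_normal(16)           # stand-in for the prompt embedding
noise = rng.standard_normal(patches.shape)         # generation starts from noise
video = visual_decoder(denoise(noise, text_embedding))
print(video.shape)                                 # (8, 32, 32, 3): reconstructed "video frames"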

The video frames can then be played in a sequence. Every second of a video contains a set of frames. We will be deconstructing the AI-generated videos into frames and commenting on these frames later. But for now, we will analyze the video dataset produced by the diffusion transformer.

Analyzing the diffusion transformer model video dataset

Open the Videos_dataset_visualization.ipynb notebook on GitHub. Hopefully, you have installed the environment as described earlier in this chapter. We will move on to writing the download and display functions we need.

Video download and display functions

The three main functions each take file_name (the name of the video file) as an argument. They download videos, display videos, and display frames from the videos.

download_video downloads one video at a time from the GitHub dataset, calling the download function defined in the GitHub subsection of The environment:

# downloading file from GitHub
def download_video(file_name):
  # Define your variables
  directory = "Chapter10/videos"
  download(directory, file_name)

display_video(file_name) displays the downloaded video file by first encoding it in base64, a binary-to-text encoding scheme that represents binary data in ASCII string format. The encoded video is then displayed in HTML:

# Open the file in binary mode
def display_video(file_name):
  with open(file_name, 'rb') as file:
    video_data = file.read()
  # Encode the video file as base64
  video_url = b64encode(video_data).decode()
  # Create an HTML string with the embedded video
  html = f'''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{video_url}" type="video/mp4">
  Your browser does not support the video tag.
  </video>
  '''
  # Display the video by returning the HTML object
  return HTML(html)

display_video_frame takes file_name, frame_number, and size (the image size to display) as arguments to display a frame of the video. The function first opens the video file and then extracts the frame specified by frame_number:

def display_video_frame(file_name, frame_number, size):
    # Open the video file
    cap = cv2.VideoCapture(file_name)
    # Move to the frame_number
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
    # Read the frame
    success, frame = cap.read()
    if not success:
      return "Failed to grab frame"

The function converts the frame from BGR (blue, green, and red), the channel order used by OpenCV, to RGB (red, green, and blue), converts the resulting array to a PIL image, and resizes it with the size parameter:

    # Convert the color from BGR to RGB
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Convert to PIL image and resize
    img = Image.fromarray(frame)
    img = img.resize(size, Image.LANCZOS)  # Resize image to specified size

Finally, the function encodes the image in string format with base64 and displays it in HTML:

    # Convert the PIL image to a base64 string to embed in HTML
    buffered = BytesIO()
    img.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    # Create an HTML string with the embedded image
    html_str = f'''
    <img src="data:image/jpeg;base64,{img_str}" width="{size[0]}" height="{size[1]}">
    '''
    # Display the image
    display(HTML(html_str))
    # Return the HTML object for further use if needed
    return HTML(html_str)

Once the environment is installed and the video processing functions are ready, we will display the introduction video.

Introduction video (with audio)

The following cells download and display the introduction video using the functions we created in the previous section. A video file is selected and downloaded with the download_video function:

# select file
print("Collecting video")
file_name="AI_Professor_Introduces_New_Course.mp4"
#file_name = "AI_Professor_Introduces_New_Course.mp4" # Enter the name of the video file to process here
print(f"Video: {file_name}")
# Downloading video
print("Downloading video: downloading from GitHub")
download_video(file_name)

The output confirms the selection and download status:

Collecting video
Video: AI_Professor_Introduces_New_Course.mp4
Downloading video: downloading from GitHub
Downloaded 'AI_Professor_Introduces_New_Course.mp4' successfully.

We can choose to display only a single frame of the video as a thumbnail with the display_video_frame function by providing the file name, the frame number, and the image size to display. The program will first compute frame_count (the number of frames in the video), frame_rate (the number of frames per second), and video_duration (the duration of the video). Then, it will make sure frame_number (the frame we want to display) doesn’t exceed frame_count. Finally, it displays the frame as a thumbnail:

print("Displaying a frame of video: ",file_name)
video_capture = cv2.VideoCapture(file_name)
frame_count = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
print(f'Total number of frames: {frame_count}')
frame_rate = video_capture.get(cv2.CAP_PROP_FPS)
print(f"Frame rate: {frame_rate}")
video_duration = frame_count / frame_rate
print(f"Video duration: {video_duration:.2f} seconds")
video_capture.release()
frame_number=5
if frame_number > frame_count and frame_count>0:
  frame_number = 1
display_video_frame(file_name, frame_number, size=(135, 90));

Here, frame_number is set to 5, but you can choose another value. The output shows the information on the video and the thumbnail:

Displaying a frame of video:  /content/AI_Professor_Introduces_New_Course.mp4
Total number of frames: 340

We can also display the full video if needed:

#print("Displaying video: ",file_name)
display_video(file_name)

The video will be displayed and can be played with the audio track:


Figure 10.4: AI-generated video

Let’s describe and display AI-generated videos in the /videos directory of this chapter’s GitHub directory. You can host this dataset in another location and scale it to the volume that meets your project’s specifications. The educational video dataset of this chapter is listed in lfiles:

lfiles = [
    "jogging1.mp4",
    "jogging2.mp4",
    "skiing1.mp4",
    …
    "female_player_after_scoring.mp4",
    "football1.mp4",
    "football2.mp4",
    "hockey1.mp4"
]

We can now move on and display any video we wish.

Displaying thumbnails and videos in the AI-generated dataset

This section is a generalization of the Introduction video (with audio) section. This time, instead of downloading one video, it downloads all the videos and displays the thumbnails of all the videos. You can then select a video in the list and display it.

The program first calculates the number of videos in the list and then collects the video dataset:

lf=len(lfiles)
for i in range(lf):
  file_name=lfiles[i]
  print("Collecting video",file_name)
  print("Downloading video",file_name)
  download_video(file_name)

The output shows the file names of the downloaded videos:

Collecting video jogging1.mp4
Downloading video jogging1.mp4
Downloaded 'jogging1.mp4' successfully.
Collecting video jogging2.mp4…

The program then goes through the list, prints each video's information, and displays its thumbnail:

for i in range(lf):
  file_name=lfiles[i]
  print("Displaying a frame of video: ",file_name)
  video_capture = cv2.VideoCapture(file_name)
  print(f'Total number of frames: {int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))}')
  video_capture.release()
  display_video_frame(file_name, frame_number=5, size=(100, 110))

The information on the video and its thumbnail is displayed:

Displaying a frame of video:  skiing1.mp4
Total number of frames: 58
Frame rate: 30.0
Video duration: 1.93 seconds

You can select a video in the list and display it:

file_name="football1.mp4" # Enter the name of the video file to process here
#print("Displaying video: ",file_name)
display_video(file_name)

You can click on the video and watch it:


Figure 10.5: Video of a football player

We have explored how the AI-generated videos were produced and visualized the dataset. We are now ready to build the Generator and the Commentator.

The Generator and the Commentator

The dataset of AI-generated videos is ready. We will now build the Generator and the Commentator, which process one video at a time, making scaling seamless. An indefinite number of videos can be processed one at a time, requiring only a CPU and limited disk space. The Generator and the Commentator work together, as shown in Figure 10.6. These AI agents will produce raw videos from text and then split them into frames that they will comment on:


Figure 10.6: The Generator and the Commentator work together to comment on video frames

The Generator and the Commentator produce the commented frames required in four main steps that we will build in Python:

  1. The Generator generates the text-to-video inVideo video dataset based on the video production team’s text input. In this chapter, it is a dataset of sports videos.
  2. The Generator runs a scaled process by selecting one video at a time.
  3. The Generator splits the video into frames (images).
  4. The Commentator samples the frames (images) and comments on them with an OpenAI LLM. Each commented frame is saved with:
    • Unique ID
    • Comment
    • Frame
    • Video file name

We will now build the Generator and the Commentator in Python, starting with the AI-generated videos. Open Pipeline_1_The_Generator_and_the_Commentator.ipynb in the chapter’s GitHub directory. See The environment section of this chapter for a description of the Installing the environment section of this notebook. The process of going from a video to comments on a sample of frames takes only three straightforward steps in Python:

  1. Displaying the video
  2. Splitting the video into frames
  3. Commenting on the frames

We will define functions for each step and call them in the Pipeline-1 Controller section of the program. The first step is to define a function to display a video.

Step 1. Displaying the video

The download function is in the GitHub subsection of the Installing the environment section of this notebook. It will be called by the Vector Store Administrator-Pipeline 1 in the Administrator-Pipeline 1 section of this notebook on GitHub.

display_video(file_name) is the same as defined in the previous section, The AI-generated video dataset:

# Open the file in binary mode
def display_video(file_name):
  with open(file_name, 'rb') as file:
      video_data = file.read()
…
  # Return the HTML object
  return HTML(html)

The downloaded video will now be split into frames.

Step 2. Splitting video into frames

The split_file(file_name) function extracts frames from a video, as in the previous section, The AI-generated video dataset. However, in this case, we will expand the function to save frames as JPEG files:

def split_file(file_name):
  video_path = file_name
  cap = cv2.VideoCapture(video_path)
  frame_number = 0
  while cap.isOpened():
      ret, frame = cap.read()
      if not ret:
          break
      cv2.imwrite(f"frame_{frame_number}.jpg", frame)
      frame_number += 1
      print(f"Frame {frame_number} saved.")
  cap.release()

We have split the video into frames and saved them as JPEG images with their respective frame number, frame_number. The Generator’s job finishes here and the Commentator now takes over.

Step 3. Commenting on the frames

The Generator has gone from text-to-video to splitting the video and saving the frames as JPEG frames. The Commentator now takes over to comment on the frames with three functions:

  • generate_openai_comments(filename) asks the GPT-4 series vision model to analyze a frame and produce a response that contains a comment describing the frame
  • generate_comment(response_data) extracts the comment from the response
  • save_comment(comment, frame_number, file_name) saves the comment

We need to build the Commentator’s extraction function first:

def generate_comment(response_data):
    """Extract relevant information from GPT-4 Vision response."""
    try:
        caption = response_data.choices[0].message.content
        return caption
    except (KeyError, AttributeError):
        print("Error extracting caption from response.")
        return "No caption available."

We then write a function to save the extracted comment in a CSV file that bears the same name as the video file:

def save_comment(comment, frame_number, file_name):
    """Save the comment to a text file formatted for seamless loading into a pandas DataFrame."""
    # Append .csv to the provided file name to create the complete file name
    path = f"{file_name}.csv"
    # Check if the file exists to determine if we need to write headers
    write_header = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        if write_header:
            writer.writerow(['ID', 'FrameNumber', 'Comment', 'FileName'])  # Write the header if the file is being created
        # Generate a unique UUID for each comment
        unique_id = str(uuid.uuid4())
        # Write the data
        writer.writerow([unique_id, frame_number, comment, file_name])

The goal is to save the comment in a format that can directly be upserted to Pinecone (see the sketch after this list):

  • ID: A unique string ID generated with str(uuid.uuid4())
  • FrameNumber: The frame number of the commented JPEG
  • Comment: The comment generated by the OpenAI vision model
  • FileName: The name of the video file
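
As a hedged illustration of what "directly upserted" could look like, the following sketch reads the CSV produced by save_comment and builds records in the id/values/metadata dictionary format accepted by the Pinecone client. The embed() helper is a hypothetical placeholder for a real embedding model, and the upsert call is only shown as a comment, since setting up the Pinecone index is outside the scope of this snippet:

import pandas as pd

def embed(text):
    # Hypothetical placeholder: replace with a real embedding model
    # (for example, an OpenAI text-embedding call) before upserting
    return [float(len(text))]

def csv_to_pinecone_records(file_name):
    # Load the comments saved by save_comment()
    df = pd.read_csv(f"{file_name}.csv")
    records = []
    for _, row in df.iterrows():
        records.append({
            "id": row["ID"],                      # unique string ID
            "values": embed(row["Comment"]),      # embedding of the comment
            "metadata": {
                "frame_number": int(row["FrameNumber"]),
                "comment": row["Comment"],
                "file_name": row["FileName"],
            },
        })
    return records

# records = csv_to_pinecone_records("skiing1.mp4")
# index.upsert(vectors=records)  # assuming an existing Pinecone index object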

The Commentator’s main function is to generate comments with the OpenAI vision model. However, in this program’s scenario, we will not comment on and save all the frames, only a sample of them. The program first determines the number of frames to process:

def generate_openai_comments(filename):
  video_folder = "/content"  # Folder containing your image frames
  total_frames = len([file for file in os.listdir(video_folder) if file.endswith('.jpg')])

Then, a sample frequency is set that can be modified along with a counter:

  nb=3      # sample frequency
  counter=0 # sample frequency counter

The Commentator will then go through the sampled frames and request a comment:

  for frame_number in range(total_frames):
      counter+=1 # sampler
      if counter==nb and counter<total_frames:
        counter=0
        print(f"Analyzing frame {frame_number}...")
        image_path = os.path.join(video_folder, f"frame_{frame_number}.jpg")
        try:
            with open(image_path, "rb") as image_file:
                image_data = image_file.read()
                response = openai.ChatCompletion.create(
                    model="gpt-4-vision-preview",

The message is very concise: "What is happening in this image?" The message also includes the image of the frame:

                    messages=[
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": "What is happening in this image?"},
                                {
                                    "type": "image",
                                    "image_url": f"data:image/jpeg;base64,{base64.b64encode(image_data).decode('utf-8')}"
                                },
                            ],
                       }
                    ],
                    max_tokens=150,
               )

Once a response is returned, the generate_comment and save_comment functions are called to extract and save the comment, respectively:

            comment = generate_comment(response)
            save_comment(comment, frame_number, filename)
        except FileNotFoundError:
            print(f"Error: Frame {frame_number} not found.")
        except Exception as e:
            print(f"Unexpected error: {e}")

The final function we require of the Commentator is to display the comments by loading the CSV file produced in a pandas DataFrame:

# Read the video comments file into a pandas DataFrame
def display_comments(file_name):
  # Append .csv to the provided file name to create the complete file name
  path = f"{file_name}.csv"
  df = pd.read_csv(path)
  return df

The function returns the DataFrame with the comments. An administrator controls Pipeline 1, the Generator, and the Commentator.

Pipeline 1 controller

The controller runs jobs for the preceding three steps of the Generator and the Commentator. It begins with Step 1, which includes selecting a video, downloading it, and displaying it. In an automated pipeline, these functions can be separated. For example, a script would iterate through a list of videos, automatically select each one, and encapsulate the controller functions. In this case, in a pre-production and educational context, we will collect, download, and display the videos one by one:

session_time = time.time()  # Start timing before the request
# Step 1: Displaying the video
# select file
print("Step 1: Collecting video")
file_name = "skiing1.mp4" # Enter the name of the video file to process here
print(f"Video: {file_name}")
# Downloading video
print("Step 1:downloading from GitHub")
directory = "Chapter10/videos"
download(directory,file_name)
# Displaying video
print("Step 1:displaying video")
display_video(file_name)

The controller then splits the video into frames:

# Step 2.Splitting video
print("Step 2: Splitting the video into frames")
split_file(file_name)

The controller then activates the Commentator to produce comments on the frames of the video:

# Step 3.Commenting on the video frames
print("Step 3: Commenting on the frames")
start_time = time.time()  # Start timing before the request
generate_openai_comments(file_name)
response_time = time.time() - session_time  # Measure response time

The response time is measured as well. The controller then adds additional outputs to display the number of frames, the comments, the content generation time, and the total controller processing time:

# number of frames
video_folder = "/content"  # Folder containing your image frames
total_frames = len([file for file in os.listdir(video_folder) if file.endswith('.jpg')])
print(total_frames)
# Display comments
print("Commenting video: displaying comments")
display_comments(file_name)
total_time = time.time() - start_time  # Measure the total processing time
print(f"Response Time: {response_time:.2f} seconds")  # Print response time
print(f"Total Time: {total_time:.2f} seconds")  # Print total time

The controller has completed its task of producing content. However, depending on your project, you can introduce dynamic RAG for some or all the videos. If you need this functionality, you can apply the process described in Chapter 5, Boosting RAG Performance with Expert Human Feedback, to the Commentator’s outputs, including the cosine similarity quality control metrics, as we will in the Pipeline 3: The Video Expert section of this chapter.
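
If you do add that quality control step, a lightweight way to prototype it is to score each generated comment against a reference description with cosine similarity. The following is a minimal sketch using TF-IDF vectors from scikit-learn as a stand-in for the quality control metric mentioned above; the example strings are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def comment_similarity(generated_comment, reference_text):
    # Vectorize both texts and return their cosine similarity (0 to 1)
    vectors = TfidfVectorizer().fit_transform([generated_comment, reference_text])
    return cosine_similarity(vectors[0], vectors[1])[0][0]

score = comment_similarity(
    "A skier is going down a snowy slope.",
    "A group of people skiing down a slope."
)
print(f"Cosine similarity: {score:.2f}")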

The controller can also save the comments and frames.

Saving comments

To save the comments, set save=True. To save the frames, set save_frames=True. Set both values to False if you just want to run the program and view the outputs; in our case, we will set them to True:

# Ensure the file exists and double checking before saving the comments
save=True        # double checking before saving the comments
save_frames=True # double checking before saving the frames

The comments are saved in CSV format at cpath, which contains the file name with a .csv extension, in the location of your choice. In this case, the files are saved on Google Drive (make sure the path exists):

# Save comments
if save==True:  # double checking before saving the comments
  # Append .csv to the provided file name to create the complete file name
  cpath = f"{file_name}.csv"
  if os.path.exists(cpath):
      # Use the Python variable 'path' correctly in the shell command
      !cp {cpath} /content/drive/MyDrive/files/comments/{cpath}
      print(f"File {cpath} copied successfully.")
  else:
      print(f"No such file: {cpath}")

The output confirms that a file is saved:

File alpinist1.mp4.csv copied successfully.

The frames are saved in a directory named after the video file’s root name combined with its extension; the period is stripped from the extension with root_name = root_name + extension.strip('.'):

# Save frames
import shutil
if save_frames==True:
  # Extract the root name by removing the extension
  root_name, extension = os.path.splitext(file_name)
  # This removes the period from the extension
  root_name = root_name + extension.strip('.')
  # Path where you want to copy the jpg files
  target_directory = f'/content/drive/MyDrive/files/comments/{root_name}'
  # Ensure the directory exists
  os.makedirs(target_directory, exist_ok=True)
  # Assume your jpg files are in the current directory. Modify this as needed
  source_directory = os.getcwd()  # or specify a different directory
  # List all jpg files in the source directory
  for file in os.listdir(source_directory):
      if file.endswith('.jpg'):
        shutil.copy(os.path.join(source_directory, file), target_directory)

The output is a directory with all the frames generated in it. We should delete the files if the controller runs in a loop over all the videos in a single session.

Deleting files

To delete the files, just set delf=True:

delf=False  # double checking before deleting the files in a session
if delf==True:
  !rm -f *.mp4 # video files
  !rm -f *.jpg # frames
  !rm -f *.csv # comments

You can now process an unlimited number of videos one by one and scale to whatever size you wish, as long as you have disk space and a CPU!
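
As a closing illustration, here is a minimal, hypothetical sketch of how the controller steps could be wrapped in a loop to process every video in lfiles automatically, as suggested in the Pipeline 1 controller section. The display steps are omitted, and the cleanup at the end mirrors the Deleting files cell, assuming the frames and videos are written to the notebook's working directory:

# Hypothetical batch controller: loops over the dataset and reuses the functions above
directory = "Chapter10/videos"
for file_name in lfiles:
    print(f"Processing {file_name}")
    # Step 1: download the video from the GitHub dataset
    download(directory, file_name)
    # Step 2: split the video into frames
    split_file(file_name)
    # Step 3: comment on a sample of the frames
    generate_openai_comments(file_name)
    # Clean up the frames and video before the next iteration
    for f in os.listdir():
        if f.endswith('.jpg') or f.endswith('.mp4'):
            os.remove(f)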
