Pipeline 1: Generator and Commentator
A revolution is on its way in computer vision with automated video generation and analysis. We will introduce the Generator AI agent with Sora in The AI-generated video dataset section, where we will explore how OpenAI Sora was used to generate the videos for this chapter with a text-to-video diffusion transformer. The technology itself is something we have expected and, to some extent, already experienced in professional filmmaking environments. The novelty, however, lies in the fact that the software has become mainstream and is only a few clicks away, with inVideo, for example!
In The Generator and the Commentator section, we will extend the scope of the Generator to collecting and processing the AI-generated videos. The Generator splits the videos into frames and works with the Commentator, an OpenAI LLM, to produce comments on samples of video frames.
The Generator’s task begins by producing the AI-generated video dataset.
The AI-generated video dataset
The first AI agent in this project is a text-to-video diffusion transformer model that generates the video dataset we will implement. The videos for this chapter were generated by Sora, a text-to-video AI model released by OpenAI in February 2024. You can access Sora to view public AI-generated videos and create your own at https://ai.invideo.io/. AI-generated videos also come with flexible copyright terms, which you can check at https://invideo.io/terms-and-conditions/.
Once you have gone through this chapter, you can also create your own video dataset with any source of videos, such as smartphones, video stocks, and social media.
AI-generated videos enhance the speed of creating video datasets. Teams do not have to spend time finding videos that fit their needs. They can obtain a video quickly with a prompt that can be an idea expressed in a few words. AI-generated videos represent a huge leap into the future of AI applications. Sora’s potential applies to many industries, including filmmaking, education, and marketing. Its ability to generate nuanced video content from simple text prompts opens new avenues for creative and educational outputs.
Although AI-generated videos (and, in particular, diffusion transformers) have changed the way we create world simulations, they represent a risk for jobs in many areas, such as filmmaking. The risk of deepfakes and misinformation is also real. At a personal level, we must take ethical considerations into account when we implement generative AI in a project, so that we produce constructive, ethical, and realistic content.
Let’s see how a diffusion transformer can produce realistic content.
How does a diffusion transformer work?
At the core of Sora, as described by Liu et al., 2024 (see the References section), is a diffusion transformer model that operates between an encoder and a decoder. It uses user text input to guide the content generation, associating it with patches from the encoder. The model iteratively refines these noisy latent representations, enhancing their clarity and coherence. Finally, the refined data is passed to the decoder to reconstruct high-fidelity video frames. The technology involved includes vision transformers such as CLIP and LLMs such as GPT-4, as well as other components OpenAI continually includes in its vision model releases.
The encoder and decoder are integral components of the overall diffusion model, as illustrated in Figure 10.3. They both play a critical role in the workflow of the transformer diffusion model:
- Encoder: The encoder’s primary function is to compress input data, such as images or videos, into a lower-dimensional latent space. The encoder thus transforms high-dimensional visual data into a compact representation while preserving crucial information. The lower-dimensional latent space obtained is a compressed representation of high-dimensional data, retaining essential features while reducing complexity. For example, a high-resolution image (1024x1024 pixels, 3 color channels) can be compressed by an encoder into a vector of 1,000 values, capturing key details such as shape and texture. This makes processing and manipulating images more efficient.
- Decoder: The decoder reconstructs the original data from the latent representation produced by the encoder. It performs the encoder’s reverse operation, transforming the low-dimensional latent space back into high-dimensional pixel space, thus generating the final output, such as images or videos.
Figure 10.3: The encoding and decoding workflow of video diffusion models
The process of a diffusion transformer model goes through five main steps, as you can observe in the previous figure (a toy sketch of this flow follows the list):
- The visual encoder transforms datasets of images into a lower-dimensional latent space.
- The visual encoder splits the lower-dimensional latent space into patches that are like words in a sentence.
- The diffusion transformer associates user text input with its dictionary of patches.
- The diffusion transformer iteratively refines the noisy latent representations to produce coherent frames.
- The visual decoder reconstructs the refined latent representations into high-fidelity video frames that align with the user’s instructions.
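To make these steps concrete, here is a toy, non-learned sketch of the data flow using random NumPy tensors as stand-ins. None of this is Sora’s actual code; the shapes, function names, and the three-step denoising loop are illustrative assumptions only:
import numpy as np
def visual_encoder(frames):
    # Pretend compression of (T, H, W, 3) pixels into a (T, 32, 32, 4) latent space
    return np.random.rand(frames.shape[0], 32, 32, 4)
def patchify(latent, patch=8):
    # Naive reshape into patch-like rows (a real model tiles spacetime patches)
    T, h, w, c = latent.shape
    return latent.reshape(T * (h // patch) * (w // patch), patch * patch * c)
def diffusion_transformer(patches, text_prompt, steps=3):
    # Iteratively "denoise" random latents, conditioned on the text prompt
    noisy = np.random.randn(*patches.shape)
    for step in range(steps):
        noisy = noisy - 0.3 * (noisy - patches)  # toy refinement toward coherence
        print(f"Denoising step {step + 1}, conditioned on: {text_prompt}")
    return noisy
def visual_decoder(refined, T, H=256, W=256):
    # Pretend reconstruction of the refined latents back into pixel space
    return np.random.rand(T, H, W, 3)
frames = np.random.rand(16, 256, 256, 3)   # a fake 16-frame clip
latent = visual_encoder(frames)            # step 1
patches = patchify(latent)                 # step 2
refined = diffusion_transformer(patches, "a skier carving down a slope")  # steps 3 and 4
video = visual_decoder(refined, T=16)      # step 5
print("Reconstructed video shape:", video.shape)
In the real model, each of these stand-ins is a learned network trained on large volumes of visual data.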
The video frames can then be played in a sequence. Every second of a video contains a set of frames. We will be deconstructing the AI-generated videos into frames and commenting on these frames later. But for now, we will analyze the video dataset produced by the diffusion transformer.
Analyzing the diffusion transformer model video dataset
Open the Videos_dataset_visualization.ipynb notebook on GitHub. Hopefully, you have installed the environment as described earlier in this chapter. We will move on to writing the download and display functions we need.
Video download and display functions
The three main functions each take file_name (the name of the video file) as an argument. They download a video, display it, and display frames from it.
download_video downloads one video at a time from the GitHub dataset, calling the download function defined in the GitHub subsection of The environment:
# downloading file from GitHub
def download_video(file_name):
    # Define your variables
    directory = "Chapter10/videos"
    download(directory, file_name)
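The download helper itself is defined in the GitHub subsection of The environment and is not reproduced here. As a rough idea only, a minimal helper along these lines would fetch a file over HTTP with requests; the base URL below is a hypothetical placeholder, not the chapter’s actual repository path:
import requests
def download(directory, filename):
    # Hypothetical base URL; replace it with your own repository's raw file path
    base_url = "https://raw.githubusercontent.com/<user>/<repo>/main"
    url = f"{base_url}/{directory}/{filename}"
    response = requests.get(url)
    response.raise_for_status()  # Stop if the file cannot be fetched
    with open(filename, "wb") as f:
        f.write(response.content)
    print(f"Downloaded '{filename}' successfully.")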
display_video(file_name) displays the downloaded video file by first encoding it in base64, a binary-to-text encoding scheme that represents binary data in ASCII string format. Then, the encoded video is displayed in HTML:
# Open the file in binary mode
def display_video(file_name):
    with open(file_name, 'rb') as file:
        video_data = file.read()
    # Encode the video file as base64
    video_url = b64encode(video_data).decode()
    # Create an HTML string with the embedded video
    html = f'''
    <video width="640" height="480" controls>
      <source src="data:video/mp4;base64,{video_url}" type="video/mp4">
      Your browser does not support the video tag.
    </video>
    '''
    # Display the video
    HTML(html)
    # Return the HTML object
    return HTML(html)
display_video_frame takes file_name, frame_number, and size (the image size to display) as arguments to display a frame of the video. The function first opens the video file and then extracts the frame set by frame_number:
def display_video_frame(file_name, frame_number, size):
    # Open the video file
    cap = cv2.VideoCapture(file_name)
    # Move to the frame_number
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
    # Read the frame
    success, frame = cap.read()
    if not success:
        return "Failed to grab frame"
The function converts the frame from BGR (blue, green, and red) to RGB (red, green, and blue) channel order, converts the OpenCV image array to a PIL image, and resizes it with the size parameter:
    # Convert the color from BGR to RGB
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Convert to PIL image and resize
    img = Image.fromarray(frame)
    img = img.resize(size, Image.LANCZOS)  # Resize image to specified size
Finally, the function encodes the image as a base64 string and displays it in HTML:
    # Convert the PIL image to a base64 string to embed in HTML
    buffered = BytesIO()
    img.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    # Create an HTML string with the embedded image
    html_str = f'''
    <img src="data:image/jpeg;base64,{img_str}" width="{size[0]}" height="{size[1]}">
    '''
    # Display the image
    display(HTML(html_str))
    # Return the HTML object for further use if needed
    return HTML(html_str)
Once the environment is installed and the video processing functions are ready, we will display the introduction video.
Introduction video (with audio)
The following cells download and display the introduction video using the functions we created in the previous section. A video file is selected and downloaded with the download_video function:
# select file
print("Collecting video")
file_name="AI_Professor_Introduces_New_Course.mp4"
#file_name = "AI_Professor_Introduces_New_Course.mp4" # Enter the name of the video file to process here
print(f"Video: {file_name}")
# Downloading video
print("Downloading video: downloading from GitHub")
download_video(file_name)
The output confirms the selection and download status:
Collecting video
Video: AI_Professor_Introduces_New_Course.mp4
Downloading video: downloading from GitHub
Downloaded 'AI_Professor_Introduces_New_Course.mp4' successfully.
We can choose to display only a single frame of the video as a thumbnail with the display_video_frame function by providing the file name, the frame number, and the image size to display. The program will first compute frame_count (the number of frames in the video), frame_rate (the number of frames per second), and video_duration (the duration of the video). Then, it will make sure frame_number (the frame we want to display) doesn’t exceed frame_count. Finally, it displays the frame as a thumbnail:
print("Displaying a frame of video: ",file_name)
video_capture = cv2.VideoCapture(file_name)
frame_count = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
print(f'Total number of frames: {frame_count}')
frame_rate = video_capture.get(cv2.CAP_PROP_FPS)
print(f"Frame rate: {frame_rate}")
video_duration = frame_count / frame_rate
print(f"Video duration: {video_duration:.2f} seconds")
video_capture.release()
print(f'Total number of frames: {frame_count}')
frame_number=5
if frame_number > frame_count and frame_count>0:
frame_number = 1
display_video_frame(file_name, frame_number, size=(135, 90));
Here, frame_number is set to 5, but you can choose another value. The output shows the information on the video and the thumbnail:
Displaying a frame of video: /content/AI_Professor_Introduces_New_Course.mp4
Total number of frames: 340
We can also display the full video if needed:
#print("Displaying video: ",file_name)
display_video(file_name)
The video will be displayed and can be played with the audio track:
Figure 10.4: AI-generated video
Let’s describe and display the AI-generated videos in the /videos directory of this chapter’s GitHub directory. You can host this dataset in another location and scale it to the volume that meets your project’s specifications. The educational video dataset of this chapter is listed in lfiles:
lfiles = [
"jogging1.mp4",
"jogging2.mp4",
"skiing1.mp4",
…
"female_player_after_scoring.mp4",
"football1.mp4",
"football2.mp4",
"hockey1.mp4"
]
We can now move on and display any video we wish.
Displaying thumbnails and videos in the AI-generated dataset
This section is a generalization of the Introduction video (with audio) section. This time, instead of downloading one video, it downloads all the videos and displays the thumbnails of all the videos. You can then select a video in the list and display it.
The program first calculates the number of videos in the list:
lf = len(lfiles)
It then collects the video dataset:
for i in range(lf):
    file_name = lfiles[i]
    print("Collecting video", file_name)
    print("Downloading video", file_name)
    download_video(file_name)
The output shows the file names of the downloaded videos:
Collecting video jogging1.mp4
Downloading video jogging1.mp4
Downloaded 'jogging1.mp4' successfully.
Collecting video jogging2.mp4…
The program then goes through the list, displays the information for each video, and displays its thumbnail:
for i in range(lf):
    file_name = lfiles[i]
    print("Displaying a frame of video: ", file_name)
    video_capture = cv2.VideoCapture(file_name)
    frame_count = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_rate = video_capture.get(cv2.CAP_PROP_FPS)
    video_capture.release()
    print(f'Total number of frames: {frame_count}')
    print(f"Frame rate: {frame_rate}")
    print(f"Video duration: {frame_count / frame_rate:.2f} seconds")
    display_video_frame(file_name, frame_number=5, size=(100, 110))
The information on the video and its thumbnail is displayed:
Displaying a frame of video: skiing1.mp4
Total number of frames: 58
Frame rate: 30.0
Video duration: 1.93 seconds
You can select a video in the list and display it:
file_name="football1.mp4" # Enter the name of the video file to process here
#print("Displaying video: ",file_name)
display_video(file_name)
You can click on the video and watch it:
Figure 10.5: Video of a football player
We have explored how the AI-generated videos were produced and visualized the dataset. We are now ready to build the Generator and the Commentator.
The Generator and the Commentator
The dataset of AI-generated videos is ready. We will now build the Generator and the Commentator, which process one video at a time, making scaling seamless. An indefinite number of videos can be processed one at a time, requiring only a CPU and limited disk space. The Generator and the Commentator work together, as shown in Figure 10.6. These AI agents will produce raw videos from text and then split them into frames that they will comment on:
Figure 10.6: The Generator and the Commentator work together to comment on video frames
The Generator and the Commentator produce the required commented frames in four main steps that we will build in Python:
- The Generator generates the text-to-video (inVideo) video dataset based on the video production team’s text input. In this chapter, it is a dataset of sports videos.
- The Generator runs a scaled process by selecting one video at a time.
- The Generator splits the video into frames (images).
- The Commentator samples frames (images) and comments on them with an OpenAI LLM. Each commented frame is saved with:
  - Unique ID
  - Comment
  - Frame
  - Video file name
We will now build the Generator and the Commentator in Python, starting with the AI-generated videos. Open Pipeline_1_The_Generator_and_the_Commentator.ipynb in the chapter’s GitHub directory. See The environment section of this chapter for a description of the Installing the environment section of this notebook. The process of going from a video to comments on a sample of frames takes only three straightforward steps in Python:
- Displaying the video
- Splitting the video into frames
- Commenting on the frames
We will define functions for each step and call them in the Pipeline-1 Controller section of the program. The first step is to define a function to display a video.
Step 1. Displaying the video
The download function is in the GitHub subsection of the Installing the environment section of this notebook. It will be called by the Vector Store Administrator-Pipeline 1 in the Administrator-Pipeline 1 section of this notebook on GitHub. display_video(file_name) is the same as defined in the previous section, The AI-generated video dataset:
# Open the file in binary mode
def display_video(file_name):
    with open(file_name, 'rb') as file:
        video_data = file.read()
    …
    # Return the HTML object
    return HTML(html)
The downloaded video will now be split into frames.
Step 2. Splitting video into frames
The split_file(file_name) function extracts frames from a video, as in the previous section, The AI-generated video dataset. However, in this case, we will expand the function to save the frames as JPEG files:
def split_file(file_name):
    video_path = file_name
    cap = cv2.VideoCapture(video_path)
    frame_number = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        cv2.imwrite(f"frame_{frame_number}.jpg", frame)
        frame_number += 1
        print(f"Frame {frame_number} saved.")
    cap.release()
We have split the video into frames and saved them as JPEG images with their respective frame numbers, frame_number. The Generator’s job finishes here, and the Commentator now takes over.
Step 3. Commenting on the frames
The Generator has gone from text-to-video to splitting the video and saving the frames as JPEG images. The Commentator now takes over to comment on the frames with three functions:
- generate_openai_comments(filename) asks the GPT-4 series vision model to analyze a frame and produce a response that contains a comment describing the frame
- generate_comment(response_data) extracts the comment from the response
- save_comment(comment, frame_number, file_name) saves the comment
We need to build the Commentator’s extraction function first:
def generate_comment(response_data):
    """Extract relevant information from GPT-4 Vision response."""
    try:
        caption = response_data.choices[0].message.content
        return caption
    except (KeyError, AttributeError):
        print("Error extracting caption from response.")
        return "No caption available."
We then write a function to save the extracted comment in a CSV file that bears the same name as the video file:
def save_comment(comment, frame_number, file_name):
    """Save the comment to a text file formatted for seamless loading into a pandas DataFrame."""
    # Append .csv to the provided file name to create the complete file name
    path = f"{file_name}.csv"
    # Check if the file exists to determine if we need to write headers
    write_header = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        if write_header:
            writer.writerow(['ID', 'FrameNumber', 'Comment', 'FileName'])  # Write the header if the file is being created
        # Generate a unique UUID for each comment
        unique_id = str(uuid.uuid4())
        # Write the data
        writer.writerow([unique_id, frame_number, comment, file_name])
The goal is to save the comment in a format that can directly be upserted to Pinecone (a minimal sketch of this mapping follows the list):
- ID: A unique string ID generated with str(uuid.uuid4())
- FrameNumber: The frame number of the commented JPEG
- Comment: The comment generated by the OpenAI vision model
- FileName: The name of the video file
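As an illustration only (this is not the chapter’s pipeline code), one row of such a CSV file could be shaped into an upsert-ready record as follows; the embed() stub and the CSV file name are hypothetical placeholders, and the actual embedding and upsert happen in a later pipeline:
import pandas as pd
def embed(text):
    return [0.0] * 1536  # placeholder vector; a real embedding model goes here
df = pd.read_csv("skiing1.mp4.csv")  # example file produced by save_comment
records = [
    {
        "id": row["ID"],                  # unique UUID string
        "values": embed(row["Comment"]),  # vector used for similarity search
        "metadata": {
            "frame_number": int(row["FrameNumber"]),
            "comment": row["Comment"],
            "file_name": row["FileName"],
        },
    }
    for _, row in df.iterrows()
]
print(f"{len(records)} records ready to upsert")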
The Commentator’s main function is to generate comments with the OpenAI vision model. However, in this program’s scenario, we will not comment on all the frames but only on a sample of them. The program first determines the number of frames to process:
def generate_openai_comments(filename):
    video_folder = "/content"  # Folder containing your image frames
    total_frames = len([file for file in os.listdir(video_folder) if file.endswith('.jpg')])
Then, a sample frequency is set, along with a counter; with nb=3, roughly one frame in three is sent to the model:
    nb = 3       # sample frequency
    counter = 0  # sample frequency counter
The Commentator will then go through the sampled frames and request a comment:
    for frame_number in range(total_frames):
        counter += 1  # sampler
        if counter == nb and counter < total_frames:
            counter = 0
            print(f"Analyzing frame {frame_number}...")
            image_path = os.path.join(video_folder, f"frame_{frame_number}.jpg")
            try:
                with open(image_path, "rb") as image_file:
                    image_data = image_file.read()
                response = openai.ChatCompletion.create(
                    model="gpt-4-vision-preview",
The message is very concise: "What is happening in this image?"
The message also includes the image of the frame:
                    messages=[
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": "What is happening in this image?"},
                                {
                                    "type": "image",
                                    "image_url": f"data:image/jpeg;base64,{base64.b64encode(image_data).decode('utf-8')}"
                                },
                            ],
                        }
                    ],
                    max_tokens=150,
                )
Once a response is returned, the generate_comment and save_comment functions are called to extract and save the comment, respectively:
                comment = generate_comment(response)
                save_comment(comment, frame_number, filename)
            except FileNotFoundError:
                print(f"Error: Frame {frame_number} not found.")
            except Exception as e:
                print(f"Unexpected error: {e}")
The final function we require of the Commentator is to display the comments by loading the CSV file produced into a pandas DataFrame:
# Read the video comments file into a pandas DataFrame
def display_comments(file_name):
    # Append .csv to the provided file name to create the complete file name
    path = f"{file_name}.csv"
    df = pd.read_csv(path)
    return df
The function returns the DataFrame with the comments. An administrator controls Pipeline 1, the Generator, and the Commentator.
Pipeline 1 controller
The controller runs jobs for the preceding three steps of the Generator and the Commentator. It begins with Step 1, which includes selecting a video, downloading it, and displaying it. In an automated pipeline, these functions can be separated. For example, a script would iterate through a list of videos, automatically select each one, and encapsulate the controller functions. In this case, in a pre-production and educational context, we will collect, download, and display the videos one by one:
session_time = time.time() # Start timing before the request
# Step 1: Displaying the video
# select file
print("Step 1: Collecting video")
file_name = "skiing1.mp4" # Enter the name of the video file to process here
print(f"Video: {file_name}")
# Downloading video
print("Step 1:downloading from GitHub")
directory = "Chapter10/videos"
download(directory,file_name)
# Displaying video
print("Step 1:displaying video")
display_video(file_name)
The controller then splits the video into frames:
# Step 2.Splitting video
print("Step 2: Splitting the video into frames")
split_file(file_name)
The controller then activates the Commentator to produce comments on the frames of the video:
# Step 3.Commenting on the video frames
print("Step 3: Commenting on the frames")
start_time = time.time()  # Start timing before the request
generate_openai_comments(file_name)
response_time = time.time() - start_time  # Measure the comment generation time
The response time is measured as well. The controller then adds additional outputs to display the number of frames, the comments, the content generation time, and the total controller processing time:
# number of frames
video_folder = "/content"  # Folder containing your image frames
total_frames = len([file for file in os.listdir(video_folder) if file.endswith('.jpg')])
print(total_frames)
# Display comments
print("Commenting video: displaying comments")
display_comments(file_name)
total_time = time.time() - session_time  # Measure the total controller processing time
print(f"Response Time: {response_time:.2f} seconds")  # Print response time
print(f"Total Time: {total_time:.2f} seconds")  # Print total time
The controller has completed its task of producing content. However, depending on your project, you can introduce dynamic RAG for some or all the videos. If you need this functionality, you can apply the process described in Chapter 5, Boosting RAG Performance with Expert Human Feedback, to the Commentator’s outputs, including the cosine similarity quality control metrics, as we will in the Pipeline 3: The Video Expert section of this chapter.
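As a quick reminder of the metric itself (a generic sketch, not the code used in Pipeline 3), cosine similarity scores how close a generated comment is to a reference text; the two example sentences below are hypothetical:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
comment = "A skier is carving down a snowy slope."
reference = "A person is skiing downhill on snow."
# Vectorize both texts and compute the cosine of the angle between them
vectors = TfidfVectorizer().fit_transform([comment, reference])
score = cosine_similarity(vectors[0], vectors[1])[0][0]
print(f"Cosine similarity: {score:.2f}")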
The controller can also save the comments and frames.
Saving comments
To save the comments, set save=True. To save the frames, set save_frames=True. Set both values to False if you just want to run the program and view the outputs; in our case, we will set them to True:
# Ensure the file exists and double checking before saving the comments
save=True # double checking before saving the comments
save_frames=True # double checking before saving the frames
The comments are saved in CSV format in cpath, which contains the file name with the .csv extension, in the location of your choice. In this case, the files are saved on Google Drive (make sure the path exists):
# Save comments
if save == True:  # double checking before saving the comments
    # Append .csv to the provided file name to create the complete file name
    cpath = f"{file_name}.csv"
    if os.path.exists(cpath):
        # Use the Python variable 'cpath' in the shell command
        !cp {cpath} /content/drive/MyDrive/files/comments/{cpath}
        print(f"File {cpath} copied successfully.")
    else:
        print(f"No such file: {cpath}")
The output confirms that a file is saved:
File alpinist1.mp4.csv copied successfully.
The frames are saved in a directory named after the video file. The directory name is built from the file’s root name plus its extension stripped of the period with root_name = root_name + extension.strip('.'):
# Save frames
import shutil
if save_frames == True:
    # Extract the root name by removing the extension
    root_name, extension = os.path.splitext(file_name)
    # This removes the period from the extension
    root_name = root_name + extension.strip('.')
    # Path where you want to copy the jpg files
    target_directory = f'/content/drive/MyDrive/files/comments/{root_name}'
    # Ensure the directory exists
    os.makedirs(target_directory, exist_ok=True)
    # Assume your jpg files are in the current directory. Modify this as needed
    source_directory = os.getcwd()  # or specify a different directory
    # List all jpg files in the source directory
    for file in os.listdir(source_directory):
        if file.endswith('.jpg'):
            shutil.copy(os.path.join(source_directory, file), target_directory)
The output is a directory with all the frames generated in it. We should delete the files if the controller runs in a loop over all the videos in a single session.
Deleting files
To delete the files, just set delf=True:
delf = False  # double checking before deleting the files in a session
if delf == True:
    !rm -f *.mp4  # video files
    !rm -f *.jpg  # frames
    !rm -f *.csv  # comments
You can now process an unlimited number of videos one by one and scale to whatever size you wish, as long as you have disk space and a CPU!