Audio Summarize Any YouTube Video with Python, ChatGPT, and AWS

Introduction

Have you ever wished you could just listen to the summary of a long YouTube video instead of watching the whole thing? Well, you're in luck! In this article, I’ll be showcasing a fun little Python project that I’ve been working on, which allows you to do just that.

Don’t get me wrong: YouTube is a great resource for learning about new technologies and keeping you up to date with the latest news. And best of all: it’s free. But sometimes, I tend to lose track of time in the myriad of videos out there, fast forwarding through long talks only to find out in the end that the information I’m looking for is not in the video ☹

Well, if you often find yourself in a similar situation, here’s a potential tool you might like. This little script downloads the audio from a YouTube video, transcribes it, summarizes it using AI and finally generates a new audio file with the summary. All this magic is done using the OpenAI GPT-3.5-turbo API and some cool AWS services (S3, Transcribe, and Polly), in fewer than 80 lines of code.

For those who might be unfamiliar with these APIs, here is their purpose in the script:

OpenAI's GPT-3.5-turbo provides programmatic access to the same advanced language model used by ChatGPT. Its purpose in the script is summarizing the transcribed video content.
AWS S3 is a storage service where we temporarily store the audio file from the YouTube video and the text summary. We need an S3 bucket because AWS Transcribe reads the media file it transcribes from S3.
AWS Transcribe is used to convert the audio file into text.
AWS Polly is a service that turns text into lifelike speech. We use it to generate an audio file of the summary.

Logic Diagram

[Figure: logic diagram of the script]

Disclaimer

Before you start using these services, be aware that both AWS and OpenAI have usage quotas and costs associated with them. Make sure to familiarize yourself with these to avoid any unexpected charges. You’ll probably fall well within the limits of your Amazon account’s free tier unless you start summarizing hundreds of videos.

Also, you might consider adding error handling to the code. I’ve left it out of this demo to keep the code short.

You can download the Python file for this code from GitHub here.

Configuring the APIs

Store your OpenAI API key and AWS credentials in local environment variables; the code assumes that both are valid and already set. The script reads the OpenAI key from the OPENAI_API_KEY variable. Alternatively, you can store your AWS ACCESS KEY and SECRET ACCESS KEY in %USERPROFILE%\.aws\credentials (~/.aws/credentials on Linux and macOS).

More info on that here: https://docs.aws.amazon.com/sdkref/latest/guide/creds-config-files.html
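
As a quick sanity check before running the script, something like the following sketch can confirm that the expected variables are present. The helper name missing_env_vars is made up for this example. OPENAI_API_KEY is the variable the script reads, while AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard names boto3 looks for. If you keep your AWS keys in the credentials file instead, only the OpenAI key needs checking.

```python
import os

# Variables the script relies on when credentials live in the environment.
REQUIRED_VARS = ["OPENAI_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```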

For the code to function properly, make sure the AWS credentials you are using have the following permissions:

AmazonS3FullAccess: This is required to create and delete S3 buckets, upload files to the buckets and delete objects within the buckets.
AmazonTranscribeFullAccess: This is needed to start transcription jobs and get the transcription job results.
AmazonPollyFullAccess: This is necessary to synthesize speech from text.

The most convenient and safe approach to grant the necessary permissions is through the AWS Management Console, by attaching the relevant policies to the user or role associated with the credentials.

audio-summarize-any-youtube-video-with-python-chatgpt-and-aws-img-1

Requirements

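I’ve used Python v3.11. Make sure you first install all the requirements, or update them if already installed. One caveat: the code below calls openai.ChatCompletion.create, which belongs to the pre-1.0 interface of the openai package, so install a version below 1.0.

pip install pytube
pip install "openai<1"
pip install boto3
pip install requests
pip install python-dotenv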

The Code

Let’s break it down snippet by snippet.

Setup and Import Statements

import os
import boto3
import requests
import openai
import uuid
from pytube import YouTube

Downloading the Audio from YouTube

The download_audio function uses the pytube library to download the audio from a YouTube video. The audio file is saved locally before being uploaded to S3 by the main function. Here’s the complete documentation for pytube: https://pytube.io/en/latest/

def download_audio(video_id):
    yt = YouTube(f'https://www.youtube.com/watch?v={video_id}')
    return yt.streams.get_audio_only().download(filename=video_id)
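
Note that download_audio expects a bare video ID, not a full link. If you usually copy the URL from the browser, a small helper along these lines could extract the ID first. This is only a sketch: the name extract_video_id is hypothetical, and it handles just the two common link shapes (youtube.com/watch?v=... and youtu.be/...).

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id):
    """Return the video ID from a YouTube URL, or the input if it is already an ID."""
    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0]
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        if values:
            return values[0]
        raise ValueError(f"No video ID found in URL: {url_or_id}")
    # Anything else is assumed to be a bare video ID.
    return url_or_id
```

The result can then be passed straight to download_audio.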

Transcribing Audio to Text

The transcribe_audio function uses AWS Transcribe to convert the audio into text. The UUID (Universally Unique Identifier) module is used to generate a unique identifier for each transcription job. The benefit of using UUIDs here is that every time we run the function, a new unique job name is created. This is important because AWS Transcribe requires job names to be unique. Here’s the complete documentation of AWS Transcribe: https://docs.aws.amazon.com/transcribe/latest/dg/what-is.html

def transcribe_audio(s3, bucket, file_name):
    import time  # local import so this snippet stands on its own

    transcribe = boto3.client('transcribe')
    job_name = f"TranscriptionJob-{uuid.uuid4()}"
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': f"s3://{bucket}/{file_name}"},
        MediaFormat='mp4',
        LanguageCode='en-US'
    )

    # Poll until the job reaches a terminal state, pausing between requests
    # so we don't flood the Transcribe API.
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        job_status = status['TranscriptionJob']['TranscriptionJobStatus']
        if job_status in ['COMPLETED', 'FAILED']:
            break
        time.sleep(5)

    # The transcript URI is only available when the job completed successfully.
    if job_status == 'COMPLETED':
        return status['TranscriptionJob']['Transcript']['TranscriptFileUri']
    return None
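
The polling loop in transcribe_audio keeps going until the job reports COMPLETED or FAILED, however long that takes. If you would rather give up after a while, a generic helper like the sketch below could wrap the status check. The name wait_until_done and its parameters are assumptions made for this example, and the default interval and timeout are arbitrary.

```python
import time


def wait_until_done(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_status() until it returns COMPLETED or FAILED, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout} seconds")
        # Pause between checks; the sleep function is injectable for testing.
        sleep(interval)
```

Inside transcribe_audio, get_status could be a small lambda that calls transcribe.get_transcription_job and returns the TranscriptionJobStatus field.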

Summarizing the Transcript

The summarize_transcript function leverages OpenAI's GPT-3.5-turbo to summarize the transcript. Notice the simple prompt I’ve used for this task. I’ve kept it very short in order to save more tokens for the actual transcript. It can definitely be improved and tweaked according to your preferences. For the complete documentation of the OpenAI API, check out this link: https://platform.openai.com/docs/api-reference/introduction

def summarize_transcript(transcript):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a knowledge curator helping users to understand the contents of video transcripts."},
            {"role": "user", "content": f"Please summarize the following transcript: '{transcript}'"}
        ]
    )
    return response['choices'][0]['message']['content'].strip()
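
Because the model has a limited context window, the transcript of a long video may not fit into a single request. One possible workaround, sketched here under my own assumptions, is to split the transcript into smaller pieces first. The helper name chunk_text and the max_chars default are placeholders to tune, not values taken from the OpenAI documentation.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of up to max_chars characters, breaking on whitespace.

    A single word longer than max_chars ends up in its own oversized chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is not empty.
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            # The current chunk is full: store it and start a new one.
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be passed to summarize_transcript, and the partial summaries joined and summarized once more to get a single result.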

Synthesizing Speech from Text

The synthesize_speech function uses AWS Polly to convert the summarized text back into audio. If you prefer other voices or want to tweak different parameters such as speed, language, or dialect, here’s the complete documentation on how to use Polly: https://docs.aws.amazon.com/polly/index.html

def synthesize_speech(s3, bucket, transcript_uri):
    transcript_data = requests.get(transcript_uri).json()
    transcript = ' '.join(item['alternatives'][0]['content'] for item in transcript_data['results']['items'] if item['type'] == 'pronunciation')
 
    summary = summarize_transcript(transcript)
    summary_file_name = f"summary_{uuid.uuid4()}.txt"
    s3.put_object(Body=summary, Bucket=bucket, Key=summary_file_name)
 
    polly = boto3.client('polly')
    response = polly.synthesize_speech(OutputFormat='mp3', Text=summary, VoiceId='Matthew', Engine='neural')
 
    mp3_file_name = f"speech_{uuid.uuid4()}.mp3"
    with open(mp3_file_name, 'wb') as f:
        f.write(response['AudioStream'].read())
 
    return mp3_file_name
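
The join in synthesize_speech keeps only the items of type 'pronunciation', so punctuation is dropped before the text reaches the summarizer. If you want to keep sentence boundaries, a variant like the following sketch attaches each punctuation item to the word before it. The function name transcript_with_punctuation is hypothetical, and the sample data is hand-written in the shape of a Transcribe result rather than real output.

```python
def transcript_with_punctuation(transcript_data):
    """Join Transcribe items into text, attaching punctuation to the preceding word."""
    parts = []
    for item in transcript_data["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation" and parts:
            # Glue the punctuation mark onto the previous word.
            parts[-1] += content
        else:
            parts.append(content)
    return " ".join(parts)


# Hand-written sample in the shape of a Transcribe result (not real output).
sample = {"results": {"items": [
    {"type": "pronunciation", "alternatives": [{"content": "Hello"}]},
    {"type": "punctuation", "alternatives": [{"content": ","}]},
    {"type": "pronunciation", "alternatives": [{"content": "world"}]},
    {"type": "punctuation", "alternatives": [{"content": "."}]},
]}}

print(transcript_with_punctuation(sample))  # Hello, world.
```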

The Clean-up of the S3 Bucket

To keep our storage in check and avoid littering the cloud, it’s best to clean up all objects from the bucket. We’ll be able to delete the bucket completely once the audio summary has been downloaded locally.

Remember, we only needed the S3 bucket because AWS Transcribe reads its input audio from S3.

def delete_all_objects(bucket_name):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    bucket.objects.all().delete()

The Main Function

And finally, the main function, which ties everything together. It specifies the YouTube video to summarize (which you can obviously change to any other video ID), sets up the necessary AWS services and calls the functions defined above in the correct order. It also cleans up by deleting the S3 bucket after use.

def main():
    video_id = 'U3PiD-g7XJM' #change to any other Video ID from YouTube
   
    bucket = f"bucket-{uuid.uuid4()}"
    file_name = f"{video_id}.mp4"
 
    openai.api_key = os.getenv('OPENAI_API_KEY')
 
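    s3 = boto3.client('s3')
    # Note: in any region other than us-east-1, create_bucket also needs
    # CreateBucketConfiguration={'LocationConstraint': '<your-region>'}.
    s3.create_bucket(Bucket=bucket)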
 
    print("Downloading audio stream from YouTube video...")
    audio_file = download_audio(video_id)
    print("Uploading audio to S3 bucket...")
    s3.upload_file(audio_file, bucket, file_name)
    print("Transcribing audio...")
    transcript_uri = transcribe_audio(s3, bucket, file_name)
    print("Synthesizing speech...")
    mp3_file_name = synthesize_speech(s3, bucket, transcript_uri)
    print(f"Audio summary saved in: {mp3_file_name}\n")
 
    delete_all_objects(bucket)
    s3.delete_bucket(Bucket=bucket)
 
if __name__ == "__main__":
    main()
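
One thing to keep in mind: if any step in main fails, the clean-up at the end never runs and the bucket is left behind. A possible way to guard against that, shown here as a sketch rather than as part of the original script, is a small context manager that always runs the clean-up. The name temporary_resource is made up for this example, and the lambdas are stand-ins so the sketch runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def temporary_resource(create, destroy):
    """Call create(), yield its result, and always call destroy() afterwards."""
    resource = create()
    try:
        yield resource
    finally:
        # Runs whether the body finished normally or raised.
        destroy(resource)


# Stand-ins for the real bucket calls, so the sketch runs without AWS access.
log = []
with temporary_resource(lambda: "demo-bucket",
                        lambda name: log.append(f"deleted {name}")) as bucket:
    log.append(f"using {bucket}")
print(log)  # ['using demo-bucket', 'deleted demo-bucket']
```

In the real script, create could call s3.create_bucket and return the bucket name, while destroy could call delete_all_objects followed by s3.delete_bucket.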

And that's it! With this simple tool you can now convert any YouTube video into a summarized audio file.

So, sit back, relax and let AI do the work for you.

Enjoy!

About the Author

Andrei Gheorghiu is an experienced trainer with a passion for helping learners achieve their maximum potential. He always strives to bring a high level of expertise and empathy to his teaching.

With a background in IT audit, information security, and IT service management, Andrei has delivered training to over 10,000 students across different industries and countries. He is also a Certified Information Systems Security Professional and Certified Information Systems Auditor, with a keen interest in digital domains like Security Management and Artificial Intelligence.

In his free time, Andrei enjoys trail running, photography, video editing and exploring the latest developments in technology.
