Handling audio data
A lot of work is happening in the audio processing space, with some of the most significant advancements coming from automatic speech recognition (ASR) models. These models transform spoken language into written text, allowing the seamless integration of voice inputs into text-based workflows and making the audio easier to analyze, search, and interact with. For instance, voice assistants such as Siri and Google Assistant rely on ASR to understand and respond to user commands, while transcription services convert meeting recordings into searchable text documents.
This conversion allows the passing of text input to LLMs to unlock powerful capabilities, such as sentiment analysis, topic modeling, automated summarization, and even supporting chat applications. For example, customer service call centers can use ASR to transcribe conversations, which can then be analyzed for customer sentiment or common issues, improving service quality and efficiency.
Handling audio data as text not only enhances accessibility and usability but also facilitates more efficient data storage and retrieval. Text data takes up less space than audio files and is easier to index and search. Moreover, it bridges the gap between spoken and written communication, enabling more natural and intuitive user interactions across various platforms and devices. For instance, integrating ASR in educational apps can help students with disabilities access spoken content in a text format, making learning more inclusive.
As ASR technologies continue to improve, the ability to accurately and efficiently convert audio to text will become increasingly important, driving innovation and expanding the potential of AI-driven solutions. Enhanced ASR models will further benefit areas such as real-time translation services, automated note-taking in professional settings, and accessibility tools for individuals with hearing impairments, showcasing the broad and transformative impact of this technology.
In the next section, we will discuss the Whisper model, which is effective for transforming audio into text and performing a range of audio processing tasks.
Using Whisper for audio-to-text conversion
The Whisper model from OpenAI is a powerful tool for transforming audio to text and serves as a base for many modern AI and ML applications. The applications range from real-time transcription and customer service to healthcare and education, showcasing its versatility and importance in the evolving landscape of audio processing technology:
- Whisper can be integrated into voice assistant systems, such as Siri, Google Assistant, and Alexa, to accurately transcribe user commands and queries.
- Call centers can use Whisper to transcribe customer interactions, allowing for sentiment analysis, quality assurance, and topic detection, thereby enhancing service quality.
- Platforms such as YouTube and podcast services can use Whisper to generate subtitles and transcriptions, improving accessibility and content indexing.
- Whisper can be used in real-time transcription services for meetings, lectures, and live events. This helps create accurate text records that are easy to search and analyze later.
- In telemedicine, Whisper can transcribe doctor-patient conversations accurately, facilitating better record-keeping and analysis. Moreover, it can assist in creating automated medical notes from audio recordings.
- Educational platforms can use Whisper to transcribe lectures and tutorials, providing students with written records of spoken content, enhancing learning and accessibility.
- Security systems use direct audio processing to verify identity based on unique vocal characteristics, offering a more secure and non-intrusive method of authentication.
As a pretrained model, Whisper can be used out of the box for many tasks, reducing the need for extensive fine-tuning and allowing for quick integration into various applications. The model supports multiple languages, making it versatile for global applications and diverse user bases. While Whisper primarily focuses on transforming audio to text, it also benefits from advancements in handling audio signals, potentially capturing nuances, such as tone and emotion. Although direct audio processing (such as emotion detection or music analysis) might require additional specialized models, Whisper’s robust transcription capability is foundational for many applications.
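To illustrate the "out of the box" point, here is a minimal sketch using the Hugging Face transformers pipeline API; the model name matches the one used later in this chapter, while the audio path is a placeholder you would replace with your own file:
from transformers import pipeline

# High-level ASR pipeline; chunk_length_s lets it handle clips longer than 30 seconds
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
)
result = asr("path/to/your_audio.mp3")  # placeholder audio file
print(result["text"])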
Using some audio from the Vector Lab (@VectorLab) videos, we will run the audio through Whisper to extract the text.
Extracting text from audio
The following code demonstrates how to use the Whisper model from Hugging Face to transcribe audio files into text. It covers loading necessary libraries, processing an audio file, generating a transcription using the model, and finally decoding and printing the transcribed text. Let’s have a look at the code, which you can also find here: https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter13/5.whisper.py.
Let’s begin:
- We’ll start by importing the required libraries:
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
- Next, we load the Whisper processor and model from Hugging Face:
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- Next, we define the path to your audio file:
audio_path = "chapter13/audio/3.chain orchestrator.mp3"
You can replace this file with any other audio you want.
- Then, we load the audio file:
audio, rate = librosa.load(audio_path, sr=16000)
Let’s expand on this: audio will be a NumPy array containing the audio samples, and rate is the sampling rate of the audio file. The sr=16000 argument resamples the audio to a sampling rate of 16 kHz, which is the required input sampling rate for the Whisper model.
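If you want to confirm what came back from librosa.load, a quick, purely illustrative check is enough:
# Inspect the loaded waveform
print(type(audio), audio.dtype)      # <class 'numpy.ndarray'> float32
print(rate)                          # 16000
print(len(audio) / rate, "seconds")  # duration of the clip in seconds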
- Now, we preprocess the audio file for the Whisper model:
input_features = processor(audio, sampling_rate=rate, return_tensors="pt").input_features
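If you are curious about what the processor produces, you can inspect the tensor; for whisper-large-v2 it is typically a log-mel spectrogram of shape (1, 80, 3000), that is, 80 mel bins over a 30-second padded window (the exact shape is stated here as an assumption based on the model's defaults):
# Log-mel spectrogram features, padded/trimmed to a 30-second window
print(input_features.shape)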
- We then generate the transcription:
with torch.no_grad():
    predicted_ids = model.generate(input_features)
This line passes the preprocessed audio features to the model to generate transcription IDs. The model produces token IDs that correspond to the transcribed text.
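You can verify that these are indeed token IDs rather than text by peeking at the tensor before decoding:
# predicted_ids holds integer token IDs, not readable text
print(predicted_ids.shape)    # (1, sequence_length)
print(predicted_ids[0][:10])  # the first few token IDs of the transcription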
- Now, we decode the generated transcription:
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
This line decodes the predicted token IDs back into readable text. The [0] at the end extracts the first (and only) transcription from the resulting list.
- Finally, we print the transcribed text:
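A call such as the following prints the result (the exact print statement may differ slightly in the repository code):
print(transcription)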
"As you can see, you need what we call a chain orchestrator to coordinate all the steps. So all the steps from raising the question all the way to the response. And the most popular open source packages are Lama Index and LangChain that we can recommend. Very nice. So these chains, these steps into the RAG application or any other LLM application, you can have many steps happening, right? So you need this chain to help them orchestrate"
Is the transcription slow?
Depending on the model size and your hardware capabilities, the transcription process might take some time.
As we can see, the transcription is excellent. Now that we have transcribed the YouTube video, there are several valuable actions you can take. First, you can create captions or subtitles to improve accessibility for viewers who are deaf or hard of hearing. Writing a summary or extracting key points can help viewers grasp the main ideas without watching the entire video. The transcription can also be transformed into a blog post or article, providing more context on the topic discussed. Extracting quotes or highlights from the transcription allows you to create engaging social media posts that promote the video. Using the transcription for SEO purposes can improve the video’s search engine ranking by including relevant keywords in the description. You can also develop FAQs or discussion questions based on the video to encourage viewer engagement. The transcription can serve as a reference for research, and you might consider adapting it into a script for an audiobook or podcast. Incorporating the transcription into educational materials, such as lesson plans, is another effective way to use the content. Lastly, you can create visual summaries or infographics based on the key points to present the main ideas visually. How cool is that?
In the following section, we will expand the use case and do some emotion detection from the transcribed text.
Detecting emotions
Emotion detection from text, often referred to as sentiment analysis or emotion recognition, is a subfield of natural language processing (NLP) that focuses on identifying and classifying emotions conveyed in written content. This area of study has gained significant traction due to the growing amount of textual data generated across social media, customer feedback, and other platforms.
In our case, we will use the j-hartmann/emotion-english-distilroberta-base model, which is built upon the DistilRoBERTa architecture. DistilRoBERTa is a smaller and faster variant of the RoBERTa model, which itself is based on the Transformer architecture. The model has been fine-tuned specifically for emotion detection: it was trained on a dataset designed to recognize various emotions expressed in text, making it adept at identifying and classifying emotions in written content. It is designed to detect the following emotions:
- Joy: This represents happiness and positivity
- Sadness: This reflects feelings of sorrow and unhappiness
- Anger: This indicates feelings of frustration, annoyance, or rage
- Fear: This conveys feelings of anxiety or apprehension
- Surprise: This represents astonishment or unexpectedness
- Disgust: This reflects feelings of aversion or distaste
- Neutral: This indicates a lack of strong emotion or feeling
These emotions are typically derived from various datasets that categorize text based on emotional expressions, allowing the model to classify input text into these predefined categories.
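Before wiring the model into our audio pipeline, here is a quick standalone sketch of how it behaves on a plain sentence, using the transformers pipeline API (the sample sentence and the expected label are illustrative):
from transformers import pipeline

# Single-label emotion classifier built on DistilRoBERTa
emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)
print(emotion_classifier("I can't wait to try this on my own recordings!"))
# Returns a list with one dict, e.g. [{'label': 'joy', 'score': 0.97}]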
Let’s have a look at the code, which is also available here: https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter13/6.emotion_detection.py.
Memory check
The following code is memory intensive, so you may need to allocate more memory if you are working on virtual machines or Google Colab. The code was tested on a Mac M1 with 16 GB of memory.
Let’s start coding:
- We first import the libraries required for this example:
import torch
import pandas as pd
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutoModelForSequenceClassification, AutoTokenizer
import librosa
import numpy as np
- We then load the Whisper processor and model from Hugging Face:
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- Then, we load the emotion detection processor and model from Hugging Face:
emotion_model_name = "j-hartmann/emotion-english-distilroberta-base"
emotion_tokenizer = AutoTokenizer.from_pretrained(emotion_model_name)
emotion_model = AutoModelForSequenceClassification.from_pretrained(emotion_model_name)
- We define the path to your audio file:
audio_path = "chapter13/audio/3.chain orchestrator.mp3" # Replace with your actual audio file path
- Once the path is defined, we load the audio file:
audio, rate = librosa.load(audio_path, sr=16000)
- We create a function called split_audio to split the audio into chunks:
def split_audio(audio, rate, chunk_duration=30):
    # Number of samples in each chunk (default: 30 seconds of audio)
    chunk_length = int(rate * chunk_duration)
    # Total number of chunks needed to cover the full signal
    num_chunks = int(np.ceil(len(audio) / chunk_length))
    return [audio[i*chunk_length:(i+1)*chunk_length] for i in range(num_chunks)]
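As a quick sanity check of the chunking arithmetic, you can run the function on a synthetic 75-second signal (purely illustrative):
import numpy as np

fake_audio = np.zeros(75 * 16000, dtype=np.float32)   # 75 seconds at 16 kHz
chunks = split_audio(fake_audio, rate=16000, chunk_duration=30)
print(len(chunks))                        # 3 -> two full 30-second chunks plus a 15-second remainder
print([len(c) / 16000 for c in chunks])   # [30.0, 30.0, 15.0]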
- We also create a function to transcribe audio using Whisper:
def transcribe_audio(audio_chunk, rate):
    # Preprocess the audio chunk into input features for Whisper
    input_features = whisper_processor(audio_chunk, sampling_rate=rate, return_tensors="pt").input_features
    # Generate token IDs and decode them into text
    with torch.no_grad():
        predicted_ids = whisper_model.generate(input_features)
    transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription
The function preprocesses the audio file for the Whisper model and generates the transcription. Once it’s generated, the function decodes the generated transcription.
- We then create a function to detect emotions from text using the emotion detection model:
def detect_emotion(text):
    # Tokenize the text, truncating/padding to the model's maximum length
    inputs = emotion_tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    outputs = emotion_model(**inputs)
    # Pick the class with the highest logit and map it to its label
    predicted_class_id = torch.argmax(outputs.logits, dim=-1).item()
    emotions = emotion_model.config.id2label
    return emotions[predicted_class_id]
This function begins by tokenizing the input text with emotion_tokenizer, converting it into PyTorch tensors while handling padding, truncation, and maximum length constraints. The tokenized input is then fed into emotion_model, which generates raw prediction scores (logits) for the various emotion classes. The function identifies the emotion with the highest score using torch.argmax to determine the class ID. This ID is then mapped to the corresponding emotion label through the id2label dictionary provided by the model’s configuration. Finally, the function returns the detected emotion as a readable label.
- Then, we split the audio into chunks:
audio_chunks = split_audio(audio, rate, chunk_duration=30) # 30-second chunks
- We also create a DataFrame to store the results:
df = pd.DataFrame(columns=['Chunk Index', 'Transcription', 'Emotion'])
- Finally, we process each audio chunk:
for i, audio_chunk in enumerate(audio_chunks):
    transcription = transcribe_audio(audio_chunk, rate)
    emotion = detect_emotion(transcription)
    # Append results to DataFrame
    df.loc[i] = [i, transcription, emotion]
The output emotions are shown for each chunk of transcribed text, and in our case, all are neutral, as the video is just a teaching concept video:
   Chunk Index  Emotion
0            0  neutral
1            1  neutral
2            2  neutral
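If you want to keep these results for later analysis, a one-liner such as the following writes them to disk (the output path is just an example):
# Persist the chunk-level transcriptions and emotions
df.to_csv("chapter13/emotion_results.csv", index=False)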
Now, we will expand our use case a bit further to demonstrate how you can take the transcribed text and pass it through an LLM to create highlights for the YouTube video.
Automatically creating video highlights
In the era of digital content consumption, viewers often seek concise and engaging summaries of longer videos. Automatically creating video highlights involves analyzing video content and extracting key moments that capture the essence of the material. This process saves time and improves content accessibility, making it a valuable tool for educators, marketers, and entertainment providers alike.
Let’s have a look at the code. You can find it at the following link: https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter13/7.write_highlights.py.
In this code, we will expand the Whisper example. We will transcribe the text, then join all the transcribed chunks together, and finally, we will pass all the transcriptions to the LLM to create the highlights for the entire video. Let’s continue the previous example:
- We start by initializing the Hugging Face model:
model_name = "mistralai/Mistral-Nemo-Instruct-2407" # Using Mistral for instruction-following
- Then, we add your Hugging Face API token:
api_token = "" # Replace with your actual API token
- Here’s the LangChain setup that we’ll be using in this use case. Notice the new prompt that we added:
prompt_template = PromptTemplate(
    input_variables=["text"],
    template='''This is the transcribed text from a YouTube video.
Write the key highlights from this video in bullet format.
{text}
Output:
'''
)
huggingface_llm = HuggingFaceHub(
    repo_id=model_name,
    huggingfacehub_api_token=api_token,
    model_kwargs={"task": "text-generation"}
)
llm_chain = LLMChain(prompt=prompt_template, llm=huggingface_llm)
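This snippet assumes that PromptTemplate, HuggingFaceHub, and LLMChain have already been imported; depending on your LangChain version, the imports look roughly like this (a sketch, as the exact module paths vary between releases):
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.llms import HuggingFaceHub  # or langchain.llms in older releases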
- Next, we generate the transcription:
def transcribe_audio(audio_chunk, rate):
    input_features = whisper_processor(audio_chunk, sampling_rate=rate, return_tensors="pt").input_features
    with torch.no_grad():
        predicted_ids = whisper_model.generate(input_features)
    transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription
- Then, we create a function to generate the key highlights from text using the LLM:
def generate_highlights(text):
    try:
        response = llm_chain.run(text)
        return response.strip()  # Clean up any whitespace around the response
    except Exception as e:
        print(f"Error generating highlights: {e}")
        return "error"  # Handle errors gracefully
- Next, we split the audio into chunks:
audio_chunks = split_audio(audio, rate, chunk_duration=30) # 30-second chunks
- We then transcribe each audio chunk:
transcriptions = [transcribe_audio(chunk, rate) for chunk in audio_chunks]
- Then, we join all transcriptions into a single text:
full_transcription = " ".join(transcriptions)
- Finally, we generate highlights from the full transcription:
highlights = generate_highlights(full_transcription)
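As with the transcription earlier, a simple print call (the exact statement may differ in the repository code) displays the result:
print(highlights)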
Let’s see the automatically created highlights:
- Chain Orchestrator: Required to coordinate all steps in a LLM (Large Language Model) application, such as RAG (Retrieval-Augmented Generation).
- Popular Open Source Packages: Lama Index and LangChain are recommended for this purpose.
- Modularization: Chains allow for modularization of the process, making it easier to update or change components like LMs or vector stores without rebuilding the entire application.
- Rapid Advancements in JNNIA
As we can see, there are some minor mistakes, mainly coming from the Whisper process, but other than that, it is actually pretty good.
In the next part, we will quickly review the research happening in the audio space, as it is a rapidly evolving field.
Future research in audio preprocessing
There is a growing trend toward the development of multimodal LLMs capable of processing various types of data, including audio. Currently, many language models are primarily text-based, but we anticipate the emergence of models that can handle text, images, and audio simultaneously. These multimodal LLMs have diverse applications, such as generating image captions and providing medical diagnoses based on patient reports. Research is underway to extend LLMs to support direct speech inputs. As noted, “Several studies have attempted to extend LLMs to support direct speech inputs with a connection module” (https://arxiv.org/html/2406.07914v2), indicating ongoing efforts to incorporate audio processing capabilities into LLMs. Although these challenges are not unique to audio, LLMs face several hurdles when handling data types beyond text, including the following:
- High computational resources required for processing
- Data privacy and security concerns
Researchers are actively exploring various strategies to overcome these challenges. To address the high computational demands, there is a focus on developing more efficient algorithms and architectures, such as transformer models with reduced parameter sizes and optimized training techniques. Techniques such as model compression, quantization, and distillation are being employed to make these models more resource-efficient without sacrificing performance (https://arxiv.org/abs/2401.13601, https://arxiv.org/html/2408.04275v1, https://arxiv.org/html/2408.01319v1).
In terms of data privacy and security, researchers are investigating privacy-preserving ML techniques, including federated learning and differential privacy. These approaches aim to protect sensitive data by allowing models to learn from decentralized data sources without exposing individual data points. Additionally, advancements in encryption and secure multi-party computation are being integrated to ensure that data remains confidential throughout the processing pipeline. These efforts are crucial for enabling the widespread adoption of multimodal LLMs across various domains while ensuring they remain efficient and secure (https://towardsdatascience.com/differential-privacy-and-federated-learning-for-medical-data-0f2437d6ece9, https://arxiv.org/pdf/2403.05156, https://pair.withgoogle.com/explorables/federated-learning/).
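To make the resource-efficiency point concrete, here is a small sketch of one such technique applied to the Whisper model used in this chapter: loading the weights in 16-bit precision roughly halves the memory footprint (a sketch, assuming your hardware supports float16 inference):
import torch
from transformers import WhisperForConditionalGeneration

# Same model as earlier in the chapter, loaded with half-precision weights
# to reduce memory use; accuracy is typically close to the full-precision model.
whisper_fp16 = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    torch_dtype=torch.float16,
)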
Let’s now summarize the learnings from this chapter.