ChatGPT has earned its reputation as a versatile and capable assistant. From crafting the perfect piece of writing to planning your next adventure, aiding your coding endeavors, or simply engaging in light-hearted conversations, ChatGPT can do it all. It's like having a digital Swiss Army knife at your fingertips. But have you ever wondered what it would be like if ChatGPT could communicate with you not just through text, but also through speech? Imagine the convenience of issuing voice commands and receiving spoken responses, just like with your own personal Siri. Well, the good news is that this is now possible thanks to the remarkable combination of OpenAI Whisper and Bark.
Bringing the power of voice interaction to ChatGPT is a game-changer. Instead of typing out your queries and waiting for text-based responses, you can seamlessly converse with ChatGPT, making your interactions more natural and efficient. Whether you're a multitasking enthusiast, a visually impaired individual, or someone who prefers spoken communication, this development holds incredible potential.
So, how is this magic achieved? The answer lies in the fusion of two crucial components: Speech-to-Text (STT) and Text-to-Speech (TTS) modules.
STT, as the name suggests, is the technology responsible for converting spoken words into text. OpenAI's Whisper is a groundbreaking pre-trained model for Automatic Speech Recognition (ASR) and speech translation. The model has been trained on an astonishing 680,000 hours of labeled data, giving it an impressive ability to adapt to a variety of datasets and domains without the need for fine-tuning.
Whisper comes in two flavors: English-only and multilingual models. The English-only models are trained for the specific task of speech recognition, where they accurately predict transcriptions in the same language as the spoken audio. The multilingual models, on the other hand, are trained to handle both speech recognition and speech translation: given audio in a non-English language, they can predict the transcription directly in English, adding an extra layer of versatility. Imagine speaking in one language and having ChatGPT instantly respond in another - Whisper makes it possible.
On the other side of the conversation, we have Text-to-Speech (TTS) technology. This essential component converts ChatGPT's textual responses into lifelike speech. Bark, an open-source model developed by Suno AI, is a transformer-based text-to-speech marvel. It's what makes ChatGPT's spoken responses sound as engaging and dynamic as Siri's.
Just like with Whisper, Bark is a reliable choice for its remarkable ability to turn text into speech, creating a human-like conversational experience. ChatGPT now not only thinks like a human but speaks like one too, thanks to Bark.
The beauty of this integration is that it doesn't require you to be a tech genius. HuggingFace, a leading platform for natural language processing, supports both the STT and TTS pipelines. In simpler terms, it streamlines the entire process, making it accessible to anyone.
You don't need to be a master coder or an AI specialist to make it work. All you have to do is select the model you prefer for STT (Whisper) and another for TTS (Bark). Input your commands and queries, and let HuggingFace take care of the rest. The result? An intelligent, voice-activated ChatGPT that can assist you with whatever you need.
Without wasting any more time, take a deep breath, make yourself comfortable, and get ready to learn how to utilize both Whisper and Bark along with OpenAI GPT-3.5-Turbo to create your own Siri!
OpenAI Whisper is a powerful ASR/STT model that can be seamlessly integrated into your projects. It has been pre-trained on an extensive dataset, making it highly capable of recognizing and transcribing spoken language.
Here's how you can use OpenAI Whisper for STT with the HuggingFace pipeline. Note that `sample_audio` here will be the user's command to ChatGPT.
import torch
from transformers import pipeline

# Use a GPU if one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

stt = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    chunk_length_s=30,  # process long recordings in 30-second chunks
    device=device,
)

# `sample_audio` can be a path to an audio file or a NumPy array of samples
text = stt(sample_audio, return_timestamps=True)["text"]
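As a side note, if you load one of the multilingual checkpoints, you can also ask the pipeline to translate non-English speech straight into English text. Here is a minimal sketch, assuming `sample_audio` contains non-English speech; the `task` generation argument is forwarded to Whisper at call time:

# Ask Whisper to translate the speech into English instead of transcribing it
english_text = stt(
    sample_audio,
    generate_kwargs={"task": "translate"},
)["text"]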
The foundation of any AI model's prowess lies in the data it's exposed to during its training. Whisper is no exception. This ASR model has been trained on a staggering 680,000 hours of audio data and the corresponding transcripts, all carefully gathered from the vast landscape of the internet.
Here's how that massive amount of data is divided:
● English Dominance (65%): A substantial 65% of the training data, which equates to a whopping 438,000 hours, is dedicated to English-language audio and matched English transcripts. This abundance of English data ensures that Whisper excels in transcribing English speech accurately.
● Multilingual Versatility (18%): Whisper doesn't stop at English. About 18% of its training data, roughly 126,000 hours, focuses on non-English audio paired with English transcripts. This diversity makes Whisper a versatile ASR model capable of handling different languages while still providing English transcriptions.
● Global Reach (17%): The remaining 17%, which translates to 117,000 hours, is dedicated to non-English audio and the corresponding transcripts. This extensive collection represents a stunning 98 different languages. Whisper's proficiency in transcribing non-English languages is a testament to its global reach.
With the user's speech command now transcribed into text, the next step is to harness the power of ChatGPT or GPT-3.5-Turbo. This is where the real magic happens. These advanced language models have achieved fame for their diverse capabilities, whether you need help with writing, travel planning, coding, or simply engaging in a friendly conversation.
There are several ways to integrate ChatGPT into your system: you can call the OpenAI REST API directly over HTTP, or use the official openai Python library to query GPT-3.5-Turbo.
No matter which method you choose, ChatGPT will take your transcribed speech command and generate a thoughtful, context-aware text-based response, ready to assist you in any way you desire. We won't dive deep into this step here, since numerous articles already explain it.
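For completeness, here is a minimal sketch of the second option, using the official openai Python library (the pre-1.0 ChatCompletion interface). The API key placeholder and the system prompt are illustrative assumptions, and `text` is the Whisper transcription from the previous step:

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: replace with your own OpenAI key

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},  # illustrative prompt
        {"role": "user", "content": text},  # the transcribed speech command
    ],
)

# The text that we will later turn into speech with Bark
response_text = completion.choices[0].message.content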
The final piece of the puzzle is Bark, an open-source TTS model. Bark works its magic by converting ChatGPT's textual responses into lifelike speech, much like Siri talks to you. It adds that crucial human touch to the conversation, making your interactions with ChatGPT feel more natural and engaging.
Again, we can build the TTS pipeline very easily with the help of the HuggingFace pipeline. Here's how you can use Bark for TTS. Note that `text` here will be ChatGPT's response to the user's command.
from transformers import pipeline
from IPython.display import Audio

# Load the Bark TTS model
tts = pipeline("text-to-speech", model="suno/bark-small")
# `text` is ChatGPT's textual response from the previous step
response = tts(text)
# Play the generated speech inside a notebook
Audio(response["audio"], rate=response["sampling_rate"])
You can hear sample output from the Bark model in this Google Colab notebook.
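Putting it all together, the whole loop looks something like the sketch below. It assumes the `stt`, `tts`, `openai`, and `Audio` pieces from the earlier snippets are already set up, and that `record_command()` is a hypothetical helper that captures the user's voice from a microphone (you could implement it with a library such as sounddevice):

def ask_assistant(sample_audio):
    # 1. Speech-to-text: transcribe the user's spoken command with Whisper
    command = stt(sample_audio, return_timestamps=True)["text"]

    # 2. ChatGPT: generate a text response with GPT-3.5-Turbo
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": command}],
    )
    reply = completion.choices[0].message.content

    # 3. Text-to-speech: voice the response with Bark
    speech = tts(reply)
    return Audio(speech["audio"], rate=speech["sampling_rate"])

# Hypothetical helper: record the user's command from the microphone
sample_audio = record_command()
ask_assistant(sample_audio)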
Congratulations on making it this far! Throughout this article, you have learned how to build your own Siri with the help of OpenAI Whisper, ChatGPT, and Bark. Best of luck with your experiment in creating your own Siri, and see you in the next article!
Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time on his hobbies: watching movies and working on side projects.
Currently, Louis is an NLP Research Engineer at Yellow.ai, the world’s leading CX automation platform. Check out Louis’ website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.