Learn OpenAI Whisper: Transform your understanding of GenAI through robust and accurate speech processing solutions

Product type Paperback

Published in May 2024

Publisher Packt

ISBN-13 9781835085929

Length 372 pages

Edition 1st Edition

Author (1):

Josué R. Batista

Milestone 3 – Setting up Whisper pipeline components

The automatic speech recognition (ASR) process can be broken down into three main parts:

Feature extractor: This is the first step, which processes the raw audio input. Think of it as preparing the audio so that the model can easily understand and use it. The feature extractor converts the audio into a representation that highlights essential aspects of the sound, such as pitch and volume (for Whisper, this representation is a log-Mel spectrogram), which the model needs in order to recognize different words and sounds.
The model: This is the core of the ASR process. It performs what we call sequence-to-sequence mapping. In simpler terms, it takes the processed audio features from the feature extractor and converts them into a sequence of text tokens. It’s like translating the language of sounds into the language of text. This step relies on complex calculations and learned patterns to determine accurately what the audio says.
Tokenizer: After the model has done its job of mapping the sounds to text, the tokenizer...
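
To make the flow of data through these three parts concrete, here is a deliberately tiny sketch in plain Python. It is not Whisper and not how any real ASR system is implemented. The frame size, the threshold rule standing in for the model, and the three-entry vocabulary are all hypothetical choices made only to show how each stage's output becomes the next stage's input.

```python
import math

FRAME_SIZE = 4  # hypothetical: samples per frame (real systems use windows of a few hundred samples)

# Hypothetical three-entry vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {0: "<silence>", 1: "quiet", 2: "loud"}


def extract_features(waveform):
    """Stage 1 (feature extractor): raw samples -> one log-energy value per frame.

    A simplified stand-in for a log-Mel spectrogram.
    """
    features = []
    for start in range(0, len(waveform) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = waveform[start:start + FRAME_SIZE]
        energy = sum(sample * sample for sample in frame) / FRAME_SIZE
        features.append(math.log10(energy + 1e-10))  # small offset avoids log(0)
    return features


def toy_model(features):
    """Stage 2 (model): sequence of features -> sequence of token IDs.

    A threshold rule stands in for the neural network. A real
    sequence-to-sequence model does not emit exactly one token per frame.
    """
    token_ids = []
    for value in features:
        if value < -6.0:
            token_ids.append(0)
        elif value < -1.0:
            token_ids.append(1)
        else:
            token_ids.append(2)
    return token_ids


def decode(token_ids):
    """Stage 3 (tokenizer): token IDs -> readable text, skipping the silence token."""
    return " ".join(VOCAB[token_id] for token_id in token_ids if token_id != 0)


# One silent frame, one quiet frame, one loud frame.
waveform = [0.0] * 4 + [0.1, -0.1, 0.1, -0.1] + [0.9, -0.9, 0.9, -0.9]

features = extract_features(waveform)  # audio -> features
token_ids = toy_model(features)        # features -> token IDs
text = decode(token_ids)               # token IDs -> text

print(token_ids, text)  # prints: [0, 1, 2] quiet loud
```

In the Hugging Face Transformers library, the same three roles are filled by the WhisperFeatureExtractor, WhisperForConditionalGeneration, and WhisperTokenizer classes. The chain of calls has the same shape as in the sketch, although each stage is far more sophisticated.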