You're reading from Learn OpenAI Whisper: Transform your understanding of GenAI through robust and accurate speech processing solutions

Product type Paperback

Published in May 2024

Publisher Packt

ISBN-13 9781835085929

Length 372 pages

Edition 1st Edition

Concepts

GPT/LLMs

Author (1):

Josué R. Batista

Table of Contents (16) Chapters

Preface

1. Part 1: Introducing OpenAI’s Whisper

2. Chapter 1: Unveiling Whisper – Introducing OpenAI’s Whisper

3. Chapter 2: Understanding the Core Mechanisms of Whisper

4. Part 2: Underlying Architecture

5. Chapter 3: Diving into the Whisper Architecture

6. Chapter 4: Fine-Tuning Whisper for Domain and Language Specificity

7. Part 3: Real-world Applications and Use Cases

8. Chapter 5: Applying Whisper in Various Contexts

9. Chapter 6: Expanding Applications with Whisper

10. Chapter 7: Exploring Advanced Voice Capabilities

11. Chapter 8: Diarizing Speech with WhisperX and NVIDIA’s NeMo

12. Chapter 9: Harnessing Whisper for Personalized Voice Synthesis

13. Chapter 10: Shaping the Future with Whisper

14. Index

15. Other Books You May Enjoy

Milestone 4 – Transforming raw speech data into Mel spectrogram features

Speech can be represented as a one-dimensional array that changes over time, with each point in the array representing the amplitude (loudness) of the sound at that instant. To understand speech, we need to capture its frequency content and other acoustic features, which can be derived by analyzing how the amplitude varies over time.

However, speech is a continuous sound stream, and computers can’t store infinitely many values. So, we must convert this continuous stream into a series of discrete values by sampling the speech at regular intervals. The sampling rate is measured in samples per second, or hertz (Hz). The higher the sampling rate, the more accurately the samples capture the speech, but the more data there is to store for every second of audio.
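To make the trade-off concrete, here is a minimal sketch of sampling in plain Python. It is not taken from the book's code: the function name `sample_sine`, the 440 Hz tone, and the two rates are illustrative assumptions, with a pure tone standing in for a real speech signal.

```python
import math

def sample_sine(freq_hz, duration_s, sampling_rate_hz):
    """Sample a pure tone at regular intervals (hypothetical helper).

    Returns a one-dimensional list of amplitudes, one value per sample.
    """
    num_samples = int(duration_s * sampling_rate_hz)
    return [
        math.sin(2 * math.pi * freq_hz * i / sampling_rate_hz)
        for i in range(num_samples)
    ]

# The same one-second, 440 Hz tone sampled at two different rates.
low_rate = sample_sine(440.0, 1.0, 8_000)
high_rate = sample_sine(440.0, 1.0, 16_000)

# Doubling the sampling rate doubles the number of values stored per second.
print(len(low_rate), len(high_rate))
```

Both lists describe the same second of sound; the higher rate simply measures the amplitude at twice as many points in time.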

The sampling rate of the audio must match the rate the speech recognition model expects. If the rates don’t match, the model misinterprets the timing and pitch of the signal, which leads to recognition errors. For example, playing a sound sampled at 16 kHz at 8...
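The effect of a sampling-rate mismatch can be worked out with simple arithmetic. The sketch below is an illustration only: the helper names and the `EXPECTED_RATE_HZ` value of 16,000 are assumptions made for this example, not part of any library API.

```python
# Assumed target rate for this example only.
EXPECTED_RATE_HZ = 16_000

def apparent_duration_s(num_samples, assumed_rate_hz):
    """Duration a clip seems to have when read at assumed_rate_hz."""
    return num_samples / assumed_rate_hz

def apparent_frequency_hz(true_freq_hz, recorded_rate_hz, assumed_rate_hz):
    """Pitch a tone seems to have when the samples are read at the wrong rate."""
    return true_freq_hz * assumed_rate_hz / recorded_rate_hz

def check_sampling_rate(actual_rate_hz, expected_rate_hz=EXPECTED_RATE_HZ):
    """Fail early instead of silently feeding mismatched audio to a model."""
    if actual_rate_hz != expected_rate_hz:
        raise ValueError(
            f"Audio is sampled at {actual_rate_hz} Hz but "
            f"{expected_rate_hz} Hz is expected; resample it first."
        )

# One second of audio recorded at 16 kHz contains 16,000 samples.
num_samples = 16_000

# Read at 8 kHz, the same samples stretch over twice the time...
print(apparent_duration_s(num_samples, 8_000))

# ...and a 440 Hz tone in the recording appears one octave lower.
print(apparent_frequency_hz(440.0, 16_000, 8_000))
```

A guard such as `check_sampling_rate` is one way to catch the mismatch before feature extraction; in practice the audio would then be resampled to the expected rate with an audio library rather than rejected.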