Milestone 3 – Setting up Whisper pipeline components
The process of ASR can be broken down into three main parts:
- Feature extractor: This is the initial step of processing the raw audio inputs. Think of it as preparing the audio files so the model can easily understand and use them. The feature extractor turns the raw waveform into a representation that shows how the energy in different frequency bands changes over time. For Whisper, this is a log-Mel spectrogram computed over fixed 30-second windows, with shorter clips padded and longer ones truncated. This representation carries the cues the model needs to tell different words and sounds apart.
- The model: This is the core part of the ASR process. It performs what we call sequence-to-sequence mapping. In simpler terms, it takes the processed audio from the feature extractor and converts it into a sequence of text tokens. It’s like translating the language of sounds into the language of text. Whisper does this with a Transformer encoder-decoder: the encoder reads the audio features, and the decoder predicts the output one token at a time.
- Tokenizer: After the model has done its job of mapping the sounds to text, the tokenizer...
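
To make the flow of data between the three parts concrete, here is a minimal toy sketch in plain Python. It is not Whisper's real API: `extract_features`, `toy_model`, `decode`, and `VOCAB` are hypothetical stand-ins invented for illustration, and the "model" is just a threshold rule. It only shows the shape of the pipeline: raw samples go in, features come out, the model maps features to token IDs, and the decoding step maps token IDs to text.

```python
import math

# Toy stand-ins for the three stages. None of this is Whisper's real API;
# all names and values here are assumptions made for illustration.

def extract_features(samples, frame_size=4):
    """Feature extractor: split raw samples into frames, compute log energy per frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [math.log(sum(x * x for x in frame) + 1e-10) for frame in frames]

def toy_model(features, threshold=-5.0):
    """Hypothetical 'model': map each frame to a token ID (1 = loud, 0 = quiet)."""
    return [1 if f > threshold else 0 for f in features]

# Tiny made-up vocabulary mapping token IDs to strings.
VOCAB = {0: "<sil>", 1: "beep"}

def decode(token_ids):
    """Decoding step: turn token IDs into text, dropping the silence token."""
    return " ".join(VOCAB[t] for t in token_ids if t != 0)

# Four silent samples, four loud samples, four silent samples.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4

features = extract_features(samples)   # one log-energy value per frame
token_ids = toy_model(features)        # one token ID per frame
text = decode(token_ids)               # final text

print(token_ids, text)
```

In the Hugging Face Transformers library, these three roles are played by `WhisperFeatureExtractor`, `WhisperForConditionalGeneration`, and `WhisperTokenizer`; the toy functions above only mirror how data is handed from one stage to the next.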