Gaining insights into data, annotation, and model training
Now that we’ve covered Whisper’s semi-supervised training methodology, the next step is to look more closely at how to curate the data that drives targeted performance gains. While web-scale corpora provide a strong starting point, fine-tuning for niche applications requires building a customized dataset.
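To make this concrete, here is a minimal sketch of what the first step of customized dataset development can look like in code. It assumes the Hugging Face datasets library, and the file paths and transcripts are hypothetical placeholders standing in for your own domain-specific recordings; they are not part of Whisper itself.

```python
# A minimal sketch of assembling a small, domain-specific dataset for later
# fine-tuning. Assumes the Hugging Face `datasets` library; the paths and
# transcripts below are hypothetical placeholders.
from datasets import Dataset, Audio

samples = {
    "audio": ["clips/call_001.wav", "clips/call_002.wav"],  # placeholder paths
    "sentence": [
        "Please confirm the shipment number.",
        "The invoice was paid on Friday.",
    ],
}

dataset = Dataset.from_dict(samples)

# Decode audio on access and resample to 16 kHz, the rate Whisper's
# feature extractor expects.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

print(dataset[0]["audio"]["sampling_rate"])  # 16000
```

Keeping each audio clip paired with its reference text in a single dataset object like this makes the annotation and fine-tuning steps that follow straightforward to script.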
Keep in mind the concepts we covered earlier about how transformers process sequences. Traditional sequence-to-sequence models, such as RNNs, process input sequences step by step, which is slow for long sequences. In contrast, transformers process all tokens in the input sequence in parallel, leading to faster training times. Whisper’s transformer sequence-to-sequence model is trained on several speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. As shown in Figure 3.4, these tasks are...
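To see how these tasks share a single model, the short sketch below uses the open source openai-whisper package; the audio file name is a hypothetical placeholder. The same encoder-decoder weights handle both transcription and translation, with the task chosen through decoding options that map to special task tokens inside the model.

```python
# A minimal sketch of Whisper's multitask interface, using the open source
# `openai-whisper` package. The audio file name is a hypothetical placeholder.
import whisper

model = whisper.load_model("base")

# One model, different tasks selected at decode time:
# "transcribe" keeps the spoken language; "translate" outputs English text.
transcription = model.transcribe("meeting_fr.wav", task="transcribe", language="fr")
translation = model.transcribe("meeting_fr.wav", task="translate", language="fr")

print(transcription["text"])  # French transcript of the audio
print(translation["text"])    # English translation of the same audio
```

If you omit the language option, the model first runs spoken language identification on the opening audio window, and each returned segment includes a no_speech_prob score, reflecting the voice activity detection signal the decoder learns during training.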