Milestone 2 – Incorporating the Common Voice 11 dataset
The Common Voice dataset, spearheaded by Mozilla, represents a pioneering effort in democratizing speech technology through open and diverse speech corpora. A dataset is a structured collection of data where the rows typically represent individual observations or instances, and the columns represent the features or variables of those instances. In the case of Common Voice, each row represents an audio record, and each column represents features or characteristics applicable to the audio record. As an ever-expanding, community-driven initiative across 100+ languages, Common Voice optimally augments multilingual speech recognition systems like Whisper.
Integrating Common Voice data is straightforward with the Hugging Face Datasets
library. We load the desired language split in streaming mode to bypass extensive storage requirements and expedite fine-tuning workflows:
from datasets import load_dataset, DatasetDict common_voice...