PVS step 1 – Converting audio files into LJSpeech format
This section and the accompanying notebook, LOAIW_ch09_2_Processing_audio_to_LJ_format_with_Whisper_OZEN.ipynb, represent the initial step in the three-step PVS process outlined in this chapter. This step takes an audio sample of the target voice as input and processes it into the LJSpeech dataset format. The notebook demonstrates using the OZEN Toolkit and OpenAI’s Whisper to extract speech, transcribe it, and organize the data according to the LJSpeech structure. The resulting LJSpeech-formatted dataset, consisting of segmented audio files and corresponding transcriptions, serves as the input for the second step, PVS step 2 – Fine-tuning a discrete variational autoencoder using the DLAS toolkit, in which a PVS model is fine-tuned on this dataset.
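To make the flow concrete, here is a minimal sketch of the step's core idea: transcribe an input recording with Whisper, slice the audio at the segment boundaries Whisper reports, and write the pieces out in an LJSpeech-style layout (a wavs/ folder plus a pipe-delimited metadata file). This is not the OZEN Toolkit or the notebook's exact code; the file paths, model size, and 22,050 Hz mono output are illustrative assumptions.

```python
import os
import whisper
from pydub import AudioSegment

INPUT_AUDIO = "target_voice.wav"   # hypothetical input recording of the target voice
OUTPUT_DIR = "ljspeech_dataset"    # hypothetical output folder

os.makedirs(os.path.join(OUTPUT_DIR, "wavs"), exist_ok=True)

# Transcribe the recording; Whisper returns timestamped segments.
model = whisper.load_model("medium")
result = model.transcribe(INPUT_AUDIO)

# Resample to a single-channel 22,050 Hz signal, a common choice for TTS corpora.
audio = AudioSegment.from_file(INPUT_AUDIO).set_frame_rate(22050).set_channels(1)

metadata_lines = []
for i, seg in enumerate(result["segments"]):
    clip_id = f"segment_{i:04d}"
    # pydub slices in milliseconds; cut the clip at the segment boundaries.
    clip = audio[int(seg["start"] * 1000):int(seg["end"] * 1000)]
    clip.export(os.path.join(OUTPUT_DIR, "wavs", f"{clip_id}.wav"), format="wav")
    metadata_lines.append(f"{clip_id}|{seg['text'].strip()}")

# LJSpeech-style metadata: one "file_id|transcription" line per clip.
with open(os.path.join(OUTPUT_DIR, "metadata.csv"), "w", encoding="utf-8") as f:
    f.write("\n".join(metadata_lines))
```

The key design point is that the transcription and the segmentation come from the same Whisper pass, so each exported clip is guaranteed to line up with its text.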
An LJSpeech-formatted dataset is crucial for TTS model training, as it provides a standardized structure for organizing audio files and their corresponding transcriptions...
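For reference, the original LJSpeech corpus pairs a wavs/ folder of short clips with a pipe-delimited metadata.csv whose columns are a file ID, the raw transcription, and a normalized transcription. The layout below is a schematic illustration; the exact file and folder names produced by the OZEN Toolkit may differ.

```text
ljspeech_dataset/
├── metadata.csv
└── wavs/
    ├── segment_0000.wav
    ├── segment_0001.wav
    └── ...

# metadata.csv, one clip per line: file_id|raw transcription|normalized transcription
segment_0000|Dr. Smith arrived at 10 a.m.|Doctor Smith arrived at ten a m
```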