Understanding text-to-speech in voice synthesis
Text-to-speech (TTS) is a crucial component of the voice synthesis process: it generates speech from written text using the synthesized voice. Understanding the fundamentals of TTS is essential to grasping how voice synthesis works and how it can be applied in various scenarios. Figure 9.1 gives a high-level overview of how TTS works in the context of voice synthesis, without delving too deeply into technical specifics:
Figure 9.1 – The TTS voice synthesis pipeline
There are five components in the TTS voice synthesis pipeline:
- Text preprocessing:
  - The input text is first normalized and preprocessed.
  - Numbers, abbreviations, and special characters are expanded into full words.
  - The text is divided into individual sentences, words, and phonemes (distinct sound units).
- Text-to-spectrogram:
  - The normalized text is converted into a sequence of linguistic features and encoded into a vector representation...
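
To make the text preprocessing step more concrete, the following is a minimal sketch in Python of its three sub-steps: normalization, expansion of numbers, abbreviations, and special characters, and division into sentences, words, and phonemes. It is an illustration under stated assumptions, not how any particular TTS system implements this stage. The lookup tables (`ABBREVIATIONS`, `SYMBOLS`, `TOY_LEXICON`) and all function names are hypothetical and deliberately tiny; production systems typically rely on much larger pronunciation dictionaries and learned grapheme-to-phoneme models.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
SYMBOLS = {"%": " percent", "&": " and "}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]
# Toy pronunciation lexicon using ARPAbet-style symbols (assumed entries).
TOY_LEXICON = {"cats": ["K", "AE", "T", "S"]}


def number_to_words(n: int) -> str:
    """Spell out 0-99; larger numbers fall back to digit-by-digit reading."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    return " ".join(ONES[int(digit)] for digit in str(n))


def normalize(text: str) -> str:
    """Lowercase the text and expand abbreviations, symbols, and numbers."""
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.lower().split()]
    text = " ".join(tokens)
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(sentence: str) -> list[str]:
    """Extract lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence)


def to_phonemes(word: str) -> list[str]:
    """Look the word up in the toy lexicon; otherwise spell it letter by letter."""
    return TOY_LEXICON.get(word, [ch.upper() for ch in word if ch.isalpha()])


if __name__ == "__main__":
    raw = "Dr. Lee has 42 cats. They eat 5% more!"
    for sentence in split_sentences(normalize(raw)):
        words = tokenize(sentence)
        print(sentence, [to_phonemes(w) for w in words])
```

Note that the abbreviations are expanded before the text is split into sentences. Otherwise, the period in an abbreviation such as "Dr." would be mistaken for the end of a sentence.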