WaveNet is a generative model of time-domain waveforms. It produces natural-sounding, high-fidelity audio and is already used in some complete TTS systems. However, the inputs to WaveNet take significant domain expertise to produce, because they require elaborate text-analysis systems and a detailed pronunciation guide.
Tacotron is a sequence-to-sequence architecture that produces magnitude spectrograms from a sequence of characters, i.e. it synthesizes speech directly from text. It uses a single neural network, trained from data alone, to produce the linguistic and acoustic features. To turn the spectrograms into audio, Tacotron uses the Griffin-Lim algorithm for phase estimation. Griffin-Lim produces characteristic artifacts and lower audio fidelity than approaches like WaveNet. So although Tacotron handled patterns of rhythm and sound well, its audio quality made it unsuitable for producing a final speech product.
Tacotron 2 combines the two approaches described above. It features a Tacotron-style recurrent sequence-to-sequence feature prediction network that generates mel spectrograms, followed by a modified version of WaveNet that generates time-domain waveform samples conditioned on the generated mel spectrogram frames.
Source: https://arxiv.org/pdf/1712.05884.pdf
In contrast to Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder. Also, each decoder step corresponds to a single spectrogram frame.
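To make the "one decoder step, one spectrogram frame" point concrete, here is a minimal sketch of an autoregressive decoding loop. It is not the paper's implementation: `run_decoder` and `toy_step` are hypothetical names, the step function is a stand-in for the real LSTM decoder, and the fixed step count is a simplification made for illustration.

```python
from typing import Callable, List

Frame = List[float]  # one mel spectrogram frame: one value per mel channel


def run_decoder(step: Callable[[Frame], Frame], n_steps: int, n_mels: int) -> List[Frame]:
    """Run a toy autoregressive loop that emits exactly one frame per step."""
    prev: Frame = [0.0] * n_mels  # all-zero frame used to start decoding
    frames: List[Frame] = []
    for _ in range(n_steps):
        prev = step(prev)    # each step sees the previously generated frame
        frames.append(prev)  # and contributes a single new frame
    return frames


def toy_step(prev: Frame) -> Frame:
    """Hypothetical placeholder for the learned decoder step."""
    return [x + 1.0 for x in prev]


# Assumption for illustration: 4 decoder steps, 3 mel channels.
frames = run_decoder(toy_step, n_steps=4, n_mels=3)
print(len(frames))  # number of frames equals number of decoder steps
```

The only property this sketch is meant to show is that the length of the output spectrogram equals the number of decoder steps.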
The original WaveNet used linguistic features, phoneme durations, and log F0 at a frame rate of 5 ms. However, predicting spectrogram frames spaced this closely led to significant pronunciation issues. Hence, the WaveNet architecture used in Tacotron 2 works with 12.5 ms feature spacing, using only 2 upsampling layers in the transposed convolutional network.
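The frame spacing determines how many waveform samples the vocoder must generate for each conditioning frame. The sketch below works through that arithmetic. The 24 kHz sample rate and the split of the upsampling into factors of 15 and 20 are assumptions chosen for illustration, not values taken from the text above; the only real constraint is that the per-layer factors multiply to the number of samples per frame.

```python
from math import prod


def samples_per_frame(sample_rate_hz: int, hop_ms: float) -> int:
    """Number of audio samples covered by one frame hop."""
    samples = sample_rate_hz * hop_ms / 1000
    if samples != int(samples):
        raise ValueError("hop does not align to a whole number of samples")
    return int(samples)


SAMPLE_RATE_HZ = 24_000  # assumption: output audio sample rate

hop_5ms = samples_per_frame(SAMPLE_RATE_HZ, 5.0)      # 120 samples per frame
hop_12_5ms = samples_per_frame(SAMPLE_RATE_HZ, 12.5)  # 300 samples per frame

# Hypothetical strides for the 2 transposed-convolution upsampling layers.
UPSAMPLE_FACTORS = (15, 20)
assert prod(UPSAMPLE_FACTORS) == hop_12_5ms

print(hop_5ms, hop_12_5ms)  # 120 300
```

Under these assumptions, moving from 5 ms to 12.5 ms spacing means the feature prediction network emits 80 frames per second instead of 200, and the vocoder expands each frame into 300 samples.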
Here’s how it works:
The Tacotron 2 system can be trained directly from data without relying on complex feature engineering. It achieves state-of-the-art sound quality close to that of natural human speech: the model achieves a mean opinion score (MOS) of 4.53, comparable to the MOS of 4.58 for professionally recorded speech. Google has also provided some Tacotron 2 audio samples that demonstrate the results of its TTS system.
Going forward, Google intends to work on improving the system's pronunciation of complex words, generating audio in real time, and directing generated speech to sound happy or sad.
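For readers unfamiliar with the metric, a mean opinion score is the average of listener ratings on a fixed scale, usually reported with a confidence interval. The sketch below shows the calculation on made-up ratings. The 1 to 5 scale, the ratings themselves, and the normal-approximation 95% interval are assumptions for illustration, not data or methodology from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for a list of listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1 to 5 scale")
    mos = mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Made-up ratings for illustration only.
ratings = [5, 4, 5, 4.5, 5, 4, 4.5, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With real evaluations the list would hold many ratings per utterance from many listeners, and scores such as 4.53 and 4.58 are best compared with their confidence intervals in view.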
The full paper is available on arXiv: https://arxiv.org/pdf/1712.05884.pdf