WaveNet is a deep generative network that produces raw audio waveforms. Applied to text-to-speech, it generates sound waves that mimic the human voice. Its authors reported that listeners rated the generated speech as more natural than the best text-to-speech systems available at the time, reducing the gap between system and human performance by over 50%.
The model is autoregressive and probabilistic: it predicts a distribution for each audio sample conditioned on all the samples that came before it. Even so, it can be trained efficiently on audio with tens of thousands of samples per second. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity.
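To make "autoregressive and probabilistic" concrete, here is a minimal sketch of the sampling loop only: each new sample is drawn from a distribution that depends on the samples generated so far and on a speaker identity. Everything in it is an assumption made for illustration. `toy_next_sample_probs` is a hypothetical stand-in for the trained network, the 256 amplitude levels are an assumed quantization, and the speaker shift is arbitrary. It is not WaveNet's architecture.

```python
import math
import random

NUM_LEVELS = 256  # assumed number of quantized amplitude levels


def toy_next_sample_probs(history, speaker_id, num_speakers):
    """Hypothetical stand-in for the network: p(x_t | x_1..x_{t-1}, speaker)."""
    # Center the distribution near the previous sample, nudged per speaker.
    prev = history[-1] if history else NUM_LEVELS // 2
    center = prev + (speaker_id - (num_speakers - 1) / 2.0)
    weights = [
        math.exp(-0.5 * ((level - center) / 4.0) ** 2)
        for level in range(NUM_LEVELS)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, speaker_id, num_speakers=4, seed=0):
    """Draw samples one at a time, feeding each one back in as context."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_samples):
        probs = toy_next_sample_probs(history, speaker_id, num_speakers)
        # Probabilistic step: sample from the distribution, do not take argmax.
        next_sample = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        history.append(next_sample)
    return history
```

Calling `generate(16000, speaker_id=2)` would produce one second of toy output at an assumed 16,000 samples per second. Changing only `speaker_id` changes the conditioning while the same function serves every speaker, which mirrors the idea of one model switching between voices.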
Allowing people to talk to machines is a long-standing dream of human-computer interaction, one depicted in the movie Her. The...