In this chapter, we looked at audio generation using two models, NSynth and GANSynth, and produced many audio clips by interpolating samples and generating new instruments. We started by explaining what WaveNet models are and why they are used in audio generation, particularly in text-to-speech applications. We also introduced the WaveNet autoencoder, an encoder-decoder network capable of learning its own temporal embedding. We then talked about visualizing audio with rainbowgrams, which use the reduced dimensions of the latent space.
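As a refresher, a rainbowgram is a constant-Q transform plot in which brightness encodes log magnitude and color encodes the time derivative of the phase (the instantaneous frequency). The following is a minimal sketch of that idea using librosa and matplotlib, not the exact code from the chapter; the file name sample.wav is a placeholder:

```python
import librosa
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

# Load a mono clip at NSynth's 16 kHz sample rate ("sample.wav" is a placeholder).
audio, sample_rate = librosa.load("sample.wav", sr=16000)

# Constant-Q transform: log-spaced frequency bins suit musical audio.
cqt = librosa.cqt(audio, sr=sample_rate, hop_length=512,
                  n_bins=84, bins_per_octave=12)

# Brightness comes from log magnitude; color comes from the frame-to-frame
# derivative of the unwrapped phase (instantaneous frequency).
magnitude = np.log1p(np.abs(cqt))
phase = np.unwrap(np.angle(cqt), axis=1)
inst_freq = np.diff(phase, axis=1) / np.pi

# Map instantaneous frequency to hue and normalized magnitude to value (HSV).
hue = ((inst_freq + 1.0) / 2.0) % 1.0
value = magnitude[:, 1:] / (magnitude.max() + 1e-6)
rgb = hsv_to_rgb(np.stack([hue, np.ones_like(hue), value], axis=-1))

plt.imshow(rgb, origin="lower", aspect="auto")
plt.xlabel("Time (frames)")
plt.ylabel("CQT bins")
plt.title("Rainbowgram")
plt.show()
```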
Then, we showed the NSynth dataset and the NSynth neural instrument. Through an example of combining pairs of sounds, we learned how to mix two different encodings and then synthesize the result into new sounds. Finally, we looked at the GANSynth model, a more performant model for audio generation. We showed the example of generating...
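To recap how that encoding mix works in practice, here is a minimal sketch using Magenta's NSynth fastgen module: it encodes two clips into the WaveNet autoencoder's temporal embedding, averages the encodings, and synthesizes the result back into audio. The checkpoint and WAV paths are placeholders, and averaging is only one simple way to mix two encodings:

```python
import numpy as np
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen

# Assumed download location of the pre-trained WaveNet checkpoint.
checkpoint_path = "wavenet-ckpt/model.ckpt-200000"
sample_length = 16000 * 4  # 4 seconds at NSynth's 16 kHz

# Load and encode both sounds ("flute.wav" and "bass.wav" are placeholders).
audio1 = utils.load_audio("flute.wav", sample_length=sample_length, sr=16000)
audio2 = utils.load_audio("bass.wav", sample_length=sample_length, sr=16000)
encoding1 = fastgen.encode(audio1, checkpoint_path, sample_length)
encoding2 = fastgen.encode(audio2, checkpoint_path, sample_length)

# Mix the two encodings by averaging, then decode the result into a new sound.
mixed_encoding = (encoding1 + encoding2) / 2.0
fastgen.synthesize(mixed_encoding,
                   save_paths=["flute_bass_mix.wav"],
                   checkpoint_path=checkpoint_path)
```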