In the previous section, we used NSynth to generate new sound samples by combining existing sounds. You may have noticed that the audio synthesis process is very time-consuming. This is because autoregressive models such as WaveNet generate the waveform one audio sample at a time: each new sample depends on the samples that came before it, so reconstruction must proceed sequentially, which makes it slow.
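To make the bottleneck concrete, here is a minimal sketch of the autoregressive generation loop. The `predict_next` function is a hypothetical stand-in for a trained model such as WaveNet; the point is the shape of the loop, one model call per output sample:

```python
import numpy as np

def autoregressive_generate(num_samples, predict_next, seed=1.0, context_size=1024):
    """Sketch of autoregressive synthesis: one sequential model call per sample."""
    waveform = np.zeros(num_samples, dtype=np.float32)
    waveform[0] = seed
    for t in range(1, num_samples):
        context = waveform[max(0, t - context_size):t]
        waveform[t] = predict_next(context)  # depends on all previous samples
    return waveform

# Toy stand-in "model": continue by decaying the last sample (not a real network).
toy_model = lambda ctx: 0.9 * ctx[-1]

# One second of 16 kHz audio requires 15,999 sequential steps.
audio = autoregressive_generate(16000, toy_model)
```

Even with a trivial model, the loop cannot be parallelized: sample `t` cannot be computed before sample `t - 1` exists.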
GANSynth, on the other hand, uses upsampling convolutions, which makes it possible to train on, and generate, the entire audio sample in parallel. This is a major advantage over autoregressive models such as NSynth's WaveNet, whose sample-by-sample generation makes poor use of parallel GPU hardware.
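The following sketch illustrates the upsampling idea with NumPy. GANSynth's generator uses stacks of learned upsampling convolutions; here each layer is replaced by nearest-neighbour upsampling followed by a fixed smoothing kernel (an assumption for illustration, not the actual network), so the whole waveform is produced in a single forward pass rather than sample by sample:

```python
import numpy as np

def upsample_generate(latent, num_layers=3, factor=4):
    """Sketch of a GAN-style generator: expand a latent vector into a
    full-length signal in one pass via repeated upsampling + convolution."""
    kernel = np.array([0.25, 0.5, 0.25])  # stand-in for learned filter weights
    signal = latent
    for _ in range(num_layers):
        signal = np.repeat(signal, factor)                 # upsample by 4x
        signal = np.convolve(signal, kernel, mode="same")  # "convolution" layer
    return signal

z = np.random.default_rng(42).normal(size=64)  # latent code
audio = upsample_generate(z)                   # 64 * 4**3 = 4096 samples at once
```

Because no output sample depends on a previously generated one, every layer operates on the whole signal at once, which is exactly the kind of work GPUs execute efficiently.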
The results of GANSynth are impressive:
- Training on the NSynth dataset converges in ~3-4 days on a single V100 GPU. For comparison, the NSynth WaveNet model converges in 10 days on 32...