In this section, we'll combine different audio clips. We'll learn to encode the audio, optionally save the resulting encodings to disk, mix (add) them, and then decode the mixed encodings to retrieve a sound clip.
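To make the mixing step concrete, here is a minimal sketch using NumPy only. It assumes that an encoding can be treated as a plain floating-point array, and it uses random arrays as stand-ins for real encoder output. The array shape, the `mix_encodings` helper, and the file name are all hypothetical. Averaging (rather than a plain sum) is also an assumption, chosen here so that the mixed values stay in the same range as the inputs. Encoding and decoding with NSynth itself are not shown.

```python
import os
import tempfile

import numpy as np

# Stand-ins for two encodings. The shape (batch, time steps, channels)
# is illustrative only and is not taken from the NSynth model.
rng = np.random.default_rng(0)
encoding_a = rng.normal(size=(1, 32, 16)).astype(np.float32)
encoding_b = rng.normal(size=(1, 32, 16)).astype(np.float32)


def mix_encodings(first, second):
    """Mix two encodings of identical shape by averaging them element-wise."""
    if first.shape != second.shape:
        raise ValueError("encodings must have the same shape to be mixed")
    return (first + second) / 2.0


mixed = mix_encodings(encoding_a, encoding_b)

# Optionally save the mixed encoding to disk and load it back later.
path = os.path.join(tempfile.mkdtemp(), "mixed_encoding.npy")
np.save(path, mixed)
restored = np.load(path)
```

Saving the encoding is useful when encoding is slow: the array can be reloaded and mixed with other encodings without encoding the source clip again.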
We'll be handling 1-second audio clips only. There are two reasons for this: first, processing audio is computationally expensive, and second, we want to generate instrument notes in the form of short audio clips. The latter is interesting for us because we can then sequence the audio clips using MIDI generated by the models we used in the previous chapters. In that sense, you can view NSynth as a generative instrument, and the previous models, such as MusicVAE or Melody RNN, as a generative score composer. With both elements, we can generate full tracks, with audio and structure.
To generate sound...