In the paper ‘The Challenge of realistic music generation: modelling raw audio at scale’, researchers from DeepMind have embarked on modelling music in the raw audio domain. They have explored autoregressive discrete autoencoders (ADAs) to enable autoregressive models to capture long-range correlations in waveforms.
Autoregressive models are the current state of the art for generating raw audio waveforms of speech, but when applied to music they tend to capture local signal structure at the expense of modelling long-range correlations. Since music exhibits structure at many different timescales, this bias is problematic and makes realistic music generation a challenging task. The paper will be presented at the 32nd Conference on Neural Information Processing Systems (NIPS 2018), to be held in Montréal, Canada this week.
Music is complex by nature: its audio waveforms exhibit structure over many different timescales and magnitudes. Modelling all of the temporal correlations that arise from this structure is therefore challenging.
Most work on music generation has focused on symbolic representations, but this approach has several limitations. Symbolic representations abstract away the idiosyncrasies of a particular performance, and these nuances are often musically important and affect a listener's enjoyment of the music. As an example, the paper notes that the precise timing, timbre and volume of the notes played by a musician do not correspond exactly to those written in a score.
Symbolic representations are also often tailored to particular instruments, which reduces their generality and means that substantial work is needed to apply existing modelling techniques to new instruments. Digital representations of audio waveforms, on the other hand, retain all the musically relevant information, and models of waveforms can be applied to recordings of any set of instruments. However, modelling waveforms is much more challenging than modelling symbolic representations: capturing musical structure at many timescales requires high representational capacity, distributed effectively over the various musically relevant timescales.
To model long-range structure in musical audio signals, the receptive fields (RFs) of AR models have to be enlarged. One way to do this is to provide a rich conditioning signal. The paper concentrates on this notion, which turns an AR model into an autoencoder by attaching an encoder that learns a high-level conditioning signal directly from the data. Temporal downsampling operations can be inserted into the encoder to make this signal more coarse-grained than the original waveform. The resulting autoencoder uses its AR decoder to model any local structure that this compressed signal cannot capture.
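To make the idea concrete, here is a minimal sketch of what a temporally downsampling encoder could look like, written in PyTorch. The class name, layer sizes and downsampling factor are illustrative assumptions, not the paper's architecture; the point is simply that a stack of strided convolutions turns a long waveform into a much shorter, coarse-grained conditioning sequence for the AR decoder.

```python
import torch
import torch.nn as nn

class DownsamplingEncoder(nn.Module):
    """Illustrative encoder: strided 1-D convolutions shrink the time axis."""
    def __init__(self, hidden=128, downsample_factor=64):   # factor assumed to be a power of two
        super().__init__()
        layers, in_ch, factor = [], 1, downsample_factor
        while factor > 1:
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            in_ch, factor = hidden, factor // 2
        self.net = nn.Sequential(*layers)

    def forward(self, wav):      # wav: (batch, 1, num_samples)
        return self.net(wav)     # (batch, hidden, num_samples / downsample_factor)

encoder = DownsamplingEncoder(downsample_factor=64)
coarse = encoder(torch.randn(2, 1, 16384))   # 16,384 raw audio samples
print(coarse.shape)                          # torch.Size([2, 128, 256]): 64x fewer timesteps
```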
The researchers went on to compare two techniques for learning the discrete bottleneck of such an autoregressive discrete autoencoder: the vector quantisation variational autoencoder (VQ-VAE) and the argmax autoencoder (AMAE).
Vector quantisation variational autoencoders (VQ-VAEs) use vector quantisation (VQ): the queries produced by the encoder are vectors in a d-dimensional space, and a codebook of k such vectors is learnt on the fly, together with the rest of the model parameters. Each query is replaced by its nearest codebook vector before being passed to the decoder. The loss function is as follows:
$$\mathcal{L}_{\text{VQ-VAE}} = -\log p(x \mid q_j) + (q_j - [q])^2 + \beta \cdot ([q_j] - q)^2$$
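Here $q$ is the query produced by the encoder, $q_j$ is its nearest codebook vector and $[\cdot]$ denotes the stop-gradient operator: the second term pulls codebook vectors towards the queries, while the third, weighted by $\beta$, keeps the queries committed to the codebook. Below is a minimal PyTorch sketch of such a bottleneck; the function name, tensor shapes and the straight-through estimator used to pass gradients through the quantisation step are illustrative assumptions rather than the paper's actual code.

```python
import torch

def vq_bottleneck(query, codebook, beta=0.25):
    """Hypothetical VQ bottleneck sketch.
    query:    (batch, time, d) continuous encoder outputs
    codebook: (k, d) learnable codebook vectors"""
    # squared Euclidean distance from every query to every codebook vector
    dists = ((query.unsqueeze(2) - codebook) ** 2).sum(dim=-1)   # (batch, time, k)
    q_j = codebook[dists.argmin(dim=-1)]                         # nearest code: (batch, time, d)

    codebook_loss = ((q_j - query.detach()) ** 2).mean()   # (q_j - [q])^2: updates the codebook
    commit_loss   = ((q_j.detach() - query) ** 2).mean()   # ([q_j] - q)^2: keeps queries near codes
    # straight-through estimator: the decoder sees q_j, gradients flow back into the query
    q_st = query + (q_j - query).detach()
    return q_st, codebook_loss + beta * commit_loss

# usage: the total loss is the decoder's -log p(x | q_j) plus the auxiliary term returned here
codebook = torch.nn.Parameter(torch.randn(512, 64))   # k = 512 codes of dimension d = 64
q_st, aux_loss = vq_bottleneck(torch.randn(8, 100, 64), codebook)
```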
However, the issue with VQ-VAEs when trained on challenging (i.e. high-entropy) datasets is that they often suffer from codebook collapse. At some point during training, some portion of the codebook may fall out of use and the model will no longer use the full capacity of the discrete bottleneck, leading to worse results and poor reconstructions.
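In practice, codebook collapse is straightforward to detect by tracking how many codes the model actually selects. The diagnostic below is a simple sketch (not from the paper): it measures the fraction of codebook entries used in a batch and the perplexity of the code distribution, which stays far below k when the capacity of the bottleneck is under-used.

```python
import torch

def codebook_usage(indices, k):
    """indices: (batch, time) tensor of selected codebook indices; k: codebook size."""
    counts = torch.bincount(indices.flatten(), minlength=k).float()
    probs = counts / counts.sum()
    used_fraction = (counts > 0).float().mean().item()   # share of codes selected at least once
    perplexity = torch.exp(-(probs * (probs + 1e-10).log()).sum()).item()
    return used_fraction, perplexity   # perplexity close to k means the full codebook is in use
```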
As an alternative to the VQ-VAE, the researchers propose a model they call the argmax autoencoder (AMAE). It produces k-dimensional queries and features a nonlinearity that ensures all outputs lie on the (k − 1)-simplex. The quantisation operation is then simply an argmax operation, which is equivalent to taking the nearest k-dimensional one-hot vector in the Euclidean sense.
This projection onto the simplex limits the maximal quantisation error, which makes the gradients that pass through it more accurate. To make sure the full capacity of the bottleneck is used, an additional diversity loss term is added that encourages the model to use all outputs in equal measure. This loss can be computed using batch statistics, by averaging all queries q (before quantisation) across the batch and time axes and encouraging the resulting vector q̄ to resemble a uniform distribution.
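Below is a minimal sketch of how such an AMAE-style bottleneck could be implemented, again in PyTorch and with names and details that are illustrative assumptions rather than the paper's code: a softmax (one possible choice of simplex nonlinearity) maps the encoder output onto the simplex, an argmax picks the nearest one-hot code with a straight-through estimator for the gradients, and the diversity term penalises the batch-averaged query q̄ for deviating from a uniform distribution. The exact form of the penalty used here (squared distance to the uniform vector) is also an assumption; the paper may use a different formulation.

```python
import torch
import torch.nn.functional as F

def amae_bottleneck(logits):
    """logits: (batch, time, k) raw encoder outputs."""
    q = F.softmax(logits, dim=-1)                       # query now lies on the (k-1)-simplex
    one_hot = F.one_hot(q.argmax(dim=-1), num_classes=q.size(-1)).to(q.dtype)
    q_quantised = q + (one_hot - q).detach()            # straight-through: forward pass is one-hot
    return q_quantised, q

def diversity_loss(q):
    """Encourage all k codes to be used in equal measure."""
    q_bar = q.mean(dim=(0, 1))                          # average queries over batch and time: (k,)
    uniform = torch.full_like(q_bar, 1.0 / q_bar.numel())
    return ((q_bar - uniform) ** 2).sum()               # assumed penalty: squared distance to uniform

codes, q = amae_bottleneck(torch.randn(8, 100, 512))    # k = 512
loss_div = diversity_loss(q)
```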
Using these autoregressive discrete autoencoders, the researchers were able to unconditionally generate piano music directly in the raw audio domain, with stylistic consistency across tens of seconds.
You can refer to the paper for a comparison of results obtained across various autoencoders and for more insights on this topic.