
NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale

  • 5 min read
  • 03 Dec 2018


In the paper ‘The challenge of realistic music generation: modelling raw audio at scale’, researchers from DeepMind tackle modelling music directly in the raw audio domain. They explore autoregressive discrete autoencoders (ADAs) as a way to let autoregressive models capture long-range correlations in waveforms.

Autoregressive models are the state of the art for generating raw audio waveforms of speech, but when applied to music they tend to capture local signal structure at the expense of long-range correlations. Since music exhibits structure at many different timescales, this makes realistic music generation a challenging task. The paper will be presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), held in Montréal, Canada this week.

Challenges when music is symbolically represented


Music has a complex structure by nature and is made up of waveforms that span different time periods and magnitudes. Modelling all of the temporal correlations in the sequence that arise from this structure is therefore challenging.

Most of the work in music generation so far has focused on symbolic representations. This approach, however, has several limitations. Symbolic representations abstract away the idiosyncrasies of a particular performance, and these nuances are often musically quite important, affecting a listener’s enjoyment of the music. As an example, the paper notes that the precise timing, timbre and volume of the notes played by a musician do not correspond exactly to those written in a score.

Symbolic representations are also often tailored to particular instruments, which reduces their generality and means substantial work is needed to apply existing modelling techniques to new instruments. Digital representations of audio waveforms, by contrast, retain all the musically relevant information, and models of waveforms can be applied to recordings of any set of instruments. However, the task is more challenging than modelling symbolic representations: generative models of waveforms that capture musical structure at many timescales require high representational capacity, distributed effectively over the various musically relevant timescales.

Steps performed to address music generation in the raw audio domain

  1. The researchers use autoregressive models to model structure across roughly 400,000 timesteps, or about 25 seconds of audio sampled at 16 kHz (see the short calculation after this list). They demonstrate a computationally efficient method of enlarging their receptive fields using autoregressive discrete autoencoders (ADAs).
  2. They explore the domain of autoregressive models for this task and introduce the argmax autoencoder (AMAE) as an alternative to vector quantisation variational autoencoders (VQ-VAE). This autoencoder converges more reliably when trained on a challenging dataset.
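
As a quick sanity check on these numbers, the duration covered by a receptive field is simply the number of timesteps divided by the sample rate; the snippet below (a minimal illustration, not code from the paper) reproduces the 25-second figure.

```python
# Receptive field quoted in the paper, expressed in seconds of audio.
sample_rate = 16_000          # samples per second (16 kHz)
receptive_field = 400_000     # timesteps the model needs to span

duration = receptive_field / sample_rate
print(f"{duration:.0f} seconds of audio")   # -> 25 seconds
```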


To model long-range structure in musical audio signals, the receptive fields (RFs) of AR models have to be enlarged. One way to do this is to provide a rich conditioning signal. The paper concentrates on this idea: attaching an encoder that learns a high-level conditioning signal directly from the data turns an AR model into an autoencoder. Temporal downsampling operations can be inserted into the encoder to make this signal more coarse-grained than the original waveform. The resulting autoencoder uses its AR decoder to model any local structure that this compressed signal cannot capture.
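
To make the wiring concrete, here is a schematic sketch of such an autoregressive discrete autoencoder in PyTorch. The module sizes, the 64x downsampling factor, the codebook lookup and the 1x1-convolution decoder are illustrative assumptions; the paper's actual models use WaveNet-style convolutional encoders and autoregressive decoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyADA(nn.Module):
    """Schematic autoregressive discrete autoencoder (illustrative only)."""

    def __init__(self, channels=64, codebook_size=256):
        super().__init__()
        # Encoder: strided convolutions downsample the waveform 64x in time,
        # so each discrete code summarises 64 raw audio samples.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=4, stride=4),
        )
        # Discrete bottleneck: a learned codebook of vectors.
        self.codebook = nn.Embedding(codebook_size, channels)
        # Stand-in for the AR decoder: the paper uses a WaveNet-style model that
        # predicts each sample from past samples plus the conditioning signal;
        # a 1x1 convolution over (conditioning, previous sample) shows the wiring.
        self.decoder = nn.Conv1d(channels + 1, 256, kernel_size=1)

    def forward(self, x):
        # x: (batch, 1, time) raw waveform, time divisible by 64
        queries = self.encoder(x).transpose(1, 2)           # (batch, time/64, ch)
        # Nearest-neighbour lookup in the codebook (vector quantisation).
        dists = ((queries.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        codes = dists.argmin(dim=-1)                         # discrete code sequence
        quantised = self.codebook(codes).transpose(1, 2)     # (batch, ch, time/64)
        # Upsample the coarse codes back to audio rate as a conditioning signal.
        conditioning = torch.repeat_interleave(quantised, 64, dim=-1)
        # Causal shift: the decoder only ever sees previous waveform samples.
        previous = F.pad(x, (1, 0))[..., :-1]
        logits = self.decoder(torch.cat([conditioning, previous], dim=1))
        return logits, codes                                  # per-sample logits
```

The key point is that the decoder only sees past waveform samples plus the coarse, quantised conditioning signal, so any local structure the codes cannot represent has to be modelled autoregressively by the decoder, as described above.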

The researchers went on to compare two techniques that can be used to learn this discrete conditioning signal: vector quantisation variational autoencoders (VQ-VAE) and the argmax autoencoder (AMAE).

Vector quantisation variational autoencoders use vector quantisation (VQ): the queries are vectors in a d-dimensional space, and a codebook of k such vectors is learnt on the fly, together with the rest of the model parameters. The loss function is as follows:

L_VQ-VAE = −log p(x | q_j) + (q_j − [q])² + β · ([q_j] − q)²
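
Here q is the encoder's query, q_j is the nearest codebook vector, and the bracketed terms [·] are treated as constants during backpropagation (a stop-gradient), so the second term moves the selected codebook vector towards the query while the β-weighted third term commits the query to its code. A minimal PyTorch-style sketch of the quantisation step and these two loss terms follows; the variable names and the β value are illustrative, and the reconstruction term −log p(x|q_j) is left to whatever autoregressive decoder is attached.

```python
import torch


def vq_loss_terms(queries, codebook, beta=0.25):
    """Vector-quantisation terms of the VQ-VAE loss (illustrative sketch).

    queries:  (batch, time, d) continuous encoder outputs q
    codebook: (k, d) learned codebook vectors
    Returns the quantised codes (with a straight-through gradient) and the
    codebook + commitment losses; the -log p(x|q_j) reconstruction term is
    computed by the autoregressive decoder and added separately.
    """
    # Nearest codebook entry for every query (squared Euclidean distance).
    dists = ((queries.unsqueeze(-2) - codebook) ** 2).sum(-1)   # (batch, time, k)
    indices = dists.argmin(dim=-1)                              # discrete codes
    q_j = codebook[indices]                                     # (batch, time, d)

    # (q_j - [q])^2 : move the selected codebook vector towards the query.
    codebook_loss = ((q_j - queries.detach()) ** 2).mean()
    # beta * ([q_j] - q)^2 : commitment term, move the query towards its code.
    commitment_loss = beta * ((q_j.detach() - queries) ** 2).mean()

    # Straight-through estimator: the forward pass uses q_j,
    # gradients flow back to the continuous queries.
    quantised = queries + (q_j - queries).detach()
    return quantised, codebook_loss + commitment_loss
```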


However, the issue with VQ-VAEs when trained on challenging (i.e. high-entropy) datasets is that they often suffer from codebook collapse. At some point during training, some portion of the codebook may fall out of use and the model will no longer use the full capacity of the discrete bottleneck, leading to worse results and poor reconstructions.
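
Codebook collapse is usually easy to spot during training by tracking how many distinct codes the model actually selects. The helper below is a small illustrative diagnostic along those lines, not something described in the paper:

```python
import torch


def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries selected in a batch of code assignments.

    indices: LongTensor of code indices (any shape), e.g. the argmin indices
    produced during vector quantisation. A value that stays well below 1.0
    over many batches is a symptom of codebook collapse.
    """
    used = torch.unique(indices).numel()
    return used / codebook_size
```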

As an alternative to the VQ-VAE method, the researchers propose a model they call the argmax autoencoder (AMAE). It produces k-dimensional queries and features a nonlinearity that ensures all outputs lie on the (k − 1)-simplex. The quantisation operation is then simply an argmax, which is equivalent to taking the nearest k-dimensional one-hot vector in the Euclidean sense.

This projection onto the simplex limits the maximal quantisation error, which makes the gradients that pass through it more accurate. To make sure the full capacity is used, an additional diversity loss term is added, encouraging the model to use all outputs in equal measure. This loss can be computed using batch statistics: all queries q (before quantisation) are averaged across the batch and time axes, and the resulting vector q̄ is encouraged to resemble a uniform distribution.
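
Below is a sketch of the AMAE quantisation step and a diversity loss in the same PyTorch style. The choice of softmax as the simplex-projecting nonlinearity and the exact form of the diversity penalty are assumptions made for illustration; the description above only specifies that queries lie on the (k − 1)-simplex, that quantisation is an argmax, and that q̄ should be pushed towards a uniform distribution.

```python
import torch
import torch.nn.functional as F


def amae_quantise(logits):
    """Argmax autoencoder quantisation step (illustrative sketch).

    logits: (batch, time, k) raw encoder outputs.
    A softmax (an assumed choice of simplex nonlinearity) maps each query onto
    the (k-1)-simplex; quantisation then takes the nearest one-hot vector,
    which is simply an argmax.
    """
    q = F.softmax(logits, dim=-1)                        # queries on the simplex
    one_hot = F.one_hot(q.argmax(dim=-1), q.size(-1)).to(q.dtype)
    # Straight-through estimator: the forward pass uses the one-hot code,
    # gradients flow through the continuous query q.
    quantised = q + (one_hot - q).detach()
    return q, quantised


def diversity_loss(q):
    """Encourage all k outputs to be used equally, via batch statistics."""
    q_bar = q.mean(dim=(0, 1))                           # average over batch and time
    uniform = torch.full_like(q_bar, 1.0 / q_bar.numel())
    return ((q_bar - uniform) ** 2).sum()
```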


Results of the experiment


This is what the researchers achieved:

  1. Addressed the challenge of music generation in the raw audio domain by applying autoregressive models and extending their receptive fields in a computationally efficient manner.
  2. Introduced the argmax autoencoder (AMAE), an alternative to VQ-VAE which shows improved stability for the task.
  3. Showed that using separately trained autoregressive models at different levels of abstraction captures long-range correlations in audio signals across tens of seconds, corresponding to hundreds of thousands of timesteps, at the cost of some signal fidelity.


You can refer to the paper for a comparison of results obtained across various autoencoders and for more insights on this topic.
