Here are some ways that deep learning will elevate music and the listening experience itself:
At the most basic level, a deep learning algorithm for music generation follows three simple steps: encode existing music as sequences of vectors, train a model to predict each next step in those sequences, and then sample from the trained model to produce new music.
Long short-term memory (LSTM) architectures are a common choice for music generation. They take structured musical notation as input: each timestep is encoded as a vector and fed into the LSTM, which predicts the encoding of the next timestep. Convolutional and fully connected layers can be added to capture rich features in the frequency domain and improve the quality of the generated music.
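A minimal sketch of this next-step prediction setup is shown below, assuming a vocabulary of integer-encoded note tokens and using Keras; the vocabulary size, window length, and layer sizes are illustrative placeholders, not taken from any particular system.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 128   # assumed: one token per MIDI pitch
SEQ_LEN = 64       # assumed: length of the input window

# Each training example is a window of note tokens; the target is the next token.
model = models.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    layers.LSTM(256),                                # summarises the window so far
    layers.Dense(VOCAB_SIZE, activation="softmax"),  # distribution over the next note
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy data standing in for a corpus of encoded music notation.
X = np.random.randint(0, VOCAB_SIZE, size=(1000, SEQ_LEN))
y = np.random.randint(0, VOCAB_SIZE, size=(1000,))
model.fit(X, y, epochs=1, batch_size=32)

# Generation step: predict the next note for a seed window; repeating this
# and appending each prediction yields a new sequence.
seed = X[:1]
next_note = int(np.argmax(model.predict(seed)[0]))
```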
Magenta, Google's popular art and music project, has launched Performance RNN, an LSTM-based recurrent neural network designed to produce polyphonic performances with expressive timing and dynamics. In other words, Performance RNN determines which notes to play, when to play them, and how hard to strike each note.
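The event-based representation behind this kind of model can be illustrated with a small sketch. The general scheme (note-on and note-off events per pitch, plus quantised time shifts and velocities) follows Magenta's description of Performance RNN, but the exact bin counts and the encoding functions here are only an illustration.

```python
# Sketch of a performance-event vocabulary: each event maps to a single integer
# so a sequence model can learn "what to play, when, and how hard".
NUM_PITCHES = 128      # MIDI pitches
NUM_TIME_SHIFTS = 100  # assumed quantisation: 10 ms steps up to 1 second
NUM_VELOCITIES = 32    # coarse loudness bins

def note_on(pitch):        # which note to start
    return pitch

def note_off(pitch):       # which note to release
    return NUM_PITCHES + pitch

def time_shift(steps):     # when: advance time by `steps` quanta
    return 2 * NUM_PITCHES + (steps - 1)

def set_velocity(bin_):    # how hard: change the current strike velocity
    return 2 * NUM_PITCHES + NUM_TIME_SHIFTS + bin_

# "Play middle C softly, hold it for half a second, then release it."
events = [set_velocity(8), note_on(60), time_shift(50), note_off(60)]
```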
IBM’s Watson Beat uses a neural network to produce complete tracks by understanding music theory, structure, and emotional intent. According to Richard Daskas, a music composer working on the Watson Beat project, “Watson only needs about 20 seconds of musical inspiration to create a song.”
Deep learning methods can also be used to arrange a piece of music for a different instrument. LSTM networks are a popular choice for music transcription and modelling: they are trained on a large dataset of pre-labelled transcriptions (expressed in ABC notation) and then used to generate new transcriptions, as in the sketch below.
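A character-level LSTM over ABC text is one simple way to realise this. The sketch below assumes the transcriptions have already been concatenated into a single string; the tiny repeated corpus and the hyperparameters are placeholders.

```python
import numpy as np
from tensorflow.keras import layers, models

# Tiny stand-in for a corpus of ABC-notation transcriptions.
corpus = "X:1\nT:Example\nK:D\n|:DFA dAF|ABd AFD|GBd gdB|AFD D2:|\n" * 50
chars = sorted(set(corpus))
char_to_idx = {c: i for i, c in enumerate(chars)}

SEQ_LEN = 40
encoded = np.array([char_to_idx[c] for c in corpus])

# Build (window, next-character) training pairs.
X = np.stack([encoded[i:i + SEQ_LEN] for i in range(len(encoded) - SEQ_LEN)])
y = encoded[SEQ_LEN:]

model = models.Sequential([
    layers.Embedding(len(chars), 32),
    layers.LSTM(128),
    layers.Dense(len(chars), activation="softmax"),  # next ABC character
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=1, batch_size=64)
```

Sampling characters from the trained model one at a time, seeded with a few bars of ABC text, yields new transcriptions in the same notation.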
In fact, transformed audio data can be used to predict which group of notes is currently being played, by treating transcription as an image classification problem. The "image" here is a spectrogram, which displays how the frequency content of the audio changes over time; it is computed with a Short-Time Fourier Transform (STFT) or a constant-Q transform. The spectrogram is then fed to a convolutional neural network (CNN), which estimates the notes present in the audio through 88 output nodes, one for each piano key. Such a network is typically trained on a large number of examples derived from MIDI files spanning several genres of music.
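A minimal version of that pipeline might look like the following, assuming librosa for the STFT and Keras for the CNN; the bundled audio example, window width, and layer sizes are illustrative, and a real transcription model would be trained on MIDI-aligned labels rather than the random targets used here.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

# Compute a spectrogram: magnitude of the STFT, in dB, frequency x time.
audio, sr = librosa.load(librosa.example("trumpet"), sr=22050)
spec = librosa.amplitude_to_db(np.abs(librosa.stft(audio, n_fft=2048, hop_length=512)))

# Slice the spectrogram into fixed-width "images" (freq bins x frames x 1 channel).
WINDOW = 32
frames = [spec[:, i:i + WINDOW] for i in range(0, spec.shape[1] - WINDOW, WINDOW)]
X = np.expand_dims(np.stack(frames), -1)

# CNN with 88 sigmoid outputs: one per piano key, each a "note present?" probability.
model = models.Sequential([
    layers.Input(shape=X.shape[1:]),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(88, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Targets would come from aligned MIDI files; random labels stand in here.
y = np.random.randint(0, 2, size=(len(frames), 88)).astype("float32")
model.fit(X, y, epochs=1, batch_size=8)
```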
Magenta has also developed the NSynth dataset, a large, high-quality collection of annotated musical notes for tasks such as music transcription. It is inspired by image recognition datasets and contains a huge number of labelled notes.
Neural networks are also used to make intelligent music recommendations, going a step beyond traditional collaborative filtering. A neural recommendation system can analyse the songs a user has saved and use them to suggest new ones. Neural networks can also analyse songs in terms of musical qualities such as pitch, chord progression, and bass, and then use the similarities between songs that share these traits to recommend music with a similar lyrical and musical style; a toy version of this similarity idea is sketched below.
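As a simple illustration, suppose each song has already been mapped to a feature vector (pitch, tempo, or chord statistics, or an embedding produced by a neural network); recommendations can then be ranked by cosine similarity to the songs the user has saved. The song names and feature values below are made up.

```python
import numpy as np

# Hypothetical feature vectors per song (e.g. learned embeddings or audio statistics).
catalog = {
    "song_a": np.array([0.9, 0.1, 0.4]),
    "song_b": np.array([0.8, 0.2, 0.5]),
    "song_c": np.array([0.1, 0.9, 0.2]),
}
saved = ["song_a"]  # songs the user has saved

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Average the saved-song vectors into a taste profile, then rank the rest by similarity.
profile = np.mean([catalog[s] for s in saved], axis=0)
ranked = sorted(
    (s for s in catalog if s not in saved),
    key=lambda s: cosine(profile, catalog[s]),
    reverse=True,
)
print(ranked)  # most similar songs first
```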
Convolutional neural networks (CNNs) are also used for music recommendation. A time-frequency representation of the audio signal is fed into the network as input, and three-second clips randomly sampled from the audio are used for training. The CNN then predicts latent factors from the music audio by averaging its predictions over consecutive clips; the feature extraction and pooling layers allow it to operate on several timescales.
Spotify is working on a music recommendation system with a CNN. This recommendation system, when trained on short clips of songs, can create playlists based on the audio content only.
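A sketch of that latent-factor regression is given below, assuming mel-spectrograms of roughly three-second clips as input and a small Keras CNN whose targets are latent factors taken from a collaborative-filtering model. All sizes are placeholders, and this is not Spotify's actual architecture.

```python
import numpy as np
from tensorflow.keras import layers, models

N_MELS, N_FRAMES = 128, 130   # ~3 s of audio as a mel-spectrogram (assumed shape)
N_FACTORS = 40                # dimensionality of the collaborative-filtering factors

# Convolutions and pooling over time let the network see several timescales.
model = models.Sequential([
    layers.Input(shape=(N_MELS, N_FRAMES, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(N_FACTORS),   # regression onto the latent factors
])
model.compile(optimizer="adam", loss="mse")

# Toy clips and factors stand in for real training data.
clips = np.random.rand(256, N_MELS, N_FRAMES, 1).astype("float32")
factors = np.random.rand(256, N_FACTORS).astype("float32")
model.fit(clips, factors, epochs=1, batch_size=32)

# At prediction time, average the model's outputs over consecutive clips of a song.
song_clips = np.random.rand(10, N_MELS, N_FRAMES, 1).astype("float32")
song_factors = model.predict(song_clips).mean(axis=0)
```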
Classifying music by genre is another achievement of neural networks, and the LSTM network lies at the heart of this application.
Deepsound uses the GTZAN dataset and an LSTM network to build a model for music genre recognition. Comparing the mean output distribution with the correct genre, the model achieves almost 67% accuracy.
For extracting musical patterns, MFCC (mel-frequency cepstral coefficient) features are used to represent the audio, as in the sketch below.
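Putting these pieces together, here is a hedged sketch of an MFCC-plus-LSTM genre classifier, using librosa for the features and Keras for the model. The ten classes mirror GTZAN's genres, but the hyperparameters are placeholders and the toy batch stands in for features computed from real clips.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_GENRES = 10   # GTZAN has ten genres (blues, classical, country, and so on)
N_MFCC = 20

def mfcc_sequence(path):
    """Load a clip and return its MFCC frames as a (time, n_mfcc) sequence."""
    audio, sr = librosa.load(path, sr=22050, duration=30)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=N_MFCC).T

# An LSTM reads the MFCC frames in order and outputs a genre distribution.
model = models.Sequential([
    layers.Input(shape=(None, N_MFCC)),   # variable-length MFCC sequences
    layers.LSTM(128),
    layers.Dense(N_GENRES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy batch standing in for MFCC sequences computed from the GTZAN clips.
X = np.random.rand(64, 1200, N_MFCC).astype("float32")   # ~30 s of frames each
y = np.random.randint(0, N_GENRES, size=(64,))
model.fit(X, y, epochs=1, batch_size=16)
```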
Scientists from Queen Mary University of London trained a neural network on over 6,000 songs spanning ballad, hip-hop, and dance genres, achieving almost 75% accuracy in song classification.
Neural networks have taken music to a whole new level, where physical instruments or vocals are no longer required to compose it. The future will bring more complex models and richer data representations that capture the underlying melodic structure, helping models create compelling artistic content on their own. The combination of music and technology will also foster a collaborative community of artists, coders, and deep learning researchers, leading to a tech-driven yet artistic future.