In this chapter, we looked at how to build and prepare a dataset for training. First, we surveyed existing datasets and explained why some are more suitable than others for a specific use case. We then covered the LMD and the MSD, which are useful for their size and completeness, and datasets from the Magenta team, such as the MAESTRO dataset and the GMD. We also saw how external APIs such as Last.fm can be used to enrich existing datasets.
Then, we built a dance music dataset and used information contained in MIDI files to detect specific structures and instruments. We learned how to compute our results using multiprocessing and how to plot statistics about the resulting MIDI files.
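As a reminder of the multiprocessing pattern, here is a minimal sketch that uses only the standard library. It is not the chapter's own code: the names `check_file`, `scan`, and `summarize` are hypothetical, and the per-file step is deliberately trivial (it only checks for the `MThd` header chunk that starts every Standard MIDI File). In a real pipeline, the worker would instead open the file with a MIDI library and inspect its instruments or structure.

```python
import sys
from collections import Counter
from multiprocessing import Pool
from pathlib import Path

# Every Standard MIDI File begins with the 4-byte header chunk ID "MThd".
MIDI_MAGIC = b"MThd"


def check_file(path):
    """Worker run in a child process: label one file (hypothetical helper)."""
    try:
        with open(path, "rb") as f:
            header = f.read(4)
    except OSError:
        return (str(path), "unreadable")
    label = "midi" if header == MIDI_MAGIC else "not_midi"
    return (str(path), label)


def summarize(results):
    """Count how many files received each label."""
    return Counter(label for _, label in results)


def scan(paths, processes=None):
    """Apply check_file to every path using a pool of worker processes."""
    with Pool(processes) as pool:
        # chunksize batches the paths to reduce inter-process overhead.
        return pool.map(check_file, paths, chunksize=16)


# Usage (assumed): python scan_midi.py /path/to/dataset
if __name__ == "__main__" and len(sys.argv) == 2:
    midi_paths = [str(p) for p in Path(sys.argv[1]).rglob("*.mid")]
    print(summarize(scan(midi_paths)))
```

The worker returns a small tuple rather than the parsed file, which keeps the data sent back to the parent process cheap to serialize. The resulting counts can then be passed to a plotting library to produce statistics about the dataset.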
Next, we built a jazz dataset by extracting information from the LMD and using the Last.fm API to find the genre of each song. We also looked at how to find...