Scaling out with PySpark – predicting year of song release
To close, let us look at another example using PySpark. This dataset is a subset of the Million Song dataset (Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 2011), and the goal is to predict the year of a song's release from the features of the track. The data is supplied as a comma-separated text file, which we can convert into an RDD using the Spark textFile() function. As in our earlier clustering example, we also define a parsing function with a try...except block so that a single malformed line does not cause the whole job to fail on a large dataset:
>>> def parse_line(l):
...     try:
...         return l.split(",")
...     except:
...         print("error in processing {0}".format(l))
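To see why the try...except block matters, the following is a minimal sketch in plain Python rather than Spark. The function parse_line_safe and the sample lines are hypothetical, made up for illustration and not taken from the dataset, and the sketch assumes every field should be numeric. Unlike the function above, it converts each field to a float, so a malformed line raises an error that is caught, reported, and returned as None.

```python
def parse_line_safe(line):
    """Split a comma-separated line into floats; return None if a field is malformed.

    Hypothetical variant of parse_line, assuming every field should be numeric.
    """
    try:
        return [float(field) for field in line.split(",")]
    except ValueError:
        # A non-numeric field: report the line and skip it instead of aborting.
        print("error in processing {0}".format(line))
        return None


# Made-up sample lines standing in for rows of the text file (an assumption).
sample_lines = ["2001,49.9,21.4", "1998,not_a_number,3.2", "2007,50.1,-1.8"]

# Parse every line, then keep only the rows that parsed cleanly.
parsed = [parse_line_safe(line) for line in sample_lines]
clean = [row for row in parsed if row is not None]
print(clean)
```

On an RDD, the same pattern would typically be a map() with the parsing function followed by a filter() that drops the None entries.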
We then...