Datasets to experiment with on your own
As in the previous chapter, we will be using datasets from the UCI Machine Learning Repository, in particular the bike-sharing dataset (a regression problem) and Forest Covertype Data (a multiclass classification problem).
If you have not done so already, or if you need to download both datasets again, you will need two functions defined in the Datasets to try the real thing yourself section of Chapter 2, Scalable Learning in Scikit-learn: unzip_from_UCI and gzip_from_UCI. Both connect from Python to the UCI repository, download a compressed file, and unpack it in the Python working directory. If you call the functions from an IPython cell, you will find the necessary new directories and files exactly where IPython will look for them.
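If you do not have Chapter 2 at hand, the general idea behind such helpers can be sketched as follows. This is only a minimal illustration for Python 3, not the code from Chapter 2: the names download_bytes, unzip_bytes, and gunzip_bytes are hypothetical, and the URL is left as a parameter that you have to supply yourself.

```python
import gzip
import io
import os
import zipfile
from urllib.request import urlopen


def download_bytes(url):
    """Return the raw bytes served at url (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def unzip_bytes(payload, dest_dir="."):
    """Extract an in-memory .zip archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def gunzip_bytes(payload, dest_path):
    """Decompress an in-memory .gz payload and write it to dest_path."""
    dest_dir = os.path.dirname(dest_path)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)
    with open(dest_path, "wb") as out:
        out.write(gzip.decompress(payload))
    return dest_path
```

Keeping the download step separate from the unpacking step means that you can reuse unzip_bytes or gunzip_bytes on a file you fetched by hand, for instance by reading it with open(path, "rb").read() and passing the result as payload.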
In case the functions do not work for you, never mind; we will provide you with the link for a direct download. After that, all you will have to do is unpack the data in the current...