Big data puts data science projects under four points of view: volume (data quantity), velocity, variety, and veracity (is your data really representing what it should be or is it affected by some bias, distortion, or error?). The Scikit-learn package offers a range of classes and functions that will help you effectively work with data so large that it cannot entirely fit in the memory of a standard computer.
Before providing you with an overview of big data solutions, we have to create or import some datasets in order to give you a better idea of the scalability and performances of different algorithms. This will require about 1.5 gigabytes of your hard disk, which will be let free after the experiment.
(Not big data in itself—nowadays, it is hard to find computers with less than 4 GB of memory—yet, not even a toy dataset, it should provide...