Getting started with Apache Spark
The term big data can rightly feel vague and imprecise. What is the cut-off for considering a dataset big data? Is it 10 GB, 100 GB, 1 TB, or more? One definition I like is this: data is big data when it no longer fits into the memory of a single machine. For years, data scientists were forced to sample large datasets down to a size that a single machine could handle. That started to change with parallel computing frameworks that distribute data across a cluster of machines, making it possible to work with a dataset in its entirety, provided the cluster has enough machines. At the same time, advances in cloud technologies made it possible to provision, on demand, a cluster sized to the dataset at hand.
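To make that idea concrete, here is a minimal PySpark sketch of what processing a dataset "in its entirety" can look like. It assumes a local installation of the pyspark package and an illustrative events.csv file with a country column (both are assumptions, not part of this text); the same code runs unchanged on a laptop or against a multi-node cluster.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session. "local[*]" uses every core on this
# machine; pointing the master at a cluster distributes the same work
# across many machines instead.
spark = (
    SparkSession.builder
    .appName("getting-started")
    .master("local[*]")
    .getOrCreate()
)

# Read a dataset that may be far larger than any one machine's memory.
# Spark splits it into partitions that are processed by executors in parallel.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# The aggregation runs on each partition independently and the partial
# results are combined, so the full dataset is never loaded into one process.
df.groupBy("country").count().show()

spark.stop()
```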
Today, multiple frameworks (most of them available as open source) provide robust, flexible parallel computing capabilities. Some of the most popular...