Spark v2.0 and beyond
Spark v2.0 and its successors have been the catalyst for a renaissance in data science! Datasets, DataFrames, ML pipelines, and new and improved algorithms in MLlib have paved the way for data wrangling at scale; a short code sketch after the list below illustrates these APIs. I think version 2.0 marks the spot where Spark turned into a mature framework, capable of handling huge workloads in terms of both the number of machines and the volume of data. The community update at Spark Summit 2015 in San Francisco included a slide that showed the power of Spark:
- The largest cluster: 8,000 nodes (Tencent)
- The largest single job: 1 petabyte and more (Alibaba and Tencent)
- The longest-running job: 1 petabyte and more for a week (Alibaba)
- The top streaming intake: 1 terabyte/hour (Janelia Farm)
- The largest shuffle: 1 petabyte during the sort benchmark (Databricks)
- Netflix uses Spark for ad hoc queries and experimentation; it runs more than 1,500 Spark nodes with 100 terabytes of memory, chugging through more than 15 petabytes of S3 data and 7 petabytes of Parquet data
- Tencent...
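To make the opening paragraph concrete, here is a minimal sketch of the Spark 2.0 APIs it names: the unified SparkSession entry point, a DataFrame built from an in-memory collection, and a small ML pipeline. The application name, the toy corpus, and the pipeline parameters below are illustrative assumptions, not taken from this chapter.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object Spark2Taster {
  def main(args: Array[String]): Unit = {
    // SparkSession: the unified entry point introduced in Spark 2.0,
    // replacing the separate SQLContext/HiveContext of the 1.x line
    val spark = SparkSession.builder()
      .appName("Spark2Taster")
      .master("local[*]") // local mode, purely for illustration
      .getOrCreate()
    import spark.implicits._

    // A DataFrame (a Dataset[Row]) from a made-up, in-memory mini-corpus
    val training = Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "bad movie", 0.0),
      (2L, "spark datasets are typed", 1.0),
      (3L, "boring plot", 0.0)
    ).toDF("id", "text", "label")

    // An ML pipeline: tokenize -> hash term frequencies -> logistic regression
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fit the whole pipeline as one estimator, then score an unseen row
    val model = pipeline.fit(training)
    model.transform(Seq((4L, "spark rocks")).toDF("id", "text"))
      .select("id", "text", "prediction")
      .show()

    spark.stop()
  }
}
```

The design point worth noting is that the pipeline itself acts as a single estimator: pipeline.fit returns one model that replays all three stages, unchanged, on any new DataFrame.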