The AMPLab at UC Berkeley evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as through benchmarks of user applications.
- Algorithms used: Logistic Regression and k-means
- Use cases: first iteration and multiple iterations
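The distinction between the first iteration and later iterations matters because an iterative algorithm reads the same dataset on every pass. The following is a minimal sketch in plain Python, not Spark code, that assumes a hypothetical in-memory list of JSON strings (`RAW_RECORDS`) as a stand-in for data on disk. It runs a simple gradient-descent logistic regression twice: once re-parsing the records on every iteration, and once parsing them a single time and reusing the parsed objects, which is roughly the pattern that caching a dataset in memory enables.

```python
import json
import math

# Hypothetical stand-in for serialized records stored on disk.
RAW_RECORDS = [
    json.dumps({"x": [1.0, float(i % 5)], "y": 1 if i % 5 >= 3 else 0})
    for i in range(50)
]

parse_calls = 0  # counts how many records were deserialized


def parse(line):
    """Deserialize one record into (features, label)."""
    global parse_calls
    parse_calls += 1
    rec = json.loads(line)
    return rec["x"], rec["y"]


def gradient(w, points):
    """Gradient of the logistic loss summed over all points."""
    g = [0.0] * len(w)
    for x, y in points:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xj in enumerate(x):
            g[j] += (p - y) * xj
    return g


def train(iterations, cache):
    """Run gradient descent; return the weights and the number of parses."""
    global parse_calls
    parse_calls = 0
    w = [0.0, 0.0]
    # With caching, records are parsed once, before the first iteration.
    cached = [parse(r) for r in RAW_RECORDS] if cache else None
    for _ in range(iterations):
        # Without caching, every iteration deserializes the data again.
        points = cached if cache else [parse(r) for r in RAW_RECORDS]
        g = gradient(w, points)
        w = [wi - 0.01 * gi for wi, gi in zip(w, g)]
    return w, parse_calls


w_reread, parses_reread = train(iterations=10, cache=False)
w_cached, parses_cached = train(iterations=10, cache=True)
print(parses_reread, parses_cached)
```

Both runs produce the same weights, but the re-reading variant deserializes ten times as many records over ten iterations. The toy numbers say nothing about real cluster performance; they only illustrate why the cost of the first pass and of later passes can differ.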
All the tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. HDFS was used for storage, with 256 MB blocks. Refer to the following graph:
The preceding graph shows the comparison between the performance of Hadoop and Spark for the first and subsequent iterations of Logistic Regression:
The preceding graph shows the comparison between the performance of Hadoop and Spark for the first and subsequent iterations of the k-means clustering algorithm.
The overall results show the following:
- Spark outperforms Hadoop by up to 20 times in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
- Applications written by users perform and scale well. In one case, Spark sped up an analytics report that was running on Hadoop by 40 times.
- When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
- Spark can be used to query a 1 TB dataset interactively with latencies of 5-7 seconds.
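The recovery behavior in the list above, rebuilding only the lost RDD partitions, can be illustrated with a toy model. This is a sketch in plain Python rather than Spark's actual implementation: the `LineageDataset` class, the `SOURCE` dictionary, and the partition layout are all hypothetical. The idea it demonstrates is that a dataset which remembers how each partition is derived from its source can recompute a single missing partition without touching the others.

```python
# Hypothetical source data split into four partitions (a stand-in for HDFS blocks).
SOURCE = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9], 3: [10, 11, 12]}


class LineageDataset:
    """Toy dataset that records how each partition is derived (its lineage)."""

    def __init__(self, source, transform):
        self.source = source        # parent data, keyed by partition id
        self.transform = transform  # per-element function applied to the parent
        self.partitions = {}        # materialized partitions held "in memory"
        self.compute_log = []       # ids of partitions that were (re)computed

    def compute(self, pid):
        """Build one partition from its parent partition."""
        self.compute_log.append(pid)
        self.partitions[pid] = [self.transform(v) for v in self.source[pid]]

    def collect(self):
        """Return all elements, computing only the partitions that are missing."""
        for pid in self.source:
            if pid not in self.partitions:
                self.compute(pid)
        return [v for pid in sorted(self.partitions) for v in self.partitions[pid]]


squares = LineageDataset(SOURCE, lambda v: v * v)
before = squares.collect()       # first pass computes all four partitions

squares.compute_log.clear()
del squares.partitions[2]        # simulate a node failure losing partition 2

after = squares.collect()        # only the lost partition is rebuilt
recomputed = list(squares.compute_log)
print(recomputed)
```

After the simulated failure, the collected result is unchanged and only partition 2 appears in the recomputation log. A real system also has to handle transformations whose partitions depend on several parent partitions, which this sketch deliberately leaves out.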
Spark versus Hadoop for a sort benchmark--In 2014, the Databricks team participated in the Sort Benchmark (http://sortbenchmark.org/), sorting a 100 TB dataset. Hadoop ran in a dedicated data center, while the Spark cluster of over 200 nodes ran on EC2, using HDFS as its distributed storage.
Spark was 3 times faster than Hadoop and used 10 times fewer machines. Refer to the following graph: