The AMPLab at Berkeley evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications.
- Algorithms used: logistic regression and k-means
- Use cases: first iteration and subsequent (multiple) iterations
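The iterative use case is what these experiments stress: the first pass over the data pays the cost of reading and parsing the input, while later passes reuse the cached RDD. The following is a minimal Scala sketch of that pattern for logistic regression; the input path, feature count, and iteration count are illustrative assumptions, not the benchmark's actual configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of iterative logistic regression on an RDD (not the exact
// benchmark code). Path, dimensionality, and iteration count are assumptions.
object IterativeLR {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeLR"))

    // Parse once and cache: the first iteration pays the HDFS read and
    // deserialization cost; subsequent iterations run against memory.
    val points = sc.textFile("hdfs:///data/lr_points.txt")   // assumed location
      .map { line =>
        val t = line.split(' ').map(_.toDouble)
        (t.init, t.last)                                      // (features, label)
      }
      .cache()

    val dims = 10                                             // assumed feature count
    var w = Array.fill(dims)(0.0)                             // model weights

    for (_ <- 1 to 10) {                                      // fixed iteration count
      val gradient = points.map { case (x, y) =>
        val margin = (w, x).zipped.map(_ * _).sum
        val scale  = (1.0 / (1.0 + math.exp(-y * margin)) - 1.0) * y
        x.map(_ * scale)
      }.reduce((a, b) => (a, b).zipped.map(_ + _))

      w = (w, gradient).zipped.map(_ - _)                     // unit-step gradient update
    }

    println(s"Final weights: ${w.mkString(", ")}")
    sc.stop()
  }
}
```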
All the tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM each. HDFS was used for storage, with 256 MB blocks. Refer to the following graph:

The preceding graph shows the comparison between the performance of Hadoop and Spark for the first and subsequent iterations of logistic regression:

The preceding graph shows the comparison between the performance of Hadoop and Spark for the first and subsequent iterations of the k-means clustering algorithm.
The overall results show the following:
- Spark outperforms Hadoop by up to 20 times in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
- Applications written by users performed and scaled well. For example, Spark sped up an analytics report that had been running on Hadoop by 40 times.
- When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
- Spark can be used to query a 1 TB dataset interactively with latencies of 5-7 seconds, as sketched below.
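The interactive result relies on the same caching idea, applied from the shell: load the dataset into memory once, then run ad hoc queries against it. A minimal sketch of that pattern in spark-shell follows; the dataset path and filter predicates are purely illustrative.

```scala
// Typed into spark-shell; the path and predicates are illustrative assumptions.
val dataset = sc.textFile("hdfs:///wiki/pagecounts")   // a large text dataset
  .cache()                                             // keep partitions in memory across queries

// The first action materializes the cache; later ad hoc queries scan memory only.
dataset.count()
dataset.filter(_.contains("Spark")).count()
dataset.filter(_.startsWith("en ")).take(10)
```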
Spark versus Hadoop for a sort benchmark: In 2014, the Databricks team participated in the Sort Benchmark (http://sortbenchmark.org/), sorting a 100 TB dataset. The Hadoop result was achieved in a dedicated data center, while Spark ran on a cluster of just over 200 EC2 nodes, with HDFS as the distributed storage layer.
Spark was 3 times faster than Hadoop and used 10 times fewer machines. Refer to the following graph:

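The benchmark entry itself was a tuned, purpose-built program, but the core operation it exercises maps onto Spark's sortByKey. The following is a rough sketch of a cluster-wide sort over text records, assuming a 10-byte key prefix per line and illustrative input and output paths; it shows only the shape of the computation, not the benchmark code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of a distributed sort; the record layout and paths are assumptions.
object TextSort {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TextSort"))

    sc.textFile("hdfs:///sort/input")                 // assumed input location
      .map(line => (line.take(10), line.drop(10)))    // split each record into (key, payload)
      .sortByKey()                                    // range-partitioned sort across the cluster
      .map { case (k, v) => k + v }                   // reassemble the record
      .saveAsTextFile("hdfs:///sort/output")          // assumed output location

    sc.stop()
  }
}
```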