From Hadoop MapReduce to Spark
Let's take a look at the journey from MapReduce to Spark.
Problems with Hadoop MapReduce
Even though MapReduce provides a suitable programming model for batch data processing, it does not perform well with real-time data processing. When it comes to iterative machine learning algorithms, it is necessary to carry information across iterations. Moreover, an intermediate outcome needs to be persisted during each iteration. Therefore, it is necessary to store and retrieve temporary data from the Hadoop Distributed File System (HDFS) very frequently, which incurs significant performance degradation.
Machine learning algorithms that can be written in a certain form of summation (algorithms that fit in the statistical query model) can be implemented in the MapReduce programming model. However, some of the machine learning algorithms are hard to implement by adhering to the MapReduce programming paradigm. MapReduce cannot be applied if there are any computational dependencies between the data.
Therefore, this constrained programming model is a barrier for Apache Mahout as it can limit the number of supported distributed algorithms.
In-memory data processing with Spark and H2O
Apache Spark is a large-scale scalable data processing framework, which claims to be 100 times faster than Hadoop MapReduce when in memory and 10 times faster in disk, has a distributed memory-based architecture. H2O is an open source, parallel processing engine for machine learning by 0xdata.
As a solution to the problems of the Hadoop MapReduce approach mentioned previously, Apache Mahout is working on integrating Apache Spark and H2O as the backend integration (with the Mahout Math library).
Why is Mahout shifting from Hadoop MapReduce to Spark?
With Spark, there can be better support for iterative machine learning algorithms using the in-memory approach. In-memory applications are self-optimizing. An algebraic expression optimizer is used for distributed linear algebra. One significant example is the Distributed Row Matrix (DRM), which is a huge matrix partitioned by rows.
Further, programming with Spark is easier than programming with MapReduce because Spark decouples the machine learning logic from the distributed backend. Accordingly, the distribution is hidden from the machine learning API users. This can be used like R or MATLAB.