Chapter 5. Iterative Computation with Spark
In the previous chapter, we saw how Samza enables near real-time stream processing within Hadoop. This is quite a step away from the traditional batch processing model of MapReduce, but it keeps with the model of providing a well-defined interface against which business logic tasks can be implemented. In this chapter we will explore Apache Spark, which can be viewed both as a framework on which applications can be built and as a processing framework in its own right. Not only are applications being built on Spark, but entire components within the Hadoop ecosystem are also being reimplemented to use Spark as their underlying processing framework. In particular, we will cover the following topics:
- What Spark is and how its core system can run on YARN
- The data model provided by Spark that enables highly scalable and efficient data processing
- The breadth of additional Spark components and related projects
It's important...