Spark architecture
Apache Spark is designed to simplify the laborious and sometimes error-prone task of highly parallelized, distributed computing. To understand how it does this, let's explore its history and identify what Spark brings to the table.
History of Spark
Apache Spark implements a type of data parallelism that seeks to improve upon the MapReduce paradigm popularized by Apache Hadoop. It extends MapReduce in four key areas:
- Improved programming model: Spark's APIs provide a higher level of abstraction than Hadoop's, creating a programming model that significantly reduces the amount of code that must be written. By introducing a fluent, side-effect-free, function-oriented API, Spark makes it possible to reason about an analytic in terms of its transformations and actions, rather than as a sequence of mappers and reducers. This makes analytics easier to understand and debug.
- Introduces workflow: Rather than chaining jobs together (by persisting results to disk...