2.5. SPARQL on Apache Spark
2.5.1. Apache Spark
Apache Spark [ZAH 10] is a cluster computing engine that can be understood as a main-memory extension of the MapReduce model, enabling parallel computations on unreliable machines with automatic locality-aware scheduling, fault tolerance and load balancing. While both Spark and Hadoop are based on a data-flow computation model, Spark is more efficient than Hadoop for applications that frequently reuse working data sets across multiple parallel operations. This efficiency is mainly due to two complementary distributed main-memory data abstractions, as shown in Figure 2.6: (i) Resilient Distributed Datasets (RDDs) [ZAH 12], a distributed, lineage-supported, fault-tolerant memory abstraction for in-memory computations (whereas Hadoop is mainly disk-based), and (ii) DataFrames (DFs), a compressed, schema-enabled data abstraction. Both abstractions ease the programming task by natively supporting a subset of relational operators...
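To make the contrast between the two abstractions concrete, the following minimal sketch, written against the standard Spark Scala API, first caches an RDD of parsed triples so that two actions reuse the same in-memory working set (rather than re-reading the input, as a disk-based Hadoop job would), and then converts it to a DataFrame to apply relational operators such as selection and projection. The input path, triple layout and FOAF predicate are illustrative assumptions, not taken from the text.

import org.apache.spark.sql.SparkSession

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe-sketch")
      .getOrCreate()
    import spark.implicits._

    // RDD side: a lineage-tracked, fault-tolerant distributed collection.
    // cache() pins the parsed working set in cluster memory, so the two
    // actions below reuse it instead of re-reading and re-parsing the file.
    val triples = spark.sparkContext
      .textFile("hdfs:///data/triples.nt")           // hypothetical input path
      .map(_.split(" ", 3))                          // naive N-Triples split
      .collect { case Array(s, p, o) => (s, p, o) }  // keep well-formed lines only
      .cache()

    val subjects   = triples.map(_._1).distinct().count()
    val predicates = triples.map(_._2).distinct().count()
    println(s"distinct subjects: $subjects, distinct predicates: $predicates")

    // DataFrame side: the same data with a schema, so Spark can store it
    // compactly and run relational operators through its query optimizer.
    val df = triples.toDF("subject", "predicate", "object")

    df.filter($"predicate" === "<http://xmlns.com/foaf/0.1/knows>")  // selection
      .select("subject", "object")                                   // projection
      .show()

    spark.stop()
  }
}

Should a failure occur on one machine, Spark does not need a replica of the cached partitions: the lineage recorded for the RDD (read file, split, keep well-formed lines) suffices to recompute the lost partitions on another node, which is precisely the fault-tolerance mechanism of [ZAH 12].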