Spark pragmatic concepts
You know what appeals most to us developers? The ability to tap into a framework and the flexibility to extend it to suit our needs. In today's world of abstraction and decoupling, this is addressed by the variety of APIs that come out of the box.
We have talked enough about the latency problem the big data world struggled with before Spark arrived and took performance to the next level. Let's take a closer look at this latency problem. The following diagram captures the execution of a typical chain of Hadoop jobs and its intermediate steps:
As depicted, the Hadoop ecosystem leans heavily on HDFS (a disk-based, distributed, stable storage layer) to store intermediate processing results:
- Job #1: This reads the input data from HDFS and writes its results back to HDFS
- Job #2: This reads the interim results of Job #1 from HDFS, processes them, and writes its outcome back to HDFS (this handoff is sketched in code below)
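To make the contrast concrete, here is a minimal sketch of that same two-job handoff expressed with Spark's RDD API. The input and output paths, the word-count logic, and the SparkSession settings are all illustrative assumptions, not taken from the diagram:

```scala
import org.apache.spark.sql.SparkSession

object TwoJobPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-job-pipeline")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Mark the intermediate result for in-memory reuse up front (Spark-style).
    val counts = sc.textFile("hdfs:///data/input")        // hypothetical path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()

    // Job #1 equivalent: materialise the counts and persist them to HDFS.
    counts.saveAsTextFile("hdfs:///data/job1-output")     // hypothetical path

    // Job #2 equivalent: in plain Hadoop this step would have to re-read
    // hdfs:///data/job1-output from disk; Spark instead serves the cached
    // partitions straight from memory, skipping the disk round trip.
    counts.sortBy(_._2, ascending = false)
      .take(10)
      .foreach(println)

    spark.stop()
  }
}
```

The cache() call is what changes the picture: every action after the first reuses the in-memory partitions instead of paying another HDFS read, which is precisely the round trip each chained Hadoop job incurs.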
While HDFS is a fault-tolerant and persistent store...