Introducing Apache Spark
Hadoop and MapReduce (MR) have been around for a decade and have proven to be the best solution for processing massive data with high performance. However, MR performed poorly in iterative computing, where the output of every MR job has to be written to HDFS before the next job can consume it. Even within a single MR job, performance suffered because of inherent drawbacks of the MR framework.
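To make the contrast concrete, here is a minimal sketch of an iterative computation in Spark that keeps its working set in memory across iterations instead of re-reading input and writing intermediate results to HDFS between jobs. The input path, object name, and toy update rule are hypothetical, chosen only for illustration:

import org.apache.spark.sql.SparkSession

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Load the input once and cache it in memory; an equivalent
    // chain of MR jobs would re-read this data from HDFS in
    // every iteration. The path is hypothetical.
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(_.toDouble)
      .cache()

    var weight = 0.0
    for (_ <- 1 to 10) {
      // Each iteration reads the cached RDD from RAM; no
      // intermediate output is materialized to disk.
      weight += points.map(p => p * 0.01).sum() / points.count()
    }

    println(s"final weight: $weight")
    spark.stop()
  }
}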
Let's take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades.
The trend has been to Reference the URI when the network was cheaper (in the 1990s), Replicate when storage became cheaper (in the 2000s), and Recompute when memory became cheaper (in the 2010s), as shown in Figure 2.5.
Note
So, what really changed over time?
Tape is dead, disk has become tape, and SSD has almost become the disk. Now, caching data in RAM is the current trend.
Let's understand why memory-based computing is important and how it provides significant performance benefits.
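As a preview of that discussion, the following sketch shows what caching data in RAM looks like in practice with Spark's RDD API. The first action pays the cost of reading from disk and populates the cache; later actions on the same RDD are served from memory. The input path and counts are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachingSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path, pinned to memory-only storage.
    val logs = sc.textFile("hdfs:///logs/access.log")
      .persist(StorageLevel.MEMORY_ONLY)

    // First action: reads from disk and fills the in-memory cache.
    val total = logs.count()

    // Later actions on the same RDD are served from RAM.
    val errors = logs.filter(_.contains("ERROR")).count()

    println(s"total=$total, errors=$errors")
    spark.stop()
  }
}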