Summary
In this chapter, we explored how to execute data-intensive jobs on a cluster of machines to achieve parallel processing, which is essential for large-scale (big) data. We began by evaluating the cluster options available for data processing, providing a comparative analysis of Hadoop MapReduce and Apache Spark, the two main competing cluster computing platforms. The analysis showed that Apache Spark offers more flexibility in terms of supported languages and cluster management systems, and that its in-memory data processing model allows it to outperform Hadoop MapReduce for real-time data processing.
Once we had established that Apache Spark is the most appropriate choice for a variety of data processing applications, we looked into its fundamental data structure, the resilient distributed dataset (RDD). We discussed how to create RDDs from different data sources and introduced the two types of RDD operations: transformations and actions...