Chapter 7. Faster than Hadoop - Spark with R
In Chapter 4, Hadoop and MapReduce Framework for R, you learned about the Hadoop and MapReduce frameworks, which enable users to process and analyze massive datasets stored in the Hadoop Distributed File System (HDFS). We launched a multi-node Hadoop cluster to run heavy data crunching jobs in the R language that would not otherwise be achievable on an average personal computer with any of the R distributions installed. We also noted that although Hadoop is extremely powerful, its rather slow processing means it is generally recommended only for data that greatly exceeds available memory. In this chapter we present the Apache Spark engine, a faster way to process and analyze Big Data. After reading this chapter, you should be able to:
- Understand and appreciate Spark characteristics and functionalities
- Deploy a fully operational, multi-node Microsoft Azure HDInsight cluster with Hadoop, Spark, and Hive fully configured...