Summary
This chapter covered various topics that govern the functioning of RDDs, such as partitioning, and then used advanced transformations and actions to meet specific requirements. We also looked at the limitations of sharing variables across executor nodes and how sharing can be achieved using broadcast variables and accumulators.
The next chapter introduces Spark SQL and related concepts such as DataFrames, Datasets, and UDFs. We'll also discuss SQLContext and the newly introduced SparkSession, and how its introduction has simplified working with the Hive metastore.