Introduction
In this chapter, we'll look at how to bundle our Spark application and deploy it in various distributed environments.
As we discussed earlier in Chapter 3, Loading and Preparing Data – DataFrame, the foundation of Spark is the RDD. From a programmer's perspective, the fact that RDDs compose just like regular Scala collections is a huge advantage. An RDD wraps three vital (and two subsidiary) pieces of information that allow its data to be reconstructed, and this is what enables fault tolerance. The other major advantage is that, even though RDD operations can be composed into hugely complex processing graphs, the overall flow of data remains easy to reason about.
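To make this concrete, the following is a minimal sketch of composing RDD transformations the same way we would chain operations on a Scala collection, and of printing the lineage that Spark keeps for fault tolerance. The local master URL, the application name, and the sample dataset are assumptions made for this sketch, not part of the original text:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageExample extends App {
  // A local SparkContext purely for illustration; the master URL and
  // application name here are assumptions for this sketch.
  val conf = new SparkConf().setMaster("local[2]").setAppName("LineageExample")
  val sc = new SparkContext(conf)

  // RDD transformations compose just like operations on a regular Scala collection
  val numbers = sc.parallelize(1 to 100)
  val evens = numbers.filter(_ % 2 == 0)
  val squares = evens.map(n => n * n)

  // toDebugString prints the lineage (dependency graph) Spark would use to
  // recompute lost partitions; this lineage is what provides fault tolerance
  println(squares.toDebugString)
  println(squares.sum())

  sc.stop()
}
```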
Other than optional optimization attributes, such as data location, an RDD at its core wraps only three vital pieces of information (see the sketch after this list):
The parent RDD(s) it depends on (empty when the RDD sits at the start of a lineage, for example, one created directly from a data source)
The number of partitions
The function that needs to be applied to each element of the RDD
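These pieces are visible through the public RDD API. The sketch below assumes an active SparkContext named sc, such as the one created in the previous example, and uses a hypothetical four-partition RDD purely for illustration:

```scala
// Assumes an active SparkContext named sc; the four-partition RDD is illustrative
val base = sc.parallelize(1 to 10, numSlices = 4)
val doubled = base.map(_ * 2)

// The parent (dependent) RDD(s): here a single one-to-one dependency on `base`
println(doubled.dependencies)

// The number of partitions, inherited from the parent RDD
println(doubled.getNumPartitions) // 4

// The third piece, the function applied to each element (_ * 2 above),
// is captured inside the map transformation itself rather than exposed directly
```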
Spark spawns one task per partition. So...