Shared variables
We touched upon shared variables in Chapter 2, Transformations and Actions with Spark RDDs, but did not go into detail because this is a slightly advanced topic with many nuances around what can and cannot be shared. To briefly recap, we discussed two types of shared variables:
- Broadcast variables
- Accumulators
Broadcast variables
Spark is an MPP architecture in which multiple nodes work in parallel to carry out operations efficiently. As the name indicates, a broadcast variable is meant for cases where you want each node to have its own copy of an input or interim dataset, so you broadcast that data across the cluster. As we have already seen, Spark does some internal broadcasting of data while executing various actions. When you run an action on Spark, the job is broken down into a series of stages consisting of TaskSets, which are then executed in parallel on the executors. Data is distributed using shuffle operations and the common data needed by the tasks within...
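As a minimal sketch of explicit broadcasting, the following PySpark snippet ships a small lookup table to the executors once and reads it inside a map. The names COUNTRY_NAMES, ORDERS, enrich, and run_with_spark, as well as the sample data, are hypothetical and chosen only for illustration. The sketch assumes a working PySpark installation for the Spark part, and that the lookup table is small enough to fit in memory on every executor.

```python
# Hypothetical lookup table: small enough to copy to every executor.
COUNTRY_NAMES = {"US": "United States", "IN": "India", "DE": "Germany"}

# Hypothetical records of (order_id, country_code).
ORDERS = [("o1", "US"), ("o2", "IN"), ("o3", "FR")]


def enrich(order, lookup):
    """Replace the country code with its full name, or "Unknown" if absent."""
    order_id, code = order
    return (order_id, lookup.get(code, "Unknown"))


def run_with_spark():
    """Run the enrichment on Spark; requires a working PySpark installation."""
    # Imported here so that enrich() can be used and tested without Spark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    try:
        # Ship the lookup table to the executors once, as a read-only variable.
        names = sc.broadcast(COUNTRY_NAMES)
        # Tasks read the broadcast copy through .value instead of capturing
        # the dictionary in the closure of every task.
        orders = sc.parallelize(ORDERS)
        return orders.map(lambda o: enrich(o, names.value)).collect()
    finally:
        sc.stop()
```

Calling run_with_spark() on a machine with PySpark installed should return the enriched records as a list. Keeping enrich as a plain function that takes the lookup as an argument is a design choice for this sketch: the same logic can be checked locally with an ordinary dictionary before it is run against the broadcast value.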