Shared variables – broadcast variables and accumulators
When working on distributed compute programs, where code executes on different nodes and/or different workers, the need often arises to share data across the execution units of the distributed setup. Spark addresses this with the concept of shared variables. Shared variables are used to share information between the parallel tasks running on various workers, or between the tasks and the driver. Spark supports two types of shared variable:
- Broadcast variables
- Accumulators
In the following sections, we will look at these two types of Spark shared variables, both conceptually and practically.
Broadcast variables
These are the variables that the programmer intends to share with all execution units throughout the cluster. Though they sound very simple to work with, there are a few aspects programmers need to be cognizant of when using broadcast variables: they need to be able to fit in the memory of each node in the...