Broadcast variables are variables shared across all executors. A broadcast variable is created once in the Driver and is then read-only on the executors. Broadcasting a simple datatype such as an Integer is easy to picture, but conceptually broadcast goes well beyond simple variables: entire datasets can be broadcast in a Spark cluster so that every executor has access to the data. All tasks running within an executor have access to the broadcast variables.
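As a minimal sketch of the idea above, the following PySpark snippet creates a broadcast variable on the Driver and reads it inside tasks. The lookup table, the sample records, and the names `expand_country` and `run_job` are hypothetical and chosen only for illustration; the sketch assumes PySpark is installed and runs Spark in local mode.

```python
def expand_country(record, lookup):
    """Replace a country code with its full name using a plain dict."""
    code, amount = record
    return (lookup.get(code, "unknown"), amount)


def run_job():
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")            # assumption: local mode, two worker threads
        .appName("broadcast-sketch")
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Hypothetical lookup table; a real one could be a far larger dataset.
    country_names = {"US": "United States", "DE": "Germany", "IN": "India"}

    # Created once on the Driver.
    names_bc = sc.broadcast(country_names)

    sales = sc.parallelize([("US", 10), ("DE", 5), ("FR", 7)])

    # Tasks read the broadcast data through .value and never modify it.
    result = sales.map(lambda rec: expand_country(rec, names_bc.value)).collect()

    names_bc.unpersist()  # drop the cached copies held by executors
    spark.stop()
    return result
```

Calling `run_job()` returns the sales records with country codes replaced by names. The tasks refer to `names_bc.value` rather than to `country_names` directly, so the table is shipped to each executor once instead of being serialized with every task.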
Broadcast uses various optimized methods to make the broadcast data accessible to all executors. This is an important challenge to solve: if the broadcast dataset is large, you cannot expect hundreds or thousands of executors to each connect to the Driver and pull the dataset. Rather, the executors pull the data via HTTP connection and the more recent addition which...