How do Spark applications work?
A Spark application runs on a Spark cluster, a connected group of nodes that can be virtual machines (VMs) or bare-metal servers. Architecturally, a cluster has one driver node and one or more executors. The driver controls the executors, sending them the instructions defined in your Spark application; generally, the driver never actually touches the data you are processing. The executors are where the data is manipulated, following the instructions they receive from the driver. This is depicted in the following diagram:
Figure 3.1 – Spark driver and executor architecture
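To make this division of labor concrete, here is a minimal PySpark sketch. The file path and column names are illustrative, not from a real dataset. The key point is that the driver only records the transformations; the executors do the work when an action runs:

```python
from pyspark.sql import SparkSession

# This code runs on the driver: it builds the session and defines
# the transformations, but it does not itself touch the data.
spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

# Illustrative path and columns; substitute your own dataset.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

# Transformations are only recorded by the driver. The executors perform
# the actual filtering and counting once an action triggers execution.
result = df.filter(df["status"] == "ERROR").groupBy("status").count()

result.show()  # action: the driver schedules tasks on the executors
```

When `show()` is called, the driver breaks the job into tasks and distributes them across the executors, each of which processes its own partition of the data.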
Note that the following calculations assume linear scalability, which is not always the case. The actual gain from distributing the work across many nodes depends on the nature of the data and the transformations applied to it.
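As a hypothetical illustration of what linear scalability would mean, consider the following sketch; the runtimes are invented for the example:

```python
# Assumed runtime on a single executor, in minutes (illustrative value).
single_node_minutes = 80
executors = 8

# Under perfectly linear scaling, runtime divides evenly by executor count.
ideal_minutes = single_node_minutes / executors
print(f"Ideal runtime with {executors} executors: {ideal_minutes} minutes")

# In practice, shuffles, skewed partitions, and coordination overhead
# usually make the real runtime higher than this ideal figure.
```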
On open source Spark, you can configure the number of executors...
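As a minimal sketch of one way to do this, the standard Spark properties `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` can be set when building the session (the values below are illustrative, and `spark.executor.instances` applies on resource managers such as YARN or Kubernetes):

```python
from pyspark.sql import SparkSession

# Illustrative settings; the right values depend on your cluster size
# and workload.
spark = (
    SparkSession.builder
    .appName("configured-executors")
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.memory", "4g")    # memory per executor
    .getOrCreate()
)
```

The same properties can also be passed on the command line via `spark-submit --conf`, which keeps the application code free of cluster-specific settings.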