Spark follows a master-slave architecture, which allows it to scale on demand. Spark's architecture has two main components:
- Driver Program: The driver program is where a user writes Spark code using the Scala, Java, Python, or R APIs. It is responsible for launching various parallel operations on the cluster.
- Executor: An executor is a Java Virtual Machine (JVM) process that runs on a worker node of the cluster. Executors provide the CPU and memory resources for running the tasks launched by the driver program (see the sketch after this list).
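The following is a minimal sketch of a driver program in Scala, assuming the SparkSession API (Spark 2.x and later) and that the cluster master is supplied externally, for example via spark-submit. The object name `WordLengths`, the application name, and the sample data are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object WordLengths {
  def main(args: Array[String]): Unit = {
    // The driver program starts here: building a SparkSession connects to
    // the cluster manager, which in turn allocates executors on worker nodes.
    val spark = SparkSession.builder()
      .appName("word-lengths")
      .getOrCreate()

    // The transformation (map) and action (collect) below are broken into
    // tasks that the driver schedules on the executors.
    val lengths = spark.sparkContext
      .parallelize(Seq("spark", "driver", "executor"))
      .map(_.length)
      .collect()

    lengths.foreach(println)
    spark.stop()
  }
}
```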
As soon as a Spark job is submitted, the driver program launches various operations on each executor. The driver and executors together make up an application.
The following diagram demonstrates the relationships between the driver, workers, and executors. As the first step, the driver process parses the user code (the Spark program) and creates multiple executors on the worker nodes. The driver process not only forks the executors on the worker machines, but also sends tasks to these executors so that the entire application runs in parallel.
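In common deployments, how many executors are forked and how much CPU and memory each one gets is controlled by configuration that the driver passes to the cluster manager. The sketch below uses standard Spark properties; the specific values (4 instances, 2 cores, 2g of memory) are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of requesting executor resources at session creation time.
// The property names are standard Spark settings; the values are examples.
val spark = SparkSession.builder()
  .appName("executor-demo")
  .config("spark.executor.instances", "4") // executors forked across worker nodes
  .config("spark.executor.cores", "2")     // CPU cores per executor
  .config("spark.executor.memory", "2g")   // heap size per executor JVM
  .getOrCreate()
```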
Once the computation is completed, the output is either sent back to the driver program or saved to the file system.
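To make the two output paths concrete, the sketch below shows a result being collected back to the driver and, alternatively, written out to the file system by the executors. It assumes the `spark` session from the earlier sketch is in scope, and the HDFS path is hypothetical.

```scala
val numbers = spark.sparkContext.parallelize(1 to 100)
val squares = numbers.map(n => n * n)

// Option 1: bring the result back to the driver program.
val collected = squares.collect()

// Option 2: have the executors write the result to the file system.
squares.saveAsTextFile("hdfs:///tmp/squares") // output path is illustrative
```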