Cluster management
The Spark context, as you will see in many of the examples in this book, can be defined via a Spark configuration object and a Spark master URL. The Spark context connects to the Spark cluster manager, which then allocates resources, in the form of executors, across the worker nodes for the application. The application JAR file is copied to the workers, and finally tasks are allocated to the executors.
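As a rough illustration of the two inputs mentioned above, the following sketch models the configuration as a plain dictionary of Spark properties. The keys spark.master and spark.app.name are the property names Spark uses for the master URL and the application name; the helper name make_spark_properties and the example values are hypothetical, and a real application would use Spark's own configuration object rather than this stand-in.

```python
def make_spark_properties(master_url, app_name, extra=None):
    """Collect the settings a Spark context needs before it can connect.

    This is a stand-in for a real configuration object, not Spark's API.
    """
    if not master_url:
        raise ValueError("a master URL is required to reach a cluster manager")
    properties = {
        "spark.master": master_url,  # which cluster manager to contact
        "spark.app.name": app_name,  # name under which the application is listed
    }
    # Any further properties (hypothetical here) are merged in unchanged.
    properties.update(extra or {})
    return properties


props = make_spark_properties("local[2]", "cluster-demo")
for key, value in sorted(props.items()):
    print(f"{key}={value}")
```

The master URL is the only setting here that changes between the cluster manager options.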
The following subsections describe the possible Apache Spark cluster manager options available at this time.
Local
By specifying a local URL in the Spark configuration, it is possible to have the application run locally. By specifying local[n], it is possible to have Spark use n threads to run the application locally. This is a useful development and test option because you can test parallelization scenarios while keeping all log files on a single machine.
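The way a local URL maps to a thread count can be sketched as follows. This is a minimal illustration, not Spark's own parser: the function name local_thread_count is hypothetical, and the local[*] form (one thread per logical core) is not mentioned in the text above but is included on the assumption that the Spark version in use accepts it.

```python
import os
import re


def local_thread_count(master):
    """Return the number of worker threads implied by a local master URL."""
    if master == "local":
        return 1  # plain "local" runs the application with a single thread
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        raise ValueError(f"not a local master URL: {master!r}")
    spec = match.group(1)
    # "local[*]" (assumed supported) asks for one thread per logical core.
    return (os.cpu_count() or 1) if spec == "*" else int(spec)


print(local_thread_count("local[4]"))  # prints 4
```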
Standalone
Standalone mode uses a basic cluster manager that is supplied with Apache Spark. The Spark master URL will be as follows:
spark://<hostname>:7077
Here, <hostname> is the name of the host on which the Spark master is running. We have specified 7077 as the port, which is the default value, but this is configurable. This simple cluster manager currently supports only FIFO (first-in-first-out) scheduling. You can allow applications to run concurrently by setting the resource configuration options for each application; for instance, using spark.cores.max to cap the cores that each application takes, so that the cores are shared between applications.
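To see why a per-application core cap allows concurrency under FIFO scheduling, consider the toy model below. It is a sketch under stated assumptions, not Spark's scheduler: it assumes that an application without a cap claims every core that is still free, and the function name and application names are hypothetical.

```python
def fifo_core_allocation(total_cores, apps):
    """Grant cores to applications in submission order (first in, first out).

    apps is a list of (name, cores_max) pairs; cores_max=None means the
    application set no cap and takes every core that is still free.
    """
    free = total_cores
    granted = {}
    for name, cores_max in apps:
        wanted = free if cores_max is None else min(cores_max, free)
        granted[name] = wanted
        free -= wanted
    return granted


# No caps: the first application takes the whole cluster.
print(fifo_core_allocation(16, [("etl", None), ("report", None)]))
# prints {'etl': 16, 'report': 0}

# Each application capped at 8 cores: both can run at once.
print(fifo_core_allocation(16, [("etl", 8), ("report", 8)]))
# prints {'etl': 8, 'report': 8}
```

In this model, the second application receives cores only once the first one is capped, which is the effect the resource setting is meant to achieve.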
Apache YARN
At a larger scale, when integrating with Hadoop YARN, the Apache Spark cluster manager can be YARN, and the application can run in one of two modes. If the Spark master value is set as yarn-cluster, then the application can be submitted to the cluster and the client that submitted it can then terminate. The cluster will take care of allocating resources and running tasks. However, if the Spark master value is set as yarn-client, then the application stays alive on the client during the life cycle of processing, and requests resources from YARN.
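The difference between the two modes can be summarized in a small lookup, sketched below. The helper name describe_yarn_mode and the dictionary keys are hypothetical. The mapping to --master yarn with --deploy-mode reflects how later Spark releases express the same choice on the spark-submit command line; check it against the version you are running.

```python
# Master values from the text, mapped to the equivalent
# (--master, --deploy-mode) pair for spark-submit.
YARN_MASTERS = {
    "yarn-cluster": ("yarn", "cluster"),
    "yarn-client": ("yarn", "client"),
}


def describe_yarn_mode(master):
    """Summarize what a YARN master value implies for the submitting client."""
    if master not in YARN_MASTERS:
        raise ValueError(f"not a YARN master value: {master!r}")
    master_flag, deploy_mode = YARN_MASTERS[master]
    return {
        "submit_args": ["--master", master_flag, "--deploy-mode", deploy_mode],
        # In cluster mode the driver runs inside the cluster, so the
        # submitting client is free to exit; in client mode it must stay up.
        "driver_runs_on": deploy_mode,
        "client_may_exit_after_submit": deploy_mode == "cluster",
    }


for value in ("yarn-cluster", "yarn-client"):
    print(value, describe_yarn_mode(value))
```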
Apache Mesos
Apache Mesos is an open source system for resource sharing across a cluster. It allows multiple frameworks to share a cluster by managing and scheduling resources. It is a cluster manager that provides isolation using Linux containers and allows multiple systems such as Hadoop, Spark, Kafka, Storm, and more to share a cluster safely. It is highly scalable to thousands of nodes. It is a master/slave-based system and is fault tolerant, using ZooKeeper for configuration management.
For a single master node Mesos cluster, the Spark master URL will be in this form:
mesos://<hostname>:5050
Here, <hostname> is the hostname of the Mesos master server; the port is defined as 5050, which is the default Mesos master port (this is configurable). If there are multiple Mesos master servers in a large-scale high availability Mesos cluster, then the Spark master URL would look as follows:
mesos://zk://<hostname>:2181
So, the election of the Mesos master server will be controlled by ZooKeeper. The <hostname> will be the name of a host in the ZooKeeper quorum. Also, the port number, 2181, is the default client port for ZooKeeper.
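The two Mesos URL forms can be assembled with a small helper, sketched below. The function name and host names are hypothetical. Two details go beyond the text above and should be treated as assumptions: listing several ZooKeeper hosts separated by commas, and the optional znode_path suffix (for example /mesos), which depends on how the Mesos masters were configured.

```python
MESOS_MASTER_PORT = 5050  # default Mesos master port
ZOOKEEPER_PORT = 2181     # default ZooKeeper client port


def mesos_master_url(hosts, use_zookeeper=False, znode_path=""):
    """Build a Spark master URL for a Mesos cluster.

    hosts is a list of Mesos master hosts (exactly one) or, when
    use_zookeeper is True, the hosts of the ZooKeeper quorum.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if use_zookeeper:
        # Assumption: quorum members are listed comma-separated, optionally
        # followed by the ZooKeeper path under which Mesos registers.
        quorum = ",".join(f"{host}:{ZOOKEEPER_PORT}" for host in hosts)
        return f"mesos://zk://{quorum}{znode_path}"
    if len(hosts) != 1:
        raise ValueError("a single-master URL takes exactly one host")
    return f"mesos://{hosts[0]}:{MESOS_MASTER_PORT}"


print(mesos_master_url(["mesos1"]))
# prints mesos://mesos1:5050
print(mesos_master_url(["zk1", "zk2", "zk3"], use_zookeeper=True))
# prints mesos://zk://zk1:2181,zk2:2181,zk3:2181
```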