Submitting applications to a cluster
This recipe shows how to run an application on a distributed cluster. An application is launched on a set of machines with the help of an external service called a cluster manager. Spark supports a variety of cluster managers, such as Hadoop YARN, Apache Mesos, and Spark's own built-in standalone cluster manager. Spark provides a single tool for submitting applications to any of these cluster managers, called spark-submit. Through various options, spark-submit can connect to different cluster managers and control how many resources your application gets.
Getting ready
To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos.
How to do it…
- Let's create a word count application:
package org.apache.spark.programs

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf
    conf.setAppName("WordCount")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(Array("this,is,a,ball", "it,is,a,cat", "john,is,in,town,hall"))
    // Split each record on commas, pair each word with a count of 1, and sum the counts per word
    val words = input.flatMap(record => record.split(","))
    val wordPairs = words.map(word => (word, 1))
    val wordCounts = wordPairs.reduceByKey((a, b) => a + b)
    val result = wordCounts.collect
    println("Displaying the WordCounts:")
    result.foreach(println)
    sc.stop()
  }
}
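Before it can be submitted, the application must be compiled and packaged into a JAR, for example WordCount.jar. The following is a minimal build.sbt sketch, assuming you build with sbt; the Scala and Spark versions are illustrative and should match your cluster:

name := "wordcount"

version := "1.0"

scalaVersion := "2.11.8"

// Spark is provided by the cluster at runtime, so it is marked "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"

Running sbt package produces the JAR under target/ (with a generated name such as wordcount_2.11-1.0.jar); the steps below refer to it simply as WordCount.jar.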
- Submit the application to Spark's standalone cluster manager:
spark-submit --class org.apache.spark.programs.WordCount --master spark://master:7077 WordCount.jar
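Through additional options, you can also control the resources the standalone cluster grants to the application; the values below are illustrative:

spark-submit --class org.apache.spark.programs.WordCount --master spark://master:7077 --executor-memory 2G --total-executor-cores 8 WordCount.jar

Here, --executor-memory sets the memory per executor and --total-executor-cores caps the total number of cores the application may use across the cluster.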
- Submit the application to YARN:
spark-submit --class org.apache.spark.programs.WordCount --master yarn WordCount.jar
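On YARN, you can additionally pick the deploy mode and size the executors explicitly; the values below are illustrative:

spark-submit --class org.apache.spark.programs.WordCount --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 --executor-memory 2G WordCount.jar

With --deploy-mode cluster, the driver itself runs inside the YARN cluster rather than on the submitting machine; --num-executors, --executor-cores, and --executor-memory control how many executors are requested and how large each one is.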
- Submit the application to Mesos:
spark-submit --class org.apache.spark.programs.WordCount --master mesos://mesos-master:5050 WordCount.jar
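When submitting to Mesos in cluster deploy mode (which requires the MesosClusterDispatcher to be running), --supervise asks the cluster to restart the driver automatically if it fails with a non-zero exit code. A sketch of such a submission:

spark-submit --class org.apache.spark.programs.WordCount --master mesos://mesos-master:5050 --deploy-mode cluster --supervise WordCount.jar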
How it works…
When spark-submit is called with the --master flag set to spark://master:7077, it submits the application to Spark's standalone cluster. Invoking it with the --master flag set to yarn runs the application on the YARN cluster, whereas specifying the --master flag as mesos://mesos-master:5050 runs the application on the Mesos cluster.
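The same JAR can also be run locally for a quick test by passing a local master URL. Because the WordCount program does not hard-code a master with conf.setMaster, whatever is passed to --master takes effect:

spark-submit --class org.apache.spark.programs.WordCount --master local[*] WordCount.jar

Here, local[*] runs Spark in a single JVM using all available cores.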
There's more…
Whenever spark-submit is invoked, it launches the driver program. The driver program contacts the cluster manager and requests resources to launch executors. Once the executors are launched by the cluster manager, the driver runs through the user application, delegating work to the executors in the form of tasks. When the driver's main() method exits, it terminates the executors and releases the resources it held from the cluster manager. spark-submit also provides various options to control specific details, as the following example shows.
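A submission combining several of these options might look as follows; the values are illustrative:

spark-submit --class org.apache.spark.programs.WordCount --master spark://master:7077 --deploy-mode cluster --driver-memory 1G --conf spark.ui.port=4050 WordCount.jar

Here, --driver-memory sizes the driver process and --conf passes an arbitrary Spark property (in this case, the port for the application's web UI).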
See also
For more information on submitting applications to a cluster and the various options provided by spark-submit, please visit http://spark.apache.org/docs/latest/submitting-applications.html. Also, for detailed information about the different cluster managers, please refer to their respective documentation.