We're now ready to launch a small Spark cluster by changing into the ec2 directory and then running the cluster launch command:
$ cd ec2
$ ./spark-ec2 --key-pair=rd_spark-user1 --identity-file=spark.pem \
--region=us-east-1 --zone=us-east-1a launch my-spark-cluster
This will launch a new Spark cluster called my-spark-cluster with one master and one slave node of the default instance type, m1.large. The cluster will be launched with a Spark version built for Hadoop 2. The key pair name we used is rd_spark-user1, and the key pair file is spark.pem (if you gave the files different names or have an existing AWS key pair, use that name instead).
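The spark-ec2 script also accepts flags that control the cluster size and instance type. As a rough sketch (the number of slaves and the instance type shown here are purely illustrative values, not recommendations), a larger cluster could be launched with something like the following:
$ ./spark-ec2 --key-pair=rd_spark-user1 --identity-file=spark.pem \
--region=us-east-1 --zone=us-east-1a \
--slaves=2 --instance-type=m3.medium launch my-spark-cluster
Keep in mind that larger instance types and more slaves will increase your AWS bill accordingly.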
It might take quite a while for the cluster to fully launch and initialize. You should see something like the following immediately after running the launch command:
Setting up security groups...
Creating security group my-spark-cluster-master
Creating security group my-spark-cluster-slaves
Searching for existing cluster my-spark-cluster in region
us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Launched 1 slave in us-east-1a, regid = r-5a893af2
Launched master in us-east-1a, regid = r-39883b91
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state...........
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-
1.amazonaws.com port 22: Connection refused
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-
1.amazonaws.com port 22: Connection refused
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-
1.amazonaws.com port 22: Connection refused
Cluster is now in 'ssh-ready' state. Waited 510 seconds.
If the cluster has launched successfully, you should eventually see a console output similar to the following listing:
./tachyon/setup.sh: line 5: /root/tachyon/bin/tachyon:
No such file or directory
./tachyon/setup.sh: line 9: /root/tachyon/bin/tachyon-start.sh:
No such file or directory
[timing] tachyon setup: 00h 00m 01s
Setting up rstudio
spark-ec2/setup.sh: line 110: ./rstudio/setup.sh:
No such file or directory
[timing] rstudio setup: 00h 00m 00s
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
ec2-52-91-214-206.compute-1.amazonaws.com
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-52-91-214-206.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd
/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so
into server: /etc/httpd/modules/mod_authz_core.so: cannot open
shared object file: No such file or directory [FAILED]
[timing] ganglia setup: 00h 00m 03s
Connection to ec2-52-90-110-128.compute-1.amazonaws.com closed.
Spark standalone cluster started at
http://ec2-52-90-110-128.compute-1.amazonaws.com:8080
Ganglia started at http://ec2-52-90-110-128.compute-
1.amazonaws.com:5080/ganglia
Done!
ubuntu@ubuntu:~/work/spark-1.6.0-bin-hadoop2.6/ec2$
This will create two VMs, the Spark master and the Spark slave, both of type m1.large, as shown in the following screenshot:
To test whether we can connect to our new cluster, we can run the following command:
$ ssh -i spark.pem root@ec2-52-90-110-128.compute-1.amazonaws.com
Remember to replace the public domain name of the master node (the address after root@ in the preceding command) with the correct Amazon EC2 public domain name that will be shown in your console output after launching the cluster.
You can also retrieve your cluster's master public domain name by running the following command:
$ ./spark-ec2 -i spark.pem get-master my-spark-cluster
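Alternatively, the spark-ec2 script provides a login action that looks up the master's address and opens an SSH session to it in a single step; assuming the same key pair and cluster name as before, it would look like this:
$ ./spark-ec2 -k rd_spark-user1 -i spark.pem login my-spark-cluster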
After successfully running the ssh command, you will be connected to your Spark master node in EC2, and your terminal output should match the following screenshot:
We can test whether our cluster is correctly set up with Spark by changing into the Spark directory and running an example in local mode:
$ cd spark
$ MASTER=local[2] ./bin/run-example SparkPi
You should see output similar to what you would get by running the same command on your local computer:
...
14/01/30 20:20:21 INFO SparkContext: Job finished: reduce at
SparkPi.scala:35, took 0.864044012 s
Pi is roughly 3.14032
...
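SparkPi also accepts an optional argument for the number of partitions (slices) to split the work into, which defaults to 2; if you want to experiment, passing a value such as 10 (an arbitrary choice here) should work:
$ MASTER=local[2] ./bin/run-example SparkPi 10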
Now that we have an actual cluster with multiple nodes, we can test Spark in cluster mode. We can run the same example on the cluster, using our one slave node, by passing in the master URL instead of the local version:
$ MASTER=spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077 \
./bin/run-example SparkPi
Note that you will need to replace the preceding master domain name with the correct domain name for your specific cluster.
Again, the output should be similar to running the example locally; however, the log messages will show that your driver program has connected to the Spark master:
...
14/01/30 20:26:17 INFO client.Client$ClientActor: Connecting to
master spark://ec2-54-220-189-136.eu-
west-1.compute.amazonaws.com:7077
14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend:
Connected to Spark cluster with app ID app-20140130202617-0001
14/01/30 20:26:17 INFO client.Client$ClientActor: Executor added:
app-20140130202617-0001/0 on worker-20140130201049-
ip-10-34-137-45.eu-west-1.compute.internal-57119
(ip-10-34-137-45.eu-west-1.compute.internal:57119) with 1 cores
14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend:
Granted executor ID app-20140130202617-0001/0 on hostPort
ip-10-34-137-45.eu-west-1.compute.internal:57119 with 1 cores,
2.4 GB RAM
14/01/30 20:26:17 INFO client.Client$ClientActor:
Executor updated: app-20140130202617-0001/0 is now RUNNING
14/01/30 20:26:18 INFO spark.SparkContext: Starting job: reduce at
SparkPi.scala:39
...
Feel free to experiment with your cluster. Try out the interactive console in Scala, for example:
$ ./bin/spark-shell --master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077
Once you've finished, type exit to leave the console. You can also try the PySpark console by running the following command:
$ ./bin/pyspark --master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077
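You can also submit a non-interactive job to the cluster with spark-submit instead of using the consoles. Assuming the Python examples are present under examples/src/main/python in your Spark directory (they are included in the standard binary distributions), a run might look like the following, where 10 is again an arbitrary number of partitions:
$ ./bin/spark-submit \
--master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077 \
examples/src/main/python/pi.py 10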
You can use the Spark master web interface to see the applications registered with the master. To load the master web UI, navigate to http://ec2-52-90-110-128.compute-1.amazonaws.com:8080 (again, remember to replace this domain name with your own master domain name).
Remember that you will be charged by Amazon for usage of the cluster. Don't forget to stop or terminate this test cluster once you're done with it. To do this, you can first exit the ssh session by typing exit to return to your own local system and then run the following command:
$ ./spark-ec2 -k rd_spark-user1 -i spark.pem destroy my-spark-cluster
You should see the following output:
Are you sure you want to destroy the cluster my-spark-cluster?
The following instances will be terminated:
Searching for existing cluster my-spark-cluster...
Found 1 master(s), 1 slaves
> ec2-54-227-127-14.compute-1.amazonaws.com
> ec2-54-91-61-225.compute-1.amazonaws.com
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster my-spark-cluster (y/N): y
Terminating master...
Terminating slaves...
Type y and then press Enter at the prompt to destroy the cluster.
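If you would rather pause the cluster than destroy it, spark-ec2 also provides stop and start actions. As a sketch, assuming the same cluster name and key pair as before (and keeping in mind that data on the nodes' ephemeral storage is still lost when the instances stop, and that some AWS charges can continue to accrue for stopped resources), the commands would look like this:
$ ./spark-ec2 stop my-spark-cluster
$ ./spark-ec2 -k rd_spark-user1 -i spark.pem start my-spark-cluster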
Congratulations! You've just set up a Spark cluster in the cloud, run a fully parallel example program on this cluster, and terminated it. If you would like to try out any of the example code in the subsequent chapters (or your own Spark programs) on a cluster, feel free to experiment with the Spark EC2 scripts and launch a cluster of your chosen size and instance profile. (Just be mindful of the costs and remember to shut it down when you're done!)