Launching Spark on Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. Amazon EC2 provides the following features:
- On-demand delivery of IT resources via the Internet
- Provisioning of as many instances as you like
- Payment only for the hours during which you use instances, much like a utility bill
- No setup cost, no installation, and no overhead at all
- When you no longer need instances, you simply shut them down or terminate them and walk away
- Availability of these instances on all familiar operating systems
EC2 provides different types of instances to meet various compute needs, such as general-purpose instances, micro instances, memory-optimized instances, storage-optimized instances, and others. Micro instances are available in a free tier so that you can try them.
Getting ready
The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.
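To see the full list of actions and options the script supports, you can ask it for help (assuming Spark is installed under /opt/infoobjects/spark, as elsewhere in this book):
$ /opt/infoobjects/spark/ec2/spark-ec2 --help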
Before you start, you need to do the following things:
- Log in to your Amazon AWS account (http://aws.amazon.com).
- Click on Security Credentials under your account name in the top-right corner.
- Click on Access Keys and then on Create New Access Key.
- Note down the access key ID and the secret access key.
- Now go to Services | EC2.
- Click on Key Pairs in the left-hand menu under NETWORK & SECURITY.
- Click on Create Key Pair and enter kp-spark as the key-pair name.
- Download the private key file and copy it into the /home/hduser/keypairs folder.
- Set the permissions on the key file to 600 (see the sketch after this list).
- Set environment variables to reflect the access key ID and secret access key (please replace the sample values with your own values):
$ echo "export AWS_ACCESS_KEY_ID=\"AKIAOD7M2LOWATFXFKQ\"" >> /home/hduser/.bashrc
$ echo "export AWS_SECRET_ACCESS_KEY=\"+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q\"" >> /home/hduser/.bashrc
$ echo "export PATH=$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc
How to do it...
- Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:
$ cd /home/hduser
$ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
- Launch the cluster with the example value:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster
Note
- <key-pair>: This is the name of the EC2 key-pair created in AWS.
- <key-file>: This is the private key file you downloaded.
- <num-slaves>: This is the number of slave nodes to launch.
- <cluster-name>: This is the name of the cluster.
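The script picks a default instance type for you (the default depends on the Spark version); if you want a specific one, it can usually be set with the --instance-type (or -t) option. The m3.large value below is only an illustrative example:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -t m3.large --hadoop-major-version 2 -s 3 launch spark-cluster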
- Sometimes, the default availability zones are not available; in that case, retry the request by explicitly specifying the availability zone you want:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2 -s 3 launch spark-cluster
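If a launch fails partway through (for example, because of a transient EC2 error), the script also accepts a --resume flag that continues setup on the already-launched instances instead of starting from scratch; check spark-ec2 --help to confirm it is available in your version:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster --resume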
- If your application needs to retain data after the instances shut down, attach an EBS volume to them (for example, with 10 GB of space):
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
- If you use Amazon spot instances, here's the way to do it:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
Note
Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real-time based on supply and demand (source: amazon.com).
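If you have the AWS command-line interface installed, you can look at recent spot prices before deciding what to bid; the region and instance type below are only examples:
$ aws ec2 describe-spot-price-history --region us-east-1 --instance-types m3.large --product-descriptions "Linux/UNIX"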
- After everything is launched, check the status of the cluster by going to the web UI URL that is printed at the end of the launch output.
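If you miss the URL in the launch output, the master's public hostname can be retrieved later with the get-master action; the standalone master web UI listens on port 8080 of that host:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem get-master spark-cluster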
- Now, to access the Spark cluster on EC2, let's connect to the master node using secure shell protocol (SSH):
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem login spark-cluster
This logs you in to the master node.
- Check the directories in the master node and see what each of them is for (a quick listing check follows this list):
- ephemeral-hdfs: This is the Hadoop instance for which data is ephemeral; it gets deleted when you stop or restart the machine.
- persistent-hdfs: Each node has a very small amount of persistent storage (approximately 3 GB). If you use this instance, data will be retained in that space.
- hadoop-native: These are native libraries to support Hadoop, such as the snappy compression libraries.
- scala: This is the Scala installation.
- shark: This is the Shark installation (Shark is no longer supported and has been replaced by Spark SQL).
- spark: This is the Spark installation.
- spark-ec2: These are files to support this cluster deployment.
- tachyon: This is the Tachyon installation.
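Once logged in to the master node, a quick listing confirms that these directories are present (names may vary slightly with the Spark version):
$ ls ~
# Expect entries such as ephemeral-hdfs, hadoop-native, persistent-hdfs, scala, shark, spark, spark-ec2, and tachyon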
- Check the HDFS version in an ephemeral instance:
$ ephemeral-hdfs/bin/hadoop version
Hadoop 2.0.0-cdh4.2.0
- Check the HDFS version in the persistent instance with the following command:
$ persistent-hdfs/bin/hadoop version
Hadoop 2.0.0-cdh4.2.0
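As a quick sanity check, you can write a small file into the ephemeral HDFS and list it back; the file and directory names below are arbitrary examples:
$ echo "hello spark" > /tmp/test.txt
$ ephemeral-hdfs/bin/hadoop fs -mkdir /data
$ ephemeral-hdfs/bin/hadoop fs -put /tmp/test.txt /data/
$ ephemeral-hdfs/bin/hadoop fs -ls /data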
- Change the logging level in the Spark configuration:
$ cd spark/conf
The default log level, INFO, is too verbose, so let's change it to ERROR:
- Create the log4j.properties file by renaming the template:
$ mv log4j.properties.template log4j.properties
- Open log4j.properties in vi or your favorite editor:
$ vi log4j.properties
- Change the second line from log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console (a non-interactive alternative using sed follows these steps).
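The same edit can also be made non-interactively with sed, which is convenient if you script your cluster setup (a minimal sketch, assuming you are still in the spark/conf directory):
$ sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' log4j.properties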
- Copy the configuration to all slave nodes after the change:
$ spark-ec2/copy-dir spark/conf
This copies the conf directory to all the slave nodes.
- Destroy the Spark cluster:
$ spark-ec2 destroy spark-cluster
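If you only want to pause the cluster rather than delete it, the spark-ec2 script also provides stop and start actions. Keep in mind that data in ephemeral-hdfs is lost when the instances stop, while data in the EBS-backed persistent-hdfs survives; the commands below are a sketch using the same example names as before:
$ spark-ec2 stop spark-cluster
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem start spark-cluster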