Launching Spark on Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. Amazon EC2 provides the following features:
- On-demand delivery of IT resources via the Internet
- Provisioning of as many instances as you like
- Payment only for the hours during which you use instances, much like a utility bill
- No setup cost, no installation, and no overhead at all
- When you no longer need instances, you simply shut them down or terminate them and walk away
- Availability of these instances on all familiar operating systems
EC2 provides different types of instances to meet various compute needs, such as general-purpose instances, micro instances, memory-optimized instances, storage-optimized instances, and others. Micro instances are available in a free tier so that you can try them.
Getting ready
The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.
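To see the full list of actions and options the script supports, you can ask it for help (assuming Spark is installed under /opt/infoobjects/spark, as elsewhere in this book):
$ /opt/infoobjects/spark/ec2/spark-ec2 --help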
Before you start, you need to do the following things:
- Log in to your Amazon AWS account (http://aws.amazon.com).
- Click on Security Credentials under your account name in the top-right corner.
- Click on Access Keys and then on Create New Access Key.
- Note down the access key ID and the secret access key.
- Now go to Services | EC2.
- Click on Key Pairs in the left-hand menu under NETWORK & SECURITY.
- Click on Create Key Pair and enter kp-spark as the key-pair name.
- Download the private key file and copy it into the /home/hduser/keypairs folder.
- Set the permissions on the key file to 600 (see the sketch after this list).
- Set environment variables to reflect the access key ID and secret access key (please replace the sample values with your own values):
$ echo "export AWS_ACCESS_KEY_ID=\"AKIAOD7M2LOWATFXFKQ\"" >> /home/hduser/.bashrc
$ echo "export AWS_SECRET_ACCESS_KEY=\"+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q\"" >> /home/hduser/.bashrc
$ echo "export PATH=$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc
How to do it...
- Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:
$ cd /home/hduser
$ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
- Launch the cluster with the example value:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster
Note
- <key-pair>: This is the name of the EC2 key-pair created in AWS.
- <key-file>: This is the private key file you downloaded.
- <num-slaves>: This is the number of slave nodes to launch.
- <cluster-name>: This is the name of the cluster.
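The script picks a default instance type for you (the default depends on the Spark version); if you want a specific one, it can usually be set with the --instance-type (or -t) option. The m3.large value below is only an illustrative example:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -t m3.large --hadoop-major-version 2 -s 3 launch spark-cluster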
- Sometimes, the default availability zones are not available; in that case, retry the request by explicitly specifying the availability zone you want:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2 -s 3 launch spark-cluster
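If a launch fails partway through (for example, because of a transient EC2 error), the script also accepts a --resume flag that continues setup on the already-launched instances instead of starting from scratch; check spark-ec2 --help to confirm it is available in your version:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster --resume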
- If your application needs to retain data after the instances shut down, attach an EBS volume to them (for example, with 10 GB of space):
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
- If you use Amazon spot instances, here's the way to do it:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
Note
Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real-time based on supply and demand (source: amazon.com).
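If you have the AWS command-line interface installed, you can look at recent spot prices before deciding what to bid; the region and instance type below are only examples:
$ aws ec2 describe-spot-price-history --region us-east-1 --instance-types m3.large --product-descriptions "Linux/UNIX"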
- After everything is launched, check the status of the cluster by going to the web UI URL that is printed at the end of the launch output.
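If you miss the URL in the launch output, the master's public hostname can be retrieved later with the get-master action; the standalone master web UI listens on port 8080 of that host:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem get-master spark-cluster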
- Now, to access the Spark cluster on EC2, let's connect to the master node using secure shell protocol (SSH):
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem login spark-cluster
This logs you in to the master node.
- Check the directories in the master node and see what each of them is for (a quick listing check follows this list):
- ephemeral-hdfs: This is the Hadoop instance for which data is ephemeral; it gets deleted when you stop or restart the machine.
- persistent-hdfs: Each node has a very small amount of persistent storage (approximately 3 GB). If you use this instance, data will be retained in that space.
- hadoop-native: These are native libraries to support Hadoop, such as the snappy compression libraries.
- scala: This is the Scala installation.
- shark: This is the Shark installation (Shark is no longer supported and has been replaced by Spark SQL).
- spark: This is the Spark installation.
- spark-ec2: These are files to support this cluster deployment.
- tachyon: This is the Tachyon installation.
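Once logged in to the master node, a quick listing confirms that these directories are present (names may vary slightly with the Spark version):
$ ls ~
# Expect entries such as ephemeral-hdfs, hadoop-native, persistent-hdfs, scala, shark, spark, spark-ec2, and tachyon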
- Check the HDFS version in an ephemeral instance:
$ ephemeral-hdfs/bin/hadoop version
Hadoop 2.0.0-cdh4.2.0
- Check the HDFS version in the persistent instance with the following command:
$ persistent-hdfs/bin/hadoop version
Hadoop 2.0.0-cdh4.2.0
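As a quick sanity check, you can write a small file into the ephemeral HDFS and list it back; the file and directory names below are arbitrary examples:
$ echo "hello spark" > /tmp/test.txt
$ ephemeral-hdfs/bin/hadoop fs -mkdir /data
$ ephemeral-hdfs/bin/hadoop fs -put /tmp/test.txt /data/
$ ephemeral-hdfs/bin/hadoop fs -ls /data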
- Change the logging level in the Spark configuration:
$ cd spark/conf
The default log level, INFO, is too verbose, so let's change it to ERROR:
- Create the log4j.properties file by renaming the template:
$ mv log4j.properties.template log4j.properties
- Open log4j.properties in vi or your favorite editor:
$ vi log4j.properties
- Change the second line from log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console (a non-interactive alternative using sed follows these steps).
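The same edit can also be made non-interactively with sed, which is convenient if you script your cluster setup (a minimal sketch, assuming you are still in the spark/conf directory):
$ sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' log4j.properties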
- Copy the configuration to all slave nodes after the change:
$ spark-ec2/copy-dir spark/conf
This copies the conf directory to all the slave nodes.
- Destroy the Spark cluster:
$ spark-ec2 destroy spark-cluster
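If you only want to pause the cluster rather than delete it, the spark-ec2 script also provides stop and start actions. Keep in mind that data in ephemeral-hdfs is lost when the instances stop, while data in the EBS-backed persistent-hdfs survives; the commands below are a sketch using the same example names as before:
$ spark-ec2 stop spark-cluster
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem start spark-cluster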