Launching Spark on Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute instances in the cloud. Amazon EC2 provides the following features:

  • On-demand delivery of IT resources via the Internet
  • Provisioning of as many instances as you like
  • Payment only for the hours during which you use instances, much like a utility bill
  • No setup cost, no installation, and no overhead at all
  • The ability to shut down or terminate instances and walk away when you no longer need them
  • Availability of these instances on all familiar operating systems

EC2 provides different types of instances to meet all compute needs, such as general-purpose instances, micro instances, memory-optimized instances, storage-optimized instances, and others. Amazon also offers a free tier of micro instances to try.

Getting ready

The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.

Before you start, you need to do the following things:

  1. Log in to your Amazon AWS account (http://aws.amazon.com).
  2. Click on Security Credentials under your account name in the top-right corner.
  3. Click on Access Keys and Create New Access Key.
  4. Note down the access key ID and secret access key.
  5. Now go to Services | EC2.
  6. Click on Key Pairs in the left-hand menu under NETWORK & SECURITY.
  7. Click on Create Key Pair and enter kp-spark as the key-pair name.
  8. Download the private key file and copy it to the /home/hduser/keypairs folder.
  9. Set the permissions on the key file to 600.
  10. Set environment variables to reflect the access key ID and secret access key (replace the sample values with your own; a sketch for applying and verifying steps 9 and 10 follows this list):
    $ echo "export AWS_ACCESS_KEY_ID=\"AKIAOD7M2LOWATFXFKQ\"" >> /home/hduser/.bashrc
    $ echo "export AWS_SECRET_ACCESS_KEY=\"+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q\"" >> /home/hduser/.bashrc
    $ echo "export PATH=$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc
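
    To apply and verify steps 9 and 10 in one go (a sketch; it assumes the key file and .bashrc paths used above):
    $ chmod 600 /home/hduser/keypairs/kp-spark.pem
    $ source /home/hduser/.bashrc
    $ echo $AWS_ACCESS_KEY_ID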
    

How to do it...

  1. Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:
    $ cd /home/hduser
    $ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
    
  2. Launch the cluster with example values:
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2  -s 3 launch spark-cluster
    

    Note

    • <key-pair>: This is the name of EC2 key-pair created in AWS
    • <key-file>: This is the private key file you downloaded
    • <num-slaves>: This is the number of slave nodes to launch
    • <cluster-name>: This is the name of the cluster
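
    The recipe's introduction mentioned the different instance types EC2 offers; the spark-ec2 script also accepts an --instance-type option if the default size does not fit your workload (a sketch; m3.large is only an example type):
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --instance-type=m3.large --hadoop-major-version 2 -s 3 launch spark-cluster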
  3. Sometimes, the default availability zones are not available; in that case, retry the request by specifying the particular availability zone you want:
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2  -s 3 launch spark-cluster
    
  4. If your application needs to retain data after the instance shuts down, attach an EBS volume to it (for example, a 10 GB volume):
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
    
  5. If you use Amazon spot instances, here is how to do it:
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
    

    Note

    Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real time based on supply and demand (source: amazon.com).

  6. After everything is launched, check the status of the cluster by going to the web UI URL that is printed at the end of the launch output.
  7. Check the status of the cluster in the web UI.
  8. Now, to access the Spark cluster on EC2, let's connect to the master node using the Secure Shell (SSH) protocol:
    $ spark-ec2 -k kp-spark -i /home/hduser/kp/kp-spark.pem  login spark-cluster
    

    You should now be connected to the master node.
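
    If you prefer a plain ssh session instead of the login action, the spark-ec2 script also has a get-master action that prints the master's hostname (a sketch; <master-hostname> stands for whatever that command returns, and spark-ec2 clusters log in as root):
    $ spark-ec2 get-master spark-cluster
    $ ssh -i /home/hduser/keypairs/kp-spark.pem root@<master-hostname>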
  9. Check the directories in the master node and see what they do:

     Directory        Description
     ephemeral-hdfs   This is the Hadoop instance for which data is ephemeral and gets deleted when you stop or restart the machine.
     persistent-hdfs  Each node has a very small amount of persistent storage (approximately 3 GB). If you use this instance, data will be retained in that space.
     hadoop-native    These are native libraries to support Hadoop, such as snappy compression libraries.
     scala            This is the Scala installation.
     shark            This is the Shark installation (Shark is no longer supported and has been replaced by Spark SQL).
     spark            This is the Spark installation.
     spark-ec2        These are files to support this cluster deployment.
     tachyon          This is the Tachyon installation.
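
     As a quick sanity check (a sketch; the exact listing depends on the Spark release and AMI), list the master node's home directory and you should see entries like these:
     $ ls ~
     ephemeral-hdfs  hadoop-native  persistent-hdfs  scala  shark  spark  spark-ec2  tachyon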

  10. Check the HDFS version in the ephemeral instance:
    $ ephemeral-hdfs/bin/hadoop version
    Hadoop 2.0.0-cdh4.2.0
    
  11. Check the HDFS version in the persistent instance with the following command:
    $ persistent-hdfs/bin/hadoop version
    Hadoop 2.0.0-cdh4.2.0
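
    To confirm that the persistent instance actually retains data (a sketch; the /test path is only an illustrative choice), write a directory to it and list the root:
    $ persistent-hdfs/bin/hadoop fs -mkdir /test
    $ persistent-hdfs/bin/hadoop fs -ls /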
    
  12. Change to the Spark configuration directory to adjust the logging level:
    $ cd spark/conf
    
  13. The default log level, INFO, is too verbose, so let's change it to ERROR:
    1. Create the log4j.properties file by renaming the template:
      $ mv log4j.properties.template log4j.properties
      
    2. Open log4j.properties in vi or your favorite editor:
      $ vi log4j.properties
      
    3. Change the second line from log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console (a non-interactive alternative follows this list).
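    If you prefer a non-interactive edit, the same change can be made with a sed one-liner (a sketch, assuming the default template contents):
      $ sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' log4j.properties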
  14. Copy the configuration to all slave nodes after the change:
    $ spark-ec2/copy-dir spark/conf
    

    You should see the spark/conf directory being copied to each slave node.
  15. Destroy the Spark cluster:
    $ spark-ec2 destroy spark-cluster
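
    If you only want to pause the cluster rather than destroy it, the spark-ec2 script also provides stop and start actions (a sketch; remember from the directory table above that ephemeral-hdfs data is lost when instances are stopped):
    $ spark-ec2 stop spark-cluster
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem start spark-cluster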
    