Spark Cookbook

With over 60 recipes on Spark, covering Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX, this is the perfect Spark book to always have by your side.

Author: Rishi Yadav
Paperback, 1st Edition, 226 pages
Published July 2015 by Packt
ISBN-13: 9781783987061
Table of Contents

Preface
1. Getting Started with Apache Spark
2. Developing Applications with Spark
3. External Data Sources
4. Spark SQL
5. Spark Streaming
6. Getting Started with Machine Learning Using MLlib
7. Supervised Learning with MLlib – Regression
8. Supervised Learning with MLlib – Classification
9. Unsupervised Learning with MLlib
10. Recommender Systems
11. Graph Processing Using GraphX
12. Optimizations and Performance Tuning
Index

Deploying on a cluster with YARN

Yet Another Resource Negotiator (YARN) is Hadoop's compute framework, which runs on top of HDFS, Hadoop's storage layer.

YARN follows the master-slave architecture: the master daemon is called the ResourceManager and the slave daemon is called the NodeManager. Besides these, application life cycle management is done by the ApplicationMaster, which can be spawned on any slave node and stays alive for the lifetime of an application.

When Spark is run on YARN, the ResourceManager performs the role of the Spark master and the NodeManagers work as executor nodes.

While running Spark with YARN, each Spark executor runs as a YARN container.
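If you want to see these daemons for yourself, YARN ships a command-line client that can list them. The following is a minimal sketch, assuming the yarn command from your Hadoop installation is on the PATH and the cluster is running:

    $ # List the NodeManagers (slave daemons) registered with the ResourceManager
    $ yarn node -list

    $ # List running applications; each one has its own ApplicationMaster
    $ yarn application -list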

Getting ready

Running Spark on YARN requires a binary distribution of Spark that is built with YARN support. Both Spark installation recipes in this book take care of this.
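If you build Spark from source instead, YARN support is enabled through Maven profiles. Here is a minimal sketch, assuming a Spark 1.x source tree; the Hadoop profile and version should match your cluster:

    $ cd spark
    $ build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package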

How to do it...

  1. To run Spark on YARN, the first step is to set the configuration:
    HADOOP_CONF_DIR: to write to HDFS
    YARN_CONF_DIR: to connect to the YARN ResourceManager
    $ cd /opt/infoobjects/spark/conf (or /etc/spark)
    $ sudo vi spark-env.sh
    export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
    export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
    

  2. The following command launches a Spark application on YARN in the yarn-client mode:
    $ spark-submit --class path.to.your.Class --master yarn-client [options] <app jar> [app options]
    

    Here's an example:

    $ spark-submit --class com.infoobjects.TwitterFireHose --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
    
  3. The following command launches the Spark shell in the yarn-client mode:
    $ spark-shell --master yarn-client
    
  4. The command to launch a Spark application in the yarn-cluster mode is as follows (see the sketch after these steps for retrieving the driver's output in this mode):
    $ spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
    

    Here's an example:

    $ spark-submit --class com.infoobjects.TwitterFireHose --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
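In the yarn-cluster mode, the driver's console output ends up in the YARN container logs rather than on your terminal. The following is a minimal sketch for retrieving them, assuming YARN log aggregation is enabled on the cluster; the application ID is hypothetical, so copy the real one printed by spark-submit:

    $ yarn logs -applicationId application_1434944618746_0001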
    

How it works…

Spark applications on YARN run in two modes:

  • yarn-client: The Spark driver runs in the client process outside of the YARN cluster, and the ApplicationMaster is only used to negotiate resources from the ResourceManager
  • yarn-cluster: The Spark driver runs in the ApplicationMaster, which is spawned by a NodeManager on a slave node

The yarn-cluster mode is recommended for production deployments, while the yarn-client mode is good for development and debugging, when you would like to see immediate output. There is no need to specify the Spark master in either mode, as it is picked from the Hadoop configuration; the master parameter is simply yarn-client or yarn-cluster.
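To confirm which ResourceManager spark-submit will contact, you can inspect the Hadoop configuration it reads. A minimal sketch, assuming YARN_CONF_DIR is exported as in the earlier step:

    $ grep -A 1 yarn.resourcemanager.address $YARN_CONF_DIR/yarn-site.xml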

The following figure shows how Spark is run with YARN in the client mode:

[Figure: Spark on YARN in the client mode]

The following figure shows how Spark is run with YARN in the cluster mode:

[Figure: Spark on YARN in the cluster mode]

In the YARN mode, the following configuration parameters can be set:

  • --num-executors: The number of executors to allocate
  • --executor-memory: RAM per executor
  • --executor-cores: CPU cores per executor
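
If you do not want to pass these flags on every invocation, the equivalent properties can be set once in conf/spark-defaults.conf so that every spark-submit picks them up. A minimal sketch with illustrative values:

    spark.executor.instances 3
    spark.executor.memory    2g
    spark.executor.cores     1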