Spark Cookbook
By Rishi Yadav
Packt Publishing, July 2015, 226 pages, 1st Edition

Chapter 1. Getting Started with Apache Spark

In this chapter, we will set up Spark and configure it. This chapter is divided into the following recipes:

  • Installing Spark from binaries

  • Building the Spark source code with Maven

  • Launching Spark on Amazon EC2

  • Deploying Spark on a cluster in standalone mode

  • Deploying Spark on a cluster with Mesos

  • Deploying Spark on a cluster with YARN

  • Using Tachyon as an off-heap storage layer

Introduction


Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease-of-use, and sophisticated analytics.

Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of memory for storage. In MapReduce, memory is primarily used for the actual computation; Spark uses memory both to compute and to store objects.

Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.

Though Spark is written in Scala, and this book focuses only on recipes in Scala, Spark also supports Java and Python.
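
To give a flavor of what the Scala recipes in this book look like, here is a minimal word-count sketch; the input path and application name are placeholders rather than part of any recipe:

    // A minimal Spark word count in Scala (Spark 1.x API); the input path is a placeholder.
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val counts = sc.textFile("/tmp/words.txt")   // load a text file as an RDD of lines
          .flatMap(_.split("\\s+"))                  // split each line into words
          .map(word => (word, 1))                    // pair each word with a count of 1
          .reduceByKey(_ + _)                        // sum the counts per word
        counts.take(10).foreach(println)             // print a small sample of the results
        sc.stop()
      }
    }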

Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements.

The following figure shows the Spark ecosystem:

The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users. Mesos is a cluster manager, which is evolving into a data center operating system. YARN is Hadoop's compute framework that has a robust resource management feature that Spark can seamlessly use.
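
The cluster manager is selected purely through the master URL given to the driver; the application code itself does not change. The following sketch illustrates the common master URL forms (the host names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // The same application runs on any cluster manager; only the master URL differs.
    val conf = new SparkConf().setAppName("ClusterManagerDemo")
    // conf.setMaster("local[*]")                    // run locally, using all available cores
    // conf.setMaster("spark://master-host:7077")    // Spark's own standalone cluster manager
    // conf.setMaster("mesos://mesos-host:5050")     // Apache Mesos
    // conf.setMaster("yarn-client")                 // Hadoop YARN (Spark 1.x client mode)
    val sc = new SparkContext(conf.setMaster("local[*]"))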

Installing Spark from binaries


Spark can either be built from the source code, or precompiled binaries can be downloaded from http://spark.apache.org. For a standard use case, binaries are good enough, and this recipe will focus on installing Spark using binaries.

Getting ready

All the recipes in this book are developed using Ubuntu Linux but should work fine on any POSIX environment. Spark expects Java to be installed and the JAVA_HOME environment variable to be set.

In Linux/Unix systems, there are certain standards for the location of files and directories, which we are going to follow in this book. The following is a quick cheat sheet:

Directory   Description
/bin        Essential command binaries
/etc        Host-specific system configuration
/opt        Add-on application software packages
/var        Variable data
/tmp        Temporary files
/home       User home directories

How to do it...

At the time of writing, Spark's current version is 1.4. Please check the latest version on Spark's download page at http://spark.apache.org/downloads.html. Binaries are built against the most recent stable version of Hadoop. To use a specific version of Hadoop, the recommended approach is to build from sources, which will be covered in the next recipe.

The following are the installation steps:

  1. Open the terminal and download binaries using the following command:

    $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.4.tgz
    
  2. Unpack binaries:

    $ tar -zxf spark-1.4.0-bin-hadoop2.4.tgz
    
  3. Rename the folder containing binaries by stripping the version information:

    $ sudo mv spark-1.4.0-bin-hadoop2.4 spark
    
  4. Create the /etc/spark directory and move the configuration files there so that the folder can be made a symbolic link later:

    $ sudo mkdir -p /etc/spark
    $ sudo mv spark/conf/* /etc/spark
    
  5. Create your company-specific installation directory under /opt. As the recipes in this book are tested on infoobjects sandbox, we are going to use infoobjects as directory name. Create the /opt/infoobjects directory:

    $ sudo mkdir -p /opt/infoobjects
    
  6. Move the spark directory to /opt/infoobjects as it's an add-on software package:

    $ sudo mv spark /opt/infoobjects/
    
  7. Change the ownership of the spark home directory to root:

    $ sudo chown -R root:root /opt/infoobjects/spark
    
  8. Change permissions of the spark home directory, 0755 = user:read-write-execute group:read-execute world:read-execute:

    $ sudo chmod -R 755 /opt/infoobjects/spark
    
  9. Move to the spark home directory:

    $ cd /opt/infoobjects/spark
    
  10. Create the symbolic link:

    $ sudo ln -s /etc/spark conf
    
  11. Append to PATH in .bashrc:

    $ echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc
    
  12. Open a new terminal.

  13. Create the log directory in /var:

    $ sudo mkdir -p /var/log/spark
    
  14. Make hduser the owner of the Spark log directory.

    $ sudo chown -R hduser:hduser /var/log/spark
    
  15. Create the Spark tmp directory:

    $ mkdir /tmp/spark
    
  16. Configure Spark with the help of the following command lines:

    $ cd /etc/spark
    $ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
    $ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop" >> spark-env.sh
    $ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
    $ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
    
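A quick way to confirm that the installation works is to run a trivial job in the Spark shell; the snippet below is only an illustrative smoke test, not part of the original recipe:

    // Start the shell with $ spark-shell; it creates the SparkContext as `sc` for you.
    val nums = sc.parallelize(1 to 1000)       // distribute a local range as an RDD
    println(nums.sum())                        // should print 500500.0
    println(nums.filter(_ % 2 == 0).count())   // should print 500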

Building the Spark source code with Maven


Installing Spark using binaries works fine in most cases. For advanced cases, such as the following (but not limited to these), compiling from the source code is a better option:

  • Compiling for a specific Hadoop version

  • Adding the Hive integration

  • Adding the YARN integration

Getting ready

The following are the prerequisites for this recipe to work:

  • Java 1.6 or a later version

  • Maven 3.x

How to do it...

The following are the steps to build the Spark source code with Maven:

  1. Increase MaxPermSize for heap:

    $ echo "export _JAVA_OPTIONS=\"-XX:MaxPermSize=1G\""  >> /home/hduser/.bashrc
    
  2. Open a new terminal window and download the Spark source code from GitHub:

    $ wget https://github.com/apache/spark/archive/branch-1.4.zip
    
  3. Unpack the archive:

    $ unzip branch-1.4.zip
    
  4. The archive unpacks into a folder named spark-branch-1.4; rename it to spark and move into it:

    $ mv spark-branch-1.4 spark
    $ cd spark
    
  5. Compile the sources with these flags: Yarn enabled, Hadoop version 2.4, Hive enabled, and skipping tests for faster compilation:

    $ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
    
  6. Move the conf folder to /etc/spark so that it can be made a symbolic link later:

    $ cd ..
    $ sudo mv spark/conf /etc/spark
    
  7. Move the spark directory to /opt as it's an add-on software package:

    $ sudo mv spark /opt/infoobjects/spark
    
  8. Change the ownership of the spark home directory to root:

    $ sudo chown -R root:root /opt/infoobjects/spark
    
  9. Change the permissions of the spark home directory 0755 = user:rwx group:r-x world:r-x:

    $ sudo chmod -R 755 /opt/infoobjects/spark
    
  10. Move to the spark home directory:

    $ cd /opt/infoobjects/spark
    
  11. Create a symbolic link:

    $ sudo ln -s /etc/spark conf
    
  12. Put the Spark executable in the path by editing .bashrc:

    $ echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc
    
  13. Create the log directory in /var:

    $ sudo mkdir -p /var/log/spark
    
  14. Make hduser the owner of the Spark log directory:

    $ sudo chown -R hduser:hduser /var/log/spark
    
  15. Create the Spark tmp directory:

    $ mkdir /tmp/spark
    
  16. Configure Spark with the help of the following command lines:

    $ cd /etc/spark
    $ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
    $ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop" >> spark-env.sh
    $ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
    $ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
    

Launching Spark on Amazon EC2


Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute instances in the cloud. Amazon EC2 provides the following features:

  • On-demand delivery of IT resources via the Internet

  • Provisioning of as many instances as you like

  • Payment by the hour for the instances you use, like a utility bill

  • No setup cost, no installation, and no overhead at all

  • The ability to simply shut down or terminate instances when you no longer need them

  • The availability of these instances on all familiar operating systems

EC2 provides different types of instances to meet all compute needs, such as general-purpose instances, micro instances, memory-optimized instances, storage-optimized instances, and others. Micro instances are available in a free tier for you to try.

Getting ready

The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.

Before you start, you need to do the following things:

  1. Log in to your Amazon AWS account (http://aws.amazon.com).

  2. Click on Security Credentials under your account name in the top-right corner.

  3. Click on Access Keys and Create New Access Key:

  4. Note down the access key ID and secret access key.

  5. Now go to Services | EC2.

  6. Click on Key Pairs in the left-hand menu under NETWORK & SECURITY.

  7. Click on Create Key Pair and enter kp-spark as the key-pair name:

  8. Download the private key file and copy it to the /home/hduser/keypairs folder.

  9. Set the permissions on the key file to 600.

  10. Set environment variables to reflect the access key ID and secret access key (please replace the sample values with your own):

    $ echo "export AWS_ACCESS_KEY_ID=\"AKIAOD7M2LOWATFXFKQ\"" >> /home/hduser/.bashrc
    $ echo "export AWS_SECRET_ACCESS_KEY=\"+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q\"" >> /home/hduser/.bashrc
    $ echo "export PATH=$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc
    

How to do it...

  1. Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:

    $ cd /home/hduser
    $ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
    
  2. Launch the cluster with the example value:

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2  -s 3 launch spark-cluster
    

    Note

    • <key-pair>: This is the name of the EC2 key pair created in AWS

    • <key-file>: This is the private key file you downloaded

    • <num-slaves>: This is the number of slave nodes to launch

    • <cluster-name>: This is the name of the cluster

  3. Sometimes, the default availability zones are not available; in that case, retry the request by specifying the specific availability zone you want:

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2  -s 3 launch spark-cluster
    
  4. If your application needs to retain data after the instance shuts down, attach an EBS volume to it (for example, with 10 GB of space):

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
    
  5. If you use Amazon spot instances, here's the way to do it:

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
    

    Note

    Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real-time based on supply and demand (source: amazon.com).

  6. After everything is launched, check the status of the cluster by going to the web UI URL that will be printed at the end.

  7. Check the status of the cluster:

  8. Now, to access the Spark cluster on EC2, let's connect to the master node using secure shell protocol (SSH):

    $ spark-ec2 -k kp-spark -i /home/hduser/kp/kp-spark.pem  login spark-cluster
    

    You should get something like the following:

  9. Check directories in the master node and see what they do:

    Directory        Description
    ephemeral-hdfs   This is the Hadoop instance for which data is ephemeral and gets deleted when you stop or restart the machine.
    persistent-hdfs  Each node has a very small amount of persistent storage (approximately 3 GB). If you use this instance, data will be retained in that space.
    hadoop-native    These are native libraries to support Hadoop, such as snappy compression libraries.
    scala            This is the Scala installation.
    shark            This is the Shark installation (Shark is no longer supported and has been replaced by Spark SQL).
    spark            This is the Spark installation.
    spark-ec2        These are files to support this cluster deployment.
    tachyon          This is the Tachyon installation.

  10. Check the HDFS version in an ephemeral instance:

    $ ephemeral-hdfs/bin/hadoop version
    Hadoop 2.0.0-cdh4.2.0
    
  11. Check the HDFS version in persistent instance with the following command:

    $ persistent-hdfs/bin/hadoop version
    Hadoop 2.0.0-cdh4.2.0
    
  12. Move to the Spark configuration directory to change the logging level:

    $ cd spark/conf
    
  13. The default log level information is too verbose, so let's change it to Error:

    1. Create the log4j.properties file by renaming the template:

      $ mv log4j.properties.template log4j.properties
      
    2. Open log4j.properties in vi or your favorite editor:

      $ vi log4j.properties
      
    3. Change the line log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console.

  14. Copy the configuration to all slave nodes after the change:

    $ spark-ec2/copy-dir spark/conf
    

    You should get something like this:

  15. Destroy the Spark cluster:

    $ spark-ec2 destroy spark-cluster
    

Deploying on a cluster in standalone mode


Compute resources in a distributed environment need to be managed so that resource utilization is efficient and every job gets a fair chance to run. Spark comes with its own cluster manager, conveniently called standalone mode. Spark also supports working with the YARN and Mesos cluster managers.

The choice of cluster manager is mostly driven by legacy concerns and by whether other frameworks, such as MapReduce, share the same compute resource pool. If your cluster has legacy MapReduce jobs running, and not all of them can be converted to Spark jobs, it is a good idea to use YARN as the cluster manager. Mesos is emerging as a data center operating system that conveniently manages jobs across frameworks, and it is very compatible with Spark.

If the Spark framework is the only framework in your cluster, then standalone mode is good enough. As Spark evolves as a technology, you will see more and more use cases of Spark being used as the standalone framework serving all big data compute needs. For example, some jobs may be using Apache Mahout at present because MLlib does not yet have the specific machine-learning library that the job needs. As soon as MLlib gets this library, this particular job can be moved to Spark.

Getting ready

Let's consider a cluster of six nodes as an example setup: one master and five slaves (replace them with actual node names in your cluster):

Master    m1.zettabytes.com
Slaves    s1.zettabytes.com
          s2.zettabytes.com
          s3.zettabytes.com
          s4.zettabytes.com
          s5.zettabytes.com

How to do it...

  1. Since Spark's standalone mode is the default, all you need to do is have the Spark binaries installed on both the master and slave machines. Put /opt/infoobjects/spark/sbin in the path on every node:

    $ echo "export PATH=$PATH:/opt/infoobjects/spark/sbin" >> /home/hduser/.bashrc
    
  2. Start the standalone master server (SSH to master first):

    hduser@m1.zettabytes.com~] start-master.sh
    

    The master, by default, starts on port 7077, which slaves use to connect to it. It also has a web UI at port 8080.

  3. SSH to each slave node and start the worker, pointing it at the master:

    hduser@s1.zettabytes.com~] spark-class org.apache.spark.deploy.worker.Worker spark://m1.zettabytes.com:7077
    

    For fine-grained configuration, the following parameters work with both master and slaves:

    Argument                            Meaning
    -i <ipaddress>, --ip <ipaddress>    IP address/DNS name the service listens on
    -p <port>, --port <port>            Port the service listens on
    --webui-port <port>                 Port for the web UI (by default, 8080 for the master and 8081 for a worker)
    -c <cores>, --cores <cores>         Total CPU cores on the machine that Spark applications can use (worker only)
    -m <memory>, --memory <memory>      Total RAM on the machine that Spark applications can use (worker only)
    -d <dir>, --work-dir <dir>          The directory to use for scratch space and job output logs

  4. Rather than manually starting and stopping the master and slave daemons on each node, this can also be accomplished using the cluster launch scripts.

  5. First, create the conf/slaves file on the master node and add one line per slave hostname (this example uses five slave nodes; replace them with the DNS names of the slave nodes in your cluster):

    hduser@m1.zettabytes.com~] echo "s1.zettabytes.com" >> conf/slaves
    hduser@m1.zettabytes.com~] echo "s2.zettabytes.com" >> conf/slaves
    hduser@m1.zettabytes.com~] echo "s3.zettabytes.com" >> conf/slaves
    hduser@m1.zettabytes.com~] echo "s4.zettabytes.com" >> conf/slaves
    hduser@m1.zettabytes.com~] echo "s5.zettabytes.com" >> conf/slaves
    

    Once the slaves file is set up, you can call the following scripts to start/stop the cluster:

    Script name       Purpose
    start-master.sh   Starts a master instance on the host machine
    start-slaves.sh   Starts a slave instance on each node listed in the slaves file
    start-all.sh      Starts both the master and the slaves
    stop-master.sh    Stops the master instance on the host machine
    stop-slaves.sh    Stops the slave instance on all nodes listed in the slaves file
    stop-all.sh       Stops both the master and the slaves

  6. Connect an application to the cluster through the Scala code:

    val sparkContext = new SparkContext(new SparkConf().setMaster("spark://m1.zettabytes.com:7077"))
    
  7. Connect to the cluster through Spark shell:

    $ spark-shell --master spark://master:7077
    
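    Once the shell is connected to the standalone master, any action runs on the cluster rather than locally. The snippet below is a small illustrative check (the partition count is arbitrary); the running application should also appear in the master web UI on port 8080:

    // Run inside the spark-shell started above; `sc` is already connected to the master.
    val data = sc.parallelize(1 to 10000, 10)  // 10 partitions, spread across the workers
    println(data.map(_ * 2).reduce(_ + _))     // a distributed computation; prints 100010000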

How it works...

In standalone mode, Spark follows the master-slave architecture, very much like Hadoop MapReduce and YARN. The compute master daemon is called Spark master and runs on one master node. Spark master can be made highly available using ZooKeeper. You can also add more standby masters on the fly, if needed.

The compute slave daemon is called worker, and it runs on each slave node. The worker daemon does the following:

  • Reports the availability of compute resources on a slave node, such as the number of cores, memory, and others, to Spark master

  • Spawns the executor when asked to do so by Spark master

  • Restarts the executor if it dies

There is, at most, one executor per application per slave machine.

Both the Spark master and the worker are very lightweight. Typically, a memory allocation between 500 MB and 1 GB is sufficient. This value can be set in conf/spark-env.sh through the SPARK_DAEMON_MEMORY parameter. For example, the following configuration sets the memory to 1 GB for both the master and the worker daemons. Make sure you run it with superuser (sudo) privileges:

$ echo "export SPARK_DAEMON_MEMORY=1g" >> /opt/infoobjects/spark/conf/spark-env.sh

By default, each slave node has one worker instance running on it. Sometimes, you may have a few machines that are more powerful than others. In that case, you can spawn more than one worker on that machine by the following configuration (only on those machines):

$ echo "export SPARK_WORKER_INSTANCES=2" >> /opt/infoobjects/spark/conf/spark-env.sh

Spark worker, by default, uses all cores on the slave machine for its executors. If you would like to limit the number of cores the worker can use, you can set it to that number (for example, 12) by the following configuration:

$ echo "export SPARK_WORKER_CORES=12" >> /opt/infoobjects/spark/conf/spark-env.sh

The Spark worker, by default, uses all the available RAM on the machine (minus 1 GB) for executors. Note that you cannot set here how much memory each specific executor will use (you control that from the driver configuration). To assign another value for the total memory (for example, 24 GB) to be used by all executors combined, execute the following setting:

$ echo "export SPARK_WORKER_MEMORY=24g" >> /opt/infoobjects/spark/conf/spark-env.sh

There are some settings you can do at the driver level:

  • To specify the maximum number of CPU cores to be used by a given application across the cluster, you can set the spark.cores.max configuration in Spark submit or Spark shell as follows:

    $ spark-submit --conf spark.cores.max=12
    
  • To specify the amount of memory each executor should be allocated (the minimum recommendation is 8 GB), you can set the spark.executor.memory configuration in Spark submit or Spark shell as follows:

    $ spark-submit --conf spark.executor.memory=8g
    
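    The same driver-level settings can also be set programmatically on SparkConf instead of being passed on the command line. The following is a sketch with example values and the standalone master URL from this recipe:

    import org.apache.spark.{SparkConf, SparkContext}

    // Programmatic equivalent of the --conf flags above; the values are examples only.
    val conf = new SparkConf()
      .setAppName("StandaloneTuningDemo")
      .setMaster("spark://m1.zettabytes.com:7077")   // the standalone master from this recipe
      .set("spark.cores.max", "12")                  // cap the total cores used across the cluster
      .set("spark.executor.memory", "8g")            // memory allocated to each executor
    val sc = new SparkContext(conf)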

The following diagram depicts the high-level architecture of a Spark cluster:

Deploying on a cluster with Mesos


Mesos is slowly emerging as a data center operating system to manage all compute resources across a data center. Mesos runs on any computer running the Linux operating system and is built using the same principles as the Linux kernel. Let's see how we can install Mesos.

How to do it...

Mesosphere provides a binary distribution of Mesos. The most recent package for the Mesos distribution can be installed from the Mesosphere repositories by performing the following steps:

  1. Add the Mesosphere repository on Ubuntu (trusty):

    $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
    $ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
    $ CODENAME=$(lsb_release -cs)
    $ sudo vi /etc/apt/sources.list.d/mesosphere.list

    deb http://repos.mesosphere.io/ubuntu trusty main
    
  2. Update the repositories:

    $ sudo apt-get -y update
    
  3. Install Mesos:

    $ sudo apt-get -y install mesos
    
  4. To integrate Spark with Mesos, make the Spark binaries available to Mesos and configure the Spark driver to connect to it.

  5. Use the Spark binaries from the first recipe and upload them to HDFS:

    $ hdfs dfs -put spark-1.4.0-bin-hadoop2.4.tgz spark-1.4.0-bin-hadoop2.4.tgz
    
  6. The master URL for a single-master Mesos cluster is mesos://host:5050, and for a ZooKeeper-managed Mesos cluster, it is mesos://zk://host:2181.

  7. Set the following variables in spark-env.sh:

    $ sudo vi spark-env.sh
    export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
    export SPARK_EXECUTOR_URI=hdfs://localhost:9000/user/hduser/spark-1.4.0-bin-hadoop2.4.tgz
    
  8. Run from the Scala program:

    val conf = new SparkConf().setMaster("mesos://host:5050")
    val sparkContext = new SparkContext(conf)
    
  9. Run from the Spark shell:

    $ spark-shell --master mesos://host:5050
    

    Note

    Mesos has two run modes:

    Fine-grained: In fine-grained (default) mode, every Spark task runs as a separate Mesos task

    Coarse-grained: This mode will launch only one long-running Spark task on each Mesos machine

  10. To run in the coarse-grained mode, set the spark.mesos.coarse property (a combined configuration sketch follows this step):

    conf.set("spark.mesos.coarse","true")
    
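    Putting steps 7 to 10 together, a driver configured for coarse-grained Mesos mode might look like the following sketch; the Mesos host and the HDFS path are the placeholder values used earlier in this recipe:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of a driver running against Mesos in coarse-grained mode; host and paths are placeholders.
    val conf = new SparkConf()
      .setAppName("MesosCoarseDemo")
      .setMaster("mesos://host:5050")
      .set("spark.mesos.coarse", "true")        // one long-running Spark task per Mesos machine
      .set("spark.executor.uri",                // where Mesos slaves fetch the Spark binaries from
           "hdfs://localhost:9000/user/hduser/spark-1.4.0-bin-hadoop2.4.tgz")
    val sc = new SparkContext(conf)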

Deploying on a cluster with YARN


Yet Another Resource Negotiator (YARN) is Hadoop's compute framework that runs on top of HDFS, which is Hadoop's storage layer.

YARN follows the master-slave architecture. The master daemon is called ResourceManager and the slave daemon is called NodeManager. Besides these, application life cycle management is done by ApplicationMaster, which can be spawned on any slave node and stays alive for the lifetime of an application.

When Spark is run on YARN, ResourceManager performs the role of Spark master and NodeManagers work as executor nodes.

While running Spark with YARN, each Spark executor runs as a YARN container.

Getting ready

Running Spark on YARN requires a binary distribution of Spark that has YARN support. In both Spark installation recipes, we have taken care of it.

How to do it...

  1. To run Spark on YARN, the first step is to set the configuration:

    HADOOP_CONF_DIR: to write to HDFS
    YARN_CONF_DIR: to connect to YARN ResourceManager
    $ cd /opt/infoobjects/spark/conf (or /etc/spark)
    $ sudo vi spark-env.sh
    export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
    export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
    

    You can see this in the following screenshot:

  2. The following command launches YARN Spark in the yarn-client mode:

    $ spark-submit --class path.to.your.Class --master yarn-client [options] <app jar> [app options]
    

    Here's an example:

    $ spark-submit --class com.infoobjects.TwitterFireHose --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
    
  3. The following command launches Spark shell in the yarn-client mode:

    $ spark-shell --master yarn-client
    
  4. The command to launch in the yarn-cluster mode is as follows:

    $ spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
    

    Here's an example:

    $ spark-submit --class com.infoobjects.TwitterFireHose --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
    

How it works…

Spark applications on YARN run in two modes:

  • yarn-client: The Spark driver runs in the client process outside of the YARN cluster, and ApplicationMaster is only used to negotiate resources from the ResourceManager

  • yarn-cluster: The Spark driver runs in the ApplicationMaster, spawned by a NodeManager on a slave node

The yarn-cluster mode is recommended for production deployments, while the yarn-client mode is good for development and debugging, when you would like to see immediate output. There is no need to specify the Spark master in either mode, as it is picked from the Hadoop configuration; the master parameter is simply yarn-client or yarn-cluster.

The following figure shows how Spark is run with YARN in the client mode:

The following figure shows how Spark is run with YARN in the cluster mode:

In the YARN mode, the following configuration parameters can be set:

  • --num-executors: The number of executors to allocate

  • --executor-memory: RAM per executor

  • --executor-cores: CPU cores per executor

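These flags map to ordinary Spark configuration properties, so they can also be set programmatically. The sketch below assumes the Spark 1.x property names (spark.executor.instances applies when running on YARN) and uses example values only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Programmatic equivalent of the spark-submit flags above (example values).
    val conf = new SparkConf()
      .setAppName("YarnClientDemo")
      .setMaster("yarn-client")                 // Spark 1.x master string for the yarn-client mode
      .set("spark.executor.instances", "3")     // --num-executors
      .set("spark.executor.memory", "2g")       // --executor-memory
      .set("spark.executor.cores", "1")         // --executor-cores
    val sc = new SparkContext(conf)
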
Using Tachyon as an off-heap storage layer


Spark RDDs are a great way to store datasets in memory, but they can end up creating multiple copies of the same data across different applications. Tachyon solves some of the challenges of Spark RDD management. A few of them are:

  • An RDD only exists for the duration of the Spark application

  • The same process performs the computation and the RDD in-memory storage; so, if the process crashes, the in-memory storage also goes away

  • Different jobs cannot share an RDD even if they are for the same underlying data, for example, an HDFS block; this leads to:

    • Slow writes to disk

    • Duplication of data in memory and a higher memory footprint

  • If the output of one application needs to be shared with another application, it is slow due to replication on disk

Tachyon provides an off-heap memory layer to solve these problems. This layer, being off-heap, is immune to process crashes and is also not subject to garbage collection. This also lets RDDs be shared across applications and outlive a specific job or session; in essence, one single copy of data resides in memory, as shown in the following figure:
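
To make this sharing concrete before walking through the setup, here is a minimal sketch of two applications exchanging an RDD's data through Tachyon; the paths and the Tachyon URL are placeholders matching the local setup used later in this recipe:

    // Application A: write an RDD's data to Tachyon so that other applications can reuse it.
    val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
    words.saveAsTextFile("tachyon://localhost:19998/shared/words")

    // Application B (a separate SparkContext, possibly a completely different job):
    // read the same in-memory copy instead of loading the data from disk again.
    val shared = sc.textFile("tachyon://localhost:19998/shared/words")
    println(shared.count())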

How to do it...

  1. Let's download and compile Tachyon (by default, Tachyon comes configured for Hadoop 1.0.4, so it needs to be compiled from sources for the right Hadoop version). Replace <version> with the current version; at the time of writing this book, it is 0.6.4:

    $ wget https://github.com/amplab/tachyon/archive/v<version>.zip
    
  2. Unarchive the source code:

    $ unzip v<version>.zip
    
  3. Remove the version from the tachyon source folder name for convenience:

    $ mv tachyon-<version> tachyon
    
  4. Change the directory to the tachyon folder, build Tachyon, and prepare the journal and RAM disk folders:

    $ cd tachyon
    $ mvn -Dhadoop.version=2.4.0 clean package -DskipTests=true
    $ cd conf
    $ sudo mkdir -p /var/tachyon/journal
    $ sudo chown -R hduser:hduser /var/tachyon/journal
    $ sudo mkdir -p /var/tachyon/ramdisk
    $ sudo chown -R hduser:hduser /var/tachyon/ramdisk
    
    $ mv tachyon-env.sh.template tachyon-env.sh
    $ vi tachyon-env.sh
    
  5. Comment the following line:

    export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
    
  6. Uncomment the following line:

    export TACHYON_UNDERFS_ADDRESS=hdfs://localhost:9000
    
  7. Change the following properties:

    -Dtachyon.master.journal.folder=/var/tachyon/journal/
    
    export TACHYON_RAM_FOLDER=/var/tachyon/ramdisk
    
    $ sudo mkdir -p /var/log/tachyon
    $ sudo chown -R hduser:hduser /var/log/tachyon
    $ vi log4j.properties
    
  8. Replace ${tachyon.home} with /var/log/tachyon.

  9. Create a new core-site.xml file in the conf directory:

    $ sudo vi core-site.xml
    <configuration>
    <property>
        <name>fs.tachyon.impl</name>
        <value>tachyon.hadoop.TFS</value>
      </property>
    </configuration>
    $ cd ~
    $ sudo mv tachyon /opt/infoobjects/
    $ sudo chown -R root:root /opt/infoobjects/tachyon
    $ sudo chmod -R 755 /opt/infoobjects/tachyon
    
  10. Add <tachyon home>/bin to the path:

    $ echo "export PATH=$PATH:/opt/infoobjects/tachyon/bin" >> /home/hduser/.bashrc
    
  11. Restart the shell and format Tachyon:

    $ tachyon format
    $ tachyon-start.sh local   # you need to enter the root password as RamFS needs to be formatted
    

    Tachyon's web interface is http://hostname:19999:

  12. Run the sample program to see whether Tachyon is running fine:

    $ tachyon runTest Basic CACHE_THROUGH
    
  13. You can stop Tachyon any time by running the following command:

    $ tachyon-stop.sh
    
  14. Run Spark on Tachyon:

    $ spark-shell
    scala> val words = sc.textFile("tachyon://localhost:19998/words")
    scala> words.count
    scala> words.saveAsTextFile("tachyon://localhost:19998/w2")
    scala> val person = sc.textFile("hdfs://localhost:9000/user/hduser/person")
    scala> import org.apache.spark.api.java._
    scala> person.persist(StorageLevels.OFF_HEAP)
    