Packt+ | Advance your knowledge in tech

You're reading from Mastering Apache Spark 2.x - Second Edition

Product type Book

Published in Jul 2017

Publisher Packt

ISBN-13 9781786462749

Pages 354 pages

Edition 2nd Edition

Languages

Scala

Concepts

Data Streaming

Table of Contents (21) Chapters

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

1. A First Taste and What’s New in Apache Spark V2

2. Apache Spark SQL

3. The Catalyst Optimizer

4. Project Tungsten

5. Apache Spark Streaming

6. Structured Streaming

7. Apache Spark MLlib

8. Apache SparkML

9. Apache SystemML

10. Deep Learning on Apache Spark with DeepLearning4j and H2O

11. Apache Spark GraphX

12. Apache Spark GraphFrames

13. Apache Spark with Jupyter Notebooks on IBM DataScience Experience

14. Apache Spark on Kubernetes

Cluster management

The Spark context, as you will see in many of the examples in this book, can be defined via a Spark configuration object and Spark URL. The Spark context connects to the Spark cluster manager, which then allocates resources across the worker nodes for the application. The cluster manager allocates executors across the cluster worker nodes. It copies the application JAR file to the workers and finally allocates tasks.

The following subsections describe the possible Apache Spark cluster manager options available at this time.

Local

By specifying a Spark configuration local URL, it is possible to have the application run locally. By specifying local[n], it is possible to have Spark use n threads to run the application locally. This is a useful development and test option because you can also test some sort of parallelization scenarios but keep all log files on a single machine.

Standalone

Standalone mode uses a basic cluster manager that is supplied with Apache Spark. The spark master URL will be as follows:

Spark://<hostname>:7077

Here, <hostname> is the name of the host on which the Spark master is running. We have specified 7077 as the port, which is the default value, but this is configurable. This simple cluster manager currently supports only FIFO (first-in first-out) scheduling. You can contrive to allow concurrent application scheduling by setting the resource configuration options for each application; for instance, using spark.core.max to share cores between applications.

Apache YARN

At a larger scale, when integrating with Hadoop YARN, the Apache Spark cluster manager can be YARN and the application can run in one of two modes. If the Spark master value is set as yarn-cluster, then the application can be submitted to the cluster and then terminated. The cluster will take care of allocating resources and running tasks. However, if the application master is submitted as yarn-client, then the application stays alive during the life cycle of processing, and requests resources from YARN.

Apache Mesos

Apache Mesos is an open source system for resource sharing across a cluster. It allows multiple frameworks to share a cluster by managing and scheduling resources. It is a cluster manager that provides isolation using Linux containers and allowing multiple systems such as Hadoop, Spark, Kafka, Storm, and more to share a cluster safely. It is highly scalable to thousands of nodes. It is a master/slave-based system and is fault tolerant, using Zookeeper for configuration management.

For a single master node Mesos cluster, the Spark master URL will be in this form:

mesos://<hostname>:5050.

Here, <hostname> is the hostname of the Mesos master server; the port is defined as 5050, which is the default Mesos master port (this is configurable). If there are multiple Mesos master servers in a large-scale high availability Mesos cluster, then the Spark master URL would look as follows:

mesos://zk://<hostname>:2181.

So, the election of the Mesos master server will be controlled by Zookeeper. The <hostname> will be the name of a host in the Zookeeper quorum. Also, the port number, 2181, is the default master port for Zookeeper.