Spark standalone mode

If you have a set of machines without any existing cluster management software, you can deploy Spark over SSH with some handy scripts. This method is known as standalone mode in the Spark documentation at http://spark.apache.org/docs/latest/spark-standalone.html. An individual master and worker can be started with sbin/start-master.sh and sbin/start-slave.sh, respectively, and the master's web UI listens on port 8080 by default. As you likely don't want to go to each of your machines and run these commands by hand, there are a number of helper scripts in sbin/ to help you run your servers.
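
As a quick illustration, here is a minimal sketch of starting a master and attaching a single worker to it by hand; the hostname spark-master and the installation path /opt/spark are assumptions for this example:

# On the machine that will act as the master (installation path is an assumption)
/opt/spark/sbin/start-master.sh

# On a worker machine, point it at the master's URL (host and port are assumptions)
/opt/spark/sbin/start-slave.sh spark://spark-master:7077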

A prerequisite for using any of the scripts is having passwordless SSH access set up from the master to all of the worker machines. You probably want to create a new user for running Spark on the machines and lock it down. This book uses the username sparkuser. On your master, you can run ssh-keygen to generate the SSH keys; make sure that you do not set a passphrase. Once you have generated the key, add the public key (if you generated an RSA key, it is stored in ~/.ssh/id_rsa.pub by default) to ~/.ssh/authorized_keys on each of the hosts.
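
A minimal sketch of this setup, assuming the sparkuser account from above and a worker named worker1 (both names are only examples), might look like this:

# On the master, as sparkuser: generate a key pair with no passphrase
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to each worker's authorized_keys
ssh-copy-id sparkuser@worker1

# Verify that passwordless login works
ssh sparkuser@worker1 hostname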

Tip

The Spark administration scripts require that your usernames match. If this isn't the case, you can configure an alternative username in your ~/.ssh/config.
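
For example, an entry like the following in ~/.ssh/config on the master (the host pattern worker* and the username sparkuser are only examples) tells SSH which remote username to use:

Host worker*
    User sparkuser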

Now that you have SSH access to the machines set up, it is time to configure Spark. There is a simple template in conf/spark-env.sh.template, which you should copy to conf/spark-env.sh.
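
For example:

cd /opt/spark   # the installation path is an assumption
cp conf/spark-env.sh.template conf/spark-env.sh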

You may also find it useful to set some (or all) of the environment variables shown in the following table:

Name | Purpose | Default
MESOS_NATIVE_LIBRARY | This variable points to where Mesos is installed. | None
SCALA_HOME | This variable points to where you extracted Scala. | None, must be set
SPARK_MASTER_IP | This variable states the IP address the master listens on and that the workers connect to. | The result of running hostname
SPARK_MASTER_PORT | This variable states the port the Spark master listens on. | 7077
SPARK_MASTER_WEBUI_PORT | This variable states the port of the master's web UI. | 8080
SPARK_WORKER_CORES | This variable states the number of cores for the worker to use. | All of them
SPARK_WORKER_MEMORY | This variable states how much memory the worker may use. | max(total system memory - 1 GB, 512 MB)
SPARK_WORKER_PORT | This variable states the port the worker runs on. | Random
SPARK_WORKER_WEBUI_PORT | This variable states the port the worker's web UI runs on. | 8081
SPARK_WORKER_DIR | This variable states where the worker stores its files. | SPARK_HOME/work
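
As an illustration only, a minimal conf/spark-env.sh built from the table above might look like the following; every value here (addresses, paths, and sizes) is an assumption you should adjust for your own machines:

# conf/spark-env.sh -- illustrative values, adjust for your cluster
export SCALA_HOME=/opt/scala
export SPARK_MASTER_IP=10.0.0.1      # address the master listens on
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4          # cores each worker may use
export SPARK_WORKER_MEMORY=8g        # memory each worker may use
export SPARK_WORKER_DIR=/data/spark-work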

Once you have completed your configuration, it's time to get your cluster up and running. You will want to copy the version of Spark and the configuration you have built to all of your machines. You may find it useful to install pssh, a set of parallel SSH tools, including pscp. The pscp tool makes it easy to scp to a number of target hosts, although it will take a while, as shown here:

pscp -v -r -h conf/slaves -l sparkuser /opt/spark ~/

If you end up changing the configuration, you need to distribute the configuration to all of the workers, as shown here:

pscp -v -r -h conf/slaves -l sparkuser conf/spark-env.sh  /opt/spark/conf/spark-env.sh

Tip

If you use a shared NFS home directory on your cluster, by default every worker will be configured to write its logs and scratch files to the same place, so you should configure a separate worker directory for each host. If you want to keep your worker directories on the shared NFS, consider appending the hostname, for example SPARK_WORKER_DIR=~/work-`hostname`.

You should also consider having your log files go to a scratch directory for performance.
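
A sketch of how this could look in conf/spark-env.sh (the local scratch path is an assumption, and SPARK_LOG_DIR is the optional variable for redirecting the logs):

# Per-host worker directory on the shared home, and logs on local scratch
export SPARK_WORKER_DIR=~/work-`hostname`
export SPARK_LOG_DIR=/tmp/spark-logs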

Now you are ready to start the cluster using the sbin/start-all.sh, sbin/start-master.sh, and sbin/start-slaves.sh scripts. It is important to note that start-all.sh and start-master.sh both assume that they are being run on the node that is the master for the cluster. The start scripts all daemonize, so you don't have to worry about running them in a screen session.

ssh master sbin/start-all.sh

If you get a class not found error stating java.lang.NoClassDefFoundError: scala/ScalaObject, check to make sure that you have Scala installed on that worker host and that SCALA_HOME is set correctly.
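
One quick way to check, assuming the worker is named worker1 and Spark lives in /opt/spark (both assumptions), is to inspect the worker over SSH:

# Confirm Scala is present and SCALA_HOME is set in the worker's config
ssh sparkuser@worker1 'which scala; grep SCALA_HOME /opt/spark/conf/spark-env.sh'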

Tip

The Spark scripts assume that your master has Spark installed in the same directory as your workers. If this is not the case, you should edit sbin/spark-config.sh and set it to the appropriate directories.

The commands provided by Spark to help you administer your cluster are given in the following table. More details are available on the Spark website at http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts.

Command | Use
sbin/slaves.sh <command> | This command runs the provided command on all of the worker hosts. For example, sbin/slaves.sh uptime will show how long each of the worker hosts has been up.
sbin/start-all.sh | This command starts the master and all of the worker hosts. It must be run on the master.
sbin/start-master.sh | This command starts the master host. It must be run on the master.
sbin/start-slaves.sh | This command starts all of the worker hosts.
sbin/start-slave.sh | This command starts a specific worker.
sbin/stop-all.sh | This command stops the master and all of the workers.
sbin/stop-master.sh | This command stops the master.
sbin/stop-slaves.sh | This command stops all of the workers.

You now have a running Spark cluster. There is a handy web UI on the master, running on port 8080, which you should visit, and one on each of the workers on port 8081. The web UI contains helpful information, such as the current workers and the current and past jobs.
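
If you prefer the command line, the standalone master's web UI also exposes its status as JSON on most Spark versions; a quick sanity check (the hostname spark-master is an assumption) might be:

# Fetch the master's status summary as JSON
curl http://spark-master:8080/json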


Now that you have a cluster up and running, let's actually do something with it. As with the single host example, you can use the provided run-example script to run the Spark examples. All of the examples listed in examples/src/main/scala/org/apache/spark/examples/ take a parameter, master, which points them to the master. Assuming that you are on the master host, you could run one of them with this command:

./bin/run-example GroupByTest spark://`hostname`:7077

Tip

If you run into an issue with java.lang.UnsupportedClassVersionError, you may need to update your JDK or recompile Spark if you grabbed the binary version. Version 1.1.0 was compiled with JDK 1.7 as the target. You can check the version of the JRE targeted by Spark with the following commands:

java -verbose -classpath ./core/target/scala-2.9.2/classes/ spark.SparkFiles | head -n 20

Version 49 is JDK 1.5, version 50 is JDK 1.6, version 51 is JDK 1.7, and version 52 is JDK 1.8.
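
An alternative sketch for this check is javap, which prints the class file's major version directly; the classpath and class name below are assumptions that depend on how your Spark build is laid out:

# Print the bytecode target of a compiled Spark class (paths and class are examples)
javap -verbose -classpath ./core/target/scala-2.11/classes org.apache.spark.SparkFiles | grep "major version"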

If you can't connect to localhost, make sure that you've configured your master to listen on all of the IP addresses (or, if you don't want that, replace localhost with the IP address the master is configured to listen on). More port configurations, including spark.driver.port, are listed at http://spark.apache.org/docs/latest/configuration.html#networking.
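
As a sketch, the master's bind address can be pinned in conf/spark-env.sh and the driver port fixed in conf/spark-defaults.conf; the address and port values here are assumptions:

# conf/spark-env.sh -- make the master listen on a specific, reachable address
export SPARK_MASTER_IP=192.168.1.10

# conf/spark-defaults.conf -- pin the driver port (useful with firewalls)
spark.driver.port  50000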

If everything has worked correctly, you will see the following log messages output to stdout:

13/03/28 06:35:31 INFO spark.SparkContext: Job finished: count at GroupByTest.scala:35, took 2.482816756 s
2000