For most chapters, one of the first things that we will do is to initialize and configure our Spark cluster.
Starting and configuring a Spark cluster
Getting ready
Ensure that the following import is available before initializing the cluster:
- from pyspark.sql import SparkSession
How to do it...
This section walks through the steps to initialize and configure a Spark cluster.
- Import SparkSession using the following script:
from pyspark.sql import SparkSession
- Configure a SparkSession and assign it to a variable named spark using the following script:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
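Once the session is available, it can help to confirm that the configuration took effect. The following is a minimal sanity check, assuming the spark variable created above; the printed values are simply what we would expect given this recipe's settings:

# Confirm that the session is live and that the builder settings were applied
print(spark.version)                             # the running Spark version
print(spark.sparkContext.master)                 # local[*]
print(spark.sparkContext.appName)                # GenericAppName
print(spark.conf.get("spark.executor.memory"))   # 6gb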
How it works...
This section explains how SparkSession works as the entry point for developing with Spark.
- Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext to begin development in Spark; creating a SparkSession handles initializing the cluster for us. Additionally, it is important to note that SparkSession is part of the sql module of pyspark. A sketch of the older pattern is shown after this list for comparison.
- We can assign properties to our SparkSession:
- master: sets the Spark master URL; local[*] runs Spark on our local machine with one worker thread per available core.
- appName: assigns a name to the application.
- config: sets the spark.executor.memory property to 6gb.
- getOrCreate: creates a new SparkSession if none exists and retrieves the existing one if it does, as demonstrated in the sketch after this list.
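For comparison, before Spark 2.0 the entry point had to be assembled by hand from a SparkConf and a SparkContext (plus a SQLContext for DataFrame work). The following is a rough sketch of that older pattern, shown only to illustrate what SparkSession now wraps for us; it is not needed in current code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Pre-2.0 style: build the configuration and contexts explicitly
conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("GenericAppName") \
    .set("spark.executor.memory", "6gb")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)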
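The effect of getOrCreate can also be seen directly: calling the builder a second time in the same process does not start another cluster; it simply hands back the session that already exists. A quick sketch, assuming the spark variable from the earlier step:

# A second call to the builder returns the existing session rather than a new one
same_spark = SparkSession.builder.getOrCreate()
print(same_spark is spark)   # True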
There's more...
For development purposes, while we are building an application against smaller datasets, we can simply use master("local"). If we were to deploy to a production environment, we would specify master("local[*]") to ensure that we use the maximum number of available cores and get optimal performance.
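As a sketch of how that choice looks in code, the snippet below switches from a single-threaded development session to one that uses all available cores; the spark.stop() call matters because getOrCreate would otherwise return the existing session and ignore the new master setting:

# Development: a single worker thread is enough for small datasets
spark = SparkSession.builder \
    .master("local") \
    .appName("GenericAppName") \
    .getOrCreate()

# Switch to all available cores: stop the existing session first, because
# getOrCreate would otherwise hand back the old one and ignore the new master
spark.stop()
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .getOrCreate()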
See also
To learn more about SparkSession.builder, visit the following website:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html