For most chapters, one of the first things that we will do is to initialize and configure our Spark cluster.
Starting and configuring a Spark cluster
Getting ready
Ensure that the following import is available before initializing the cluster:
- from pyspark.sql import SparkSession
How to do it...
This section walks through the steps to initialize and configure a Spark cluster.
- Import SparkSession using the following script:
from pyspark.sql import SparkSession
- Configure a SparkSession and assign it to a variable named spark using the following script:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
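Once the session is available, it can help to confirm that the configuration took effect. The following is a minimal sanity check, assuming the spark variable created above; the printed values are simply what we would expect given this recipe's settings:

# Confirm that the session is live and that the builder settings were applied
print(spark.version)                             # the running Spark version
print(spark.sparkContext.master)                 # local[*]
print(spark.sparkContext.appName)                # GenericAppName
print(spark.conf.get("spark.executor.memory"))   # 6gb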
How it works...
This section explains how SparkSession works as the entry point for developing with Spark.
- Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext to begin development in Spark; creating a SparkSession handles initializing the cluster for us. Additionally, it is important to note that SparkSession is part of the sql module of pyspark. A sketch of the older pattern is shown after this list for comparison.
- We can assign properties to our SparkSession:
- master: sets the Spark master URL; local[*] runs Spark on our local machine with one worker thread per available core.
- appName: assigns a name to the application.
- config: sets the spark.executor.memory property to 6gb.
- getOrCreate: creates a new SparkSession if none exists and retrieves the existing one if it does, as demonstrated in the sketch after this list.
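For comparison, before Spark 2.0 the entry point had to be assembled by hand from a SparkConf and a SparkContext (plus a SQLContext for DataFrame work). The following is a rough sketch of that older pattern, shown only to illustrate what SparkSession now wraps for us; it is not needed in current code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Pre-2.0 style: build the configuration and contexts explicitly
conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("GenericAppName") \
    .set("spark.executor.memory", "6gb")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)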
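The effect of getOrCreate can also be seen directly: calling the builder a second time in the same process does not start another cluster; it simply hands back the session that already exists. A quick sketch, assuming the spark variable from the earlier step:

# A second call to the builder returns the existing session rather than a new one
same_spark = SparkSession.builder.getOrCreate()
print(same_spark is spark)   # True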
There's more...
For development purposes, while we are building an application against smaller datasets, we can simply use master("local"). If we were to deploy to a production environment, we would specify master("local[*]") to ensure that we use the maximum number of available cores and get optimal performance.
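As a sketch of how that choice looks in code, the snippet below switches from a single-threaded development session to one that uses all available cores; the spark.stop() call matters because getOrCreate would otherwise return the existing session and ignore the new master setting:

# Development: a single worker thread is enough for small datasets
spark = SparkSession.builder \
    .master("local") \
    .appName("GenericAppName") \
    .getOrCreate()

# Switch to all available cores: stop the existing session first, because
# getOrCreate would otherwise hand back the old one and ignore the new master
spark.stop()
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .getOrCreate()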
See also
To learn more about SparkSession.builder, visit the following website:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html