
You're reading from PySpark Cookbook Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python

Product type Paperback

Published in Jun 2018

Publisher Packt

ISBN-13 9781788835367

Length 330 pages

Edition 1st Edition

Languages

Python

Tools

Apache Spark

Concepts

Big Data

Authors (2):

Tomasz Drabas

Denny Lee


To create a SparkSession, we will use the Builder class (accessed via the .builder property of the SparkSession class). You can specify some basic properties of the SparkSession here:

The .master(...) method allows you to specify the Spark master to connect to (in our preceding example, we would be running a local session with two cores)

The .appName(...) method lets you specify a friendly name for your app

The .config(...) method allows you to refine your session's behavior further; the most important SparkSession parameters are listed in the following table

The .getOrCreate() method returns a new SparkSession if one has not been created yet; otherwise, it returns a reference to the already existing SparkSession
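The four builder calls described above can be sketched as follows. This is a minimal illustration rather than the recipe's own listing: the helper names (`local_master`, `create_local_session`), the app name, and the values in `LOCAL_SETTINGS` are assumptions chosen for the example, and PySpark is imported lazily so the helpers can be loaded on a machine without it.

```python
# Hypothetical settings for illustration only; the values are assumptions,
# not recommendations from the recipe.
LOCAL_SETTINGS = {
    "spark.executor.memory": "2g",
    "spark.executor.cores": "2",
}


def local_master(cores=None):
    """Return a local master URL: 'local[N]' for N cores, 'local[*]' for all."""
    return "local[*]" if cores is None else f"local[{cores}]"


def create_local_session(app_name="LocalSparkSession", cores=2, settings=None):
    """Build a local SparkSession, or return the one that already exists."""
    # Imported here so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder
        .master(local_master(cores))  # where to run: local session, N cores
        .appName(app_name)            # friendly name shown in the Spark UI
    )
    # Apply each extra configuration parameter with .config(key, value).
    for key, value in (settings or {}).items():
        builder = builder.config(key, value)

    # Creates the session on first use; later calls return the existing one.
    return builder.getOrCreate()
```

With PySpark installed, `spark = create_local_session(settings=LOCAL_SETTINGS)` would start a two-core local session, and calling the function a second time in the same process should hand back the same session rather than starting another.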

The following table gives an example list of the most useful configuration parameters for a local instance of Spark:

Parameter	Function	Default
`spark.app.name`	Specifies a friendly name for your application	(none)
`spark.driver.cores`	Number of cores for the driver node to use. This is only applicable for app deployments in cluster mode (see the following `spark.submit.deployMode` parameter).	1
`spark.driver.memory`	Specifies the amount of memory for the driver process. If using `spark-submit` in client mode, you should specify this on the command line using the `--driver-memory` switch rather than configuring your session using this parameter, because the JVM will already have started at this point.	1g
`spark.executor.cores`	Number of cores for an executor to use. Setting this parameter while running locally allows you to use all the available cores on your machine.	1 in YARN deployment, all available cores on the worker in standalone and Mesos deployments
`spark.executor.memory`	Specifies the amount of memory per executor process.	1g
`spark.submit.pyFiles`	List of `.zip`, `.egg`, or `.py` files, separated by commas. These will be added to the `PYTHONPATH` so that they are accessible for Python apps.	(none)
`spark.submit.deployMode`	Deploy mode of the Spark driver program. Specifying `'client'` will launch the driver program locally on the machine (it can be the driver node), while specifying `'cluster'` will utilize one of the nodes on a remote cluster.	(none)
`spark.pyspark.python`	Python binary that should be used by the driver and all the executors.	(none)

Some of these parameters are also applicable if you are working in a cluster environment with multiple worker nodes. In the next recipe, we will explain how to set up and administer a multi-node Spark cluster deployed over YARN.
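The table notes that, in client mode, `spark.driver.memory` is better passed on the command line than set on the session. As a rough sketch of that route, the helper below only assembles a `spark-submit` argument list and does not run it. The function name `spark_submit_command`, the script name `my_app.py`, and all values are hypothetical; `--master`, `--driver-memory`, and `--conf` are standard `spark-submit` switches.

```python
def spark_submit_command(script, master="local[2]", driver_memory="2g", conf=None):
    """Assemble a spark-submit argument list (nothing is executed here)."""
    command = [
        "spark-submit",
        "--master", master,
        # Passed as a switch because the driver JVM starts before any
        # session-level configuration could take effect.
        "--driver-memory", driver_memory,
    ]
    # Remaining parameters go through --conf key=value, in a stable order.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(script)
    return command


command = spark_submit_command(
    "my_app.py",  # hypothetical script name
    conf={"spark.executor.memory": "2g", "spark.app.name": "MyApp"},
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command)` on a machine where `spark-submit` is on the `PATH`; keeping it as a list avoids having to quote `local[2]` for the shell.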

Some environment variables also allow you to fine-tune your Spark environment further, specifically PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS. We have already covered these in the Installing Spark from sources recipe.
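As a reminder of how these two variables are typically used, the sketch below prepares an environment in which the `pyspark` launcher would start the driver through Jupyter. The helper name `driver_environment` and the `jupyter` / `notebook` values are assumptions for illustration (a common setup, not necessarily the one used in the earlier recipe), and nothing is launched here.

```python
import os


def driver_environment(python_bin, python_opts="", base_env=None):
    """Return a copy of the environment with the PySpark driver variables set."""
    # Copy so the current process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = python_bin
    if python_opts:
        env["PYSPARK_DRIVER_PYTHON_OPTS"] = python_opts
    return env


# An empty base environment keeps the example output small and predictable.
env = driver_environment("jupyter", "notebook", base_env={})
print(env)
```

In practice you would omit `base_env` so that `PATH` and the other inherited variables are preserved, and then pass the result to something like `subprocess.run(["pyspark"], env=driver_environment("jupyter", "notebook"))`.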
