Configuring a local instance of Spark


There is actually not much you need to do to configure a local instance of Spark. The beauty of Spark is that all you need to get started is to follow either of the two previous recipes (installing from sources or from binaries), and you can begin using it right away. In this recipe, however, we will walk you through the most useful SparkSession configuration options.

Getting ready

In order to follow this recipe, a working Spark environment is required. This means that you will have to have gone through the previous three recipes and successfully installed and tested your environment, or already have a working Spark environment set up.

No other prerequisites are necessary.

How to do it...

To configure your session in a Spark version lower than 2.0, you would normally have to create a SparkConf object, set all your options to the right values, and then build the SparkContext (SQLContext if you wanted to use DataFrames, and HiveContext if you wanted access to Hive tables). Starting from Spark 2.0, you just need to create a SparkSession, just like in the following snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("Your-app-name") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
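
For comparison, the pre-2.0 pattern described in the preceding paragraph might look roughly like this (a sketch, assuming a Spark 1.x installation):

# Pre-2.0 pattern (sketch): build a SparkConf, then the contexts explicitly
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf() \
    .setMaster("local[2]") \
    .setAppName("Your-app-name") \
    .set("spark.some.config.option", "some-value")

sc = SparkContext(conf=conf)   # entry point for RDDs
sqlContext = SQLContext(sc)    # entry point for DataFrames
# For access to Hive tables, you would build a HiveContext(sc) instead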

How it works...

To create a SparkSession, we will use the Builder class (accessed via the .builder property of the SparkSession class). You can specify some basic properties of the SparkSession here:

  • The .master(...) method lets you specify the master URL; in our preceding example, local[2] runs a local session using two cores
  • The .appName(...) method lets you specify a friendly name for your app
  • The .config(...) method allows you to refine your session's behavior further; the list of the most important SparkSession parameters is outlined in the following table
  • The .getOrCreate() method returns a new SparkSession if one has not been created yet, or a reference to the already existing SparkSession (see the sketch following this list)
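
As a quick sanity check, you can read these settings back from the running session and confirm that .getOrCreate() reuses it. This is a sketch that assumes the spark session created in the earlier snippet:

# Inspect the session created earlier
print(spark.sparkContext.master)                    # local[2]
print(spark.sparkContext.appName)                   # Your-app-name
print(spark.conf.get("spark.some.config.option"))   # some-value

# A second call to getOrCreate() returns the same session object
same_session = SparkSession.builder.getOrCreate()
print(same_session is spark)                        # True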

The following table gives an example list of the most useful configuration parameters for a local instance of Spark:

Note

Some of these parameters are also applicable if you are working in a cluster environment with multiple worker nodes. In the next recipe, we will explain how to set up and administer a multi-node Spark cluster deployed over YARN. 

| Parameter | Function | Default |
| --- | --- | --- |
| spark.app.name | Specifies a friendly name for your application | (none) |
| spark.driver.cores | Number of cores for the driver node to use. Only applicable to apps deployed in cluster mode (see the spark.submit.deployMode parameter below). | 1 |
| spark.driver.memory | Amount of memory for the driver process. If using spark-submit in client mode, specify this on the command line with the --driver-memory switch rather than in the session configuration, as the JVM will have already started by that point. | 1g |
| spark.executor.cores | Number of cores for an executor to use. Setting this parameter while running locally allows you to use all the available cores on your machine. | 1 in YARN mode; all available cores on the worker in standalone and Mesos modes |
| spark.executor.memory | Amount of memory per executor process. | 1g |
| spark.submit.pyFiles | Comma-separated list of .zip, .egg, or .py files. These are added to the PYTHONPATH so that they are accessible to Python apps. | (none) |
| spark.submit.deployMode | Deploy mode of the Spark driver program. 'client' launches the driver locally on the machine (it can act as the driver node), while 'cluster' launches it on one of the nodes of a remote cluster. | (none) |
| spark.pyspark.python | Python binary to be used by the driver and all the executors. | (none) |
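
To tie the table back to the builder pattern, here is a minimal sketch of a local session that sets a few of these parameters; the values shown are illustrative only, not recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                          # use all available local cores
    .appName("local-config-example")             # spark.app.name
    .config("spark.executor.memory", "2g")       # memory per executor process
    .config("spark.pyspark.python", "python3")   # Python binary for driver and executors
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.executor.memory"))   # 2g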

 

There are some environment variables that also allow you to further fine-tune your Spark environment. Specifically, we are talking about the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables. We have already covered these in the Installing Spark from sources recipe.

See also

  • Check the full list of all available configuration options here: https://spark.apache.org/docs/latest/configuration.html