Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
PySpark Cookbook

You're reading from   PySpark Cookbook Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python

Arrow left icon
Product type Paperback
Published in Jun 2018
Publisher Packt
ISBN-13 9781788835367
Length 330 pages
Edition 1st Edition
Languages
Concepts
Arrow right icon
Authors (2):
Arrow left icon
Tomasz Drabas Tomasz Drabas
Author Profile Icon Tomasz Drabas
Tomasz Drabas
Denny Lee Denny Lee
Author Profile Icon Denny Lee
Denny Lee
Arrow right icon
View More author details
Toc

Table of Contents (9) Chapters Close

Preface 1. Installing and Configuring Spark FREE CHAPTER 2. Abstracting Data with RDDs 3. Abstracting Data with DataFrames 4. Preparing Data for Modeling 5. Machine Learning with MLlib 6. Machine Learning with the ML Module 7. Structured Streaming with PySpark 8. GraphFrames – Graph Theory with PySpark

Configuring a session in Jupyter

Working in Jupyter is great as it allows you to develop your code interactively, and document and share your notebooks with colleagues. The problem, however, with running Jupyter against a local Spark instance is that the SparkSession gets created automatically and by the time the notebook is running, you cannot change much in that session's configuration.

In this recipe, we will learn how to install Livy, a REST service to interact with Spark, and sparkmagic, a package that will allow us to configure sessions interactively as well:

Source: http://bit.ly/2iO3EwC

Getting ready

We assume that you either have installed Spark via binaries or compiled the sources as we have shown you in the previous recipes. In other words, by now, you should have a working Spark environment. You will also need Jupyter: if you do not have it, follow the steps from the previous recipe to install it. 

No other prerequisites are required.

How to do it...

To install Livy and sparkmagic, we have created a script that will do this automatically with minimal interaction from you. You can find it in the Chapter01/installLivy.sh folder. You should be familiar with most of the functions that we're going to use here by now, so we will focus only on those that are different (highlighted in bold in the following code). Here is the high-level view of the script's structure:

#!/bin/bash

# Shell script for installing Spark from binaries
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_livy_binary="http://mirrors.ocf.berkeley.edu/apache/incubator/livy/0.4.0-incubating/livy-0.4.0-incubating-bin.zip"
_livy_archive=$( echo "$_livy_binary" | awk -F '/' '{print $NF}' )
_livy_dir=$( echo "${_livy_archive%.*}" )
_livy_destination="/opt/livy"
_hadoop_destination="/opt/hadoop"
...
checkOS
printHeader
createTempDir
downloadThePackage $( echo "${_livy_binary}" )
unpack $( echo "${_livy_archive}" )
moveTheBinaries $( echo "${_livy_dir}" ) $( echo "${_livy_destination}" )

# create log directory inside the folder
mkdir -p "$_livy_destination/logs"

checkHadoop
installJupyterKernels
setSparkEnvironmentVariables
cleanUp

How it works...

As with all other scripts we have presented so far, we will begin by setting some global variables.

If you do not know what these mean, check the Installing Spark from sources recipe.

Livy requires some configuration files from Hadoop. Thus, as part of this script, we allow you to install Hadoop should it not be present on your machine. That is why we now allow you to pass arguments to the downloadThePackage, unpack, and moveTheBinaries functions.

The changes to the functions are fairly self-explanatory, so for the sake of space, we will not be pasting the code here. You are more than welcome, though, to peruse the relevant portions of the installLivy.sh script.

Installing Livy drills down literally to downloading the package, unpacking it, and moving it to its final destination (in our case, this is /opt/livy). 

Checking if Hadoop is installed is the next thing on our to-do list. To run Livy with local sessions, we require two environment variables: SPARK_HOME and HADOOP_CONF_DIR; the SPARK_HOME is definitely set but if you do not have Hadoop installed, you most likely will not have the latter environment variable set:

function checkHadoop() {
if type -p hadoop; then
echo "Hadoop executable found in PATH"
_hadoop=hadoop
elif [[ -n "$HADOOP_HOME" ]] && [[ -x "$HADOOP_HOME/bin/hadoop" ]]; then
echo "Found Hadoop executable in HADOOP_HOME"
_hadoop="$HADOOP_HOME/bin/hadoop"
else
echo "No Hadoop found. You should install Hadoop first. You can still continue but some functionality might not be available. "
echo
echo -n "Do you want to install the latest version of Hadoop? [y/n]: "
read _install_hadoop

case "$_install_hadoop" in
y*) installHadoop ;;
n*) echo "Will not install Hadoop" ;;
*) echo "Will not install Hadoop" ;;
esac
fi
}

function installHadoop() {
_hadoop_binary="http://mirrors.ocf.berkeley.edu/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz"
_hadoop_archive=$( echo "$_hadoop_binary" | awk -F '/' '{print $NF}' )
_hadoop_dir=$( echo "${_hadoop_archive%.*}" )
_hadoop_dir=$( echo "${_hadoop_dir%.*}" )

downloadThePackage $( echo "${_hadoop_binary}" )
unpack $( echo "${_hadoop_archive}" )
moveTheBinaries $( echo "${_hadoop_dir}" ) $( echo "${_hadoop_destination}" )
}

The checkHadoop function first checks if the hadoop binary is present on the PATH; if not, it will check if the HADOOP_HOME variable is set and, if it is, it will check if the hadoop binary can be found inside the $HADOOP_HOME/bin folder. If both attempts fail, the script will ask you if you want to install the latest version of Hadoop; the default answer is n but if you answer y, the installation will begin.

Once the installation finishes, we will begin installing the additional kernels for the Jupyter Notebooks.

A kernel is a piece of software that translates the commands from the frontend notebook to the backend environment (like Python). For a list of available Jupyter kernels check out the following link: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels. Here are some instructions on how to develop a kernel yourself: http://jupyter-client.readthedocs.io/en/latest/kernels.html.

Here's the function that handles the kernel's installation:

function installJupyterKernels() {
# install the library
pip install sparkmagic
echo

# ipywidgets should work properly
jupyter nbextension enable --py --sys-prefix widgetsnbextension
echo

# install kernels
# get the location of sparkmagic
_sparkmagic_location=$(pip show sparkmagic | awk -F ':' '/Location/ {print $2}')

_temp_dir=$(pwd) # store current working directory

cd $_sparkmagic_location # move to the sparkmagic folder
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel

echo

# enable the ability to change clusters programmatically
jupyter serverextension enable --py sparkmagic
echo

# install autowizwidget
pip install autovizwidget

cd $_temp_dir
}

First, we install the sparkmagic package for Python. Quoting directly from https://github.com/jupyter-incubator/sparkmagic:

"Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter Notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment."

The following command enables the Javascript extensions in Jupyter Notebooks so that ipywidgets can work properly; if you have an Anaconda distribution of Python, this package will be installed automatically.

Following this, we install the kernels. We need to switch to the folder where sparkmagic was installed into. The pip show <package> command displays all relevant information about the installed packages; from the output, we only extract the Location using awk.

To install the kernels, we use the jupyter-kernelspec install <kernel> command. For example, the command will install the sparkmagic kernel for the Scala API of Spark:

jupyter-kernelspec install sparkmagic/kernels/sparkkernel 

Once all the kernels are installed, we enable Jupyter to use sparkmagic so that we can change clusters programmatically. Finally, we will install the autovizwidget, an auto-visualization library for pandas dataframes.

This concludes the Livy and sparkmagic installation part.

There's more...

Now that we have everything in place, let's see what this can do. 

First, start Jupyter (note that we do not use the pyspark command):

jupyter notebook

You should now be able to see the following options if you want to add a new notebook:

If you click on PySpark, it will open a notebook and connect to a kernel. 

There are a number of available magics to interact with the notebooks; type %%help to list them all. Here's the list of the most important ones:

Magic Example Explanation
info %%info Outputs session information from Livy.
cleanup %%cleanup -f Delete all sessions running on the current Livy endpoint. The -f switch forces the cleanup.
delete %%delete -f -s 0 Deletes the session specified by the -s switch; the -f switch forces the deletion.
configure

%%configure -f

{"executorMemory": "1000M", "executorCores": 4}

Arguably the most useful magic. Allows you to configure your session. Check http://bit.ly/2kSKlXr for the full list of available configuration parameters.
sql

%%sql -o tables -q

SHOW TABLES

Executes an SQL query against the current SparkSession.
local

%%local

a=1

All the code in the notebook cell with this magic will be executed locally against the Python environment.

Once you have configured your session, you will get information back from Livy about the active sessions that are currently running:

Let's try to create a simple data frame using the following code:

from pyspark.sql.types import *

# Generate our data
ListRDD = sc.parallelize([
(123, 'Skye', 19, 'brown'),
(223, 'Rachel', 22, 'green'),
(333, 'Albert', 23, 'blue')
])

# The schema is encoded using StructType
schema = StructType([
StructField("id", LongType(), True),
StructField("name", StringType(), True),
StructField("age", LongType(), True),
StructField("eyeColor", StringType(), True)
])

# Apply the schema to the RDD and create DataFrame
drivers = spark.createDataFrame(ListRDD, schema)

# Creates a temporary view using the data frame
drivers.createOrReplaceTempView("drivers")

Once you execute the preceding code in a cell inside the notebook, only then will the SparkSession be created:

If you execute %%sql magic, you will get the following:

See also

You have been reading a chapter from
PySpark Cookbook
Published in: Jun 2018
Publisher: Packt
ISBN-13: 9781788835367
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image