PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python

Installing and Configuring Spark

In this chapter, we will cover how to install and configure Spark, either as a local instance, a multi-node cluster, or in a virtual environment. You will learn the following recipes:

  • Installing Spark requirements
  • Installing Spark from sources
  • Installing Spark from binaries
  • Configuring a local instance of Spark
  • Configuring a multi-node instance of Spark
  • Installing Jupyter
  • Configuring a session in Jupyter
  • Working with Cloudera Spark images

Introduction

We cannot begin a book on Spark (well, on PySpark) without first specifying what Spark is. Spark is a powerful, flexible, open source data processing and querying engine. It is extremely easy to use and provides the means to solve a huge variety of problems, from processing unstructured, semi-structured, or structured data, through streaming, to machine learning. With over 1,000 contributors from more than 250 organizations (not to mention over 3,000 Spark Meetup community members worldwide), Spark is now one of the largest open source projects in the portfolio of the Apache Software Foundation.

The origins of Spark can be traced to 2012, when it was first released; Matei Zaharia developed the first versions of the Spark processing engine at UC Berkeley as part of his PhD thesis. Since then, Spark has become extremely popular, and its popularity stems from a number of reasons:

  • It is fast: It is estimated that Spark is 100 times faster than Hadoop when working purely in memory, and around 10 times faster when reading or writing data to a disk.
  • It is flexible: You can leverage the power of Spark from a number of programming languages; Spark natively supports interfaces in Scala, Java, Python, and R. 
  • It is extensible: As Spark is an open source package, you can easily extend it by introducing your own classes or extending the existing ones. 
  • It is powerful: Many machine learning algorithms are already implemented in Spark so you do not need to add more tools to your stack—most of the data engineering and data science tasks can be accomplished while working in a single environment.
  • It is familiar: Data scientists and data engineers, who are accustomed to using Python's pandas, or R's data.frames or data.tables, should have a much gentler learning curve (although the differences between these data types exist). Moreover, if you know SQL, you can also use it to wrangle data in Spark!
  • It is scalable: Spark can run locally on your machine (with all the limitations such a solution entails). However, the same code that runs locally can be deployed to a cluster of thousands of machines with little-to-no changes. 

For the remainder of this book, we will assume that you are working in a Unix-like environment such as Linux (throughout this book, we will use Ubuntu Server 16.04 LTS) or macOS (running macOS High Sierra); all the code provided has been tested in these two environments. For this chapter (and some other ones, too), an internet connection is also required as we will be downloading a bunch of binaries and sources from the internet. 

We will not be focusing on installing Spark in a Windows environment as it is not truly supported by the Spark developers. However, if you are inclined to try, you can follow some of the instructions you will find online, such as from the following link: http://bit.ly/2Ar75ld.

Knowing how to use the command line and how to set some environment variables on your system is useful, but not really required—we will guide you through the steps.

Installing Spark requirements

Spark requires a handful of environments to be present on your machine before you can install and use it. In this recipe, we will focus on getting your machine ready for Spark installation.

Getting ready

To execute this recipe, you will need a bash Terminal and an internet connection. 

Also, before we start any work, you should clone the GitHub repository for this book. The repository contains all the code (in the form of notebooks) and all the data you will need to follow the examples in this book. To clone the repository, go to http://bit.ly/2ArlBck, click on the Clone or download button, and copy the URL by clicking on the icon next to it.

Next, go to your Terminal and issue the following command:

git clone git@github.com:drabastomek/PySparkCookbook.git

If your git environment is set up properly, the whole GitHub repository should clone to your disk. No other prerequisites are required.

How to do it...

There are really only two main requirements for installing PySpark: Java and Python. Additionally, you can also install Scala and R if you want to use those languages, and we will also check for Maven, which we will use to compile the Spark sources. 

To do this, we will use the checkRequirements.sh script to check for all the requirements: the script is located in the Chapter01 folder from the GitHub repository.

The following code block shows the high-level portions of the script found in the Chapter01/checkRequirements.sh file. Note that some portions of the code were omitted here for brevity:

#!/bin/bash

# Shell script for checking the dependencies
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_java_required=1.8
_python_required=3.4
_r_required=3.1
_scala_required=2.11
_mvn_required=3.3.9

# parse command line arguments
_args_len="$#"
...

printHeader
checkJava
checkPython

if [ "${_check_R_req}" = "true" ]; then
    checkR
fi

if [ "${_check_Scala_req}" = "true" ]; then
    checkScala
fi

if [ "${_check_Maven_req}" = "true" ]; then
    checkMaven
fi

How it works...

First, we specify all the required packages and their minimum versions; looking at the preceding code, you can see that Spark 2.3.1 requires Java 1.8+ and Python 3.4 or higher (and we will always check for these two environments). Additionally, if you want to use R or Scala, the minimum requirements for these two packages are 3.1 and 2.11, respectively. Maven, as mentioned earlier, will be used to compile the Spark sources; for that, Spark requires Maven version 3.3.9 or higher.

You can check the Spark requirements here: https://spark.apache.org/docs/latest/index.html 
You can check the requirements for building Spark here: https://spark.apache.org/docs/latest/building-spark.html.

Next, we parse the command-line arguments:

if [ "$_args_len" -ge 0 ]; then
    while [[ "$#" -gt 0 ]]
    do
        key="$1"

        case $key in
            -m|--Maven)
                _check_Maven_req="true"
                shift # past argument
                ;;
            -r|--R)
                _check_R_req="true"
                shift # past argument
                ;;
            -s|--Scala)
                _check_Scala_req="true"
                shift # past argument
                ;;
            *)
                shift # past argument
        esac
    done
fi

You, as a user, can specify whether you want to additionally check for the R, Scala, and Maven dependencies. To do so, run the following command from your command line (this one checks for all of them):

./checkRequirements.sh -s -m -r

The following is also a perfectly valid usage:

./checkRequirements.sh --Scala --Maven --R

Next, we call three functions: printHeader, checkJava, and checkPython. The printHeader function is nothing more than a simple way for the script to state what it does, and since it is fairly self-explanatory, we will skip it here; you are welcome to peruse the relevant portions of the checkRequirements.sh script yourself.

Next, we will check whether Java is installed. First, we just print to the Terminal that we are performing checks on Java (this is common across all of our functions, so we will only mention it here):

function checkJava() {
    echo
    echo "##########################"
    echo
    echo "Checking Java"
    echo

Following this, we will check if the Java environment is installed on your machine:

if type -p java; then
    echo "Java executable found in PATH"
    _java=java
elif [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
    echo "Found Java executable in JAVA_HOME"
    _java="$JAVA_HOME/bin/java"
else
    echo "No Java found. Install Java version $_java_required or higher first or specify JAVA_HOME variable that will point to your Java binaries."
    exit
fi

First, we use the type command to check if the java command is available; the type -p command returns the location of the java binary if it exists. This also implies that the bin folder containing Java binaries has been added to the PATH.

If you are certain you have the binaries installed (be it Java, Python, R, Scala, or Maven), you can jump to the Updating PATH section in this recipe to see how to let your computer know where these binaries live.

If this fails, we will revert to checking if the JAVA_HOME environment variable is set, and if it is, we will try to see if it contains the required java binary: [[ -x "$JAVA_HOME/bin/java" ]]. Should this fail, the program will print the message that no Java environment could be found and will exit (without checking for other required packages, like Python).

If, however, the Java binary is found, then we can check its version:

_java_version=$("$_java" -version 2>&1 | awk -F '"' '/version/ {print $2}')
echo "Java version: $_java_version (min.: $_java_required)"

if [[ "$_java_version" < "$_java_required" ]]; then
    echo "Java version required is $_java_required. Install the required version first."
    exit
fi
echo

We first execute the java -version command in the Terminal, which normally prints a version string such as java version "1.8.0_151" to the standard error stream (hence the 2>&1 redirection in the preceding code).

We then pipe that output to awk, splitting each line at the quote (") character (the -F switch) and filtering the lines down to the one containing version; we take the second field (the $2) as the version of the Java binaries installed on our machine. We store it in the _java_version variable, which we also print to the screen using the echo command.
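You can test this awk pipeline in isolation by feeding it a sample string (the version shown here is just an example):

```shell
# Parse a sample `java -version` string the same way the script does
echo 'java version "1.8.0_151"' | awk -F '"' '/version/ {print $2}'
# prints: 1.8.0_151
```

The single quotes around the sample string protect the embedded double quotes from the shell.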

If you do not know what awk is or how to use it, we recommend this book from Packt: http://bit.ly/2BtTcBV.

Finally, we check if the _java_version we just obtained is lower than _java_required. If this evaluates to true, we will stop the execution, instead telling you to install the required version of Java. 
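Note that the < operator used here performs a lexicographic string comparison, which can mis-order versions such as 1.9 and 1.10. The following is a sketch of a more robust check, assuming GNU sort with the -V (version sort) switch is available (it is on most Linux distributions, though not in the stock BSD sort shipped with macOS):

```shell
# version_lt VER1 VER2 -> succeeds if VER1 is strictly lower than VER2
version_lt() {
    [ "$1" != "$2" ] && \
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

version_lt "1.9" "1.10" && echo "1.9 is lower than 1.10"
```

A plain string comparison would get this example wrong, since "1.10" sorts before "1.9" lexicographically.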

The logic implemented in the checkPython, checkR, checkScala, and checkMaven functions follows in a very similar way. The only differences are in what binary we call and in the way we check the versions:

  • For Python, we run "$_python" --version 2>&1 | awk -F ' ' '{print $2}', as checking the Python version (for the Anaconda distribution) prints the following to the screen: Python 3.5.2 :: Anaconda 2.4.1 (x86_64)
  • For R, we use "$_r" --version 2>&1 | awk -F ' ' '/R version/ {print $3}', as checking R's version writes (a lot) to the screen; we only use the line that starts with R version, for example: R version 3.4.2 (2017-09-28) -- "Short Summer"
  • For Scala, we utilize "$_scala" -version 2>&1 | awk -F ' ' '{print $5}', given that checking Scala's version prints the following: Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
  • For Maven, we check "$_mvn" --version 2>&1 | awk -F ' ' '/Apache Maven/ {print $3}', as Maven prints out the following (and more!) when asked for its version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T00:58:13-07:00)

If you want to learn more, you should now be able to read the other functions with ease.

There's more...

If any of your dependencies are not installed, you need to install them before continuing with the next recipe. It goes beyond the scope of this book to guide you step-by-step through the installation process of all of these, but here are some helpful links to show you how to do it.

Installing Java

Installing Python

We have been using (and highly recommend) the Anaconda distribution of Python as it comes with the most commonly used packages included in the installer. It also comes with the conda package management tool built in, which makes installing other packages a breeze.

You can download Anaconda from http://www.continuum.io/downloads; select the appropriate version that fulfills Spark's requirements. For macOS installation instructions, go to http://bit.ly/2zZPuUf, and for a Linux installation manual, go to http://bit.ly/2ASLUvg.

Installing R

R is distributed via Comprehensive R Archive Network (CRAN). The macOS version can be downloaded from here, https://cran.r-project.org/bin/macosx/, whereas the Linux one is available here: https://cran.r-project.org/bin/linux/.

Download the version appropriate for your machine and follow the installation instructions on the screen. For the macOS version, you can choose to install just the R core packages without the GUI and everything else as Spark does not require those. 

Installing Scala

Installing Scala is even simpler.

Go to http://bit.ly/2Am757R and download the sbt-*.*.*.tgz archive (at the time of writing this book, the latest version was sbt-1.0.4.tgz). Next, in your Terminal, navigate to the folder you have just downloaded the archive to and issue the following commands:

tar -xvf sbt-1.0.4.tgz
sudo mv sbt-1.0.4/ /opt/scala/

That's it. Now, you can skip to the Updating PATH section in this recipe to update your PATH.

Installing Maven

Maven's installation is quite similar to that of Scala. Go to https://maven.apache.org/download.cgi and download the apache-maven-*.*.*-bin.tar.gz archive. At the time of writing this book, the newest version was 3.5.2. Similarly to Scala, open the Terminal, navigate to the folder you have just downloaded the archive to, and type:

tar -xvf apache-maven-3.5.2-bin.tar.gz
sudo mv apache-maven-3.5.2-bin/ /opt/apache-maven/

Once again, that is it for what you need to do with regards to installing Maven. Check the next subsection for instructions on how to update your PATH.

Updating PATH

Unix-like operating systems (and Windows, too) use the concept of a PATH to search for binaries (or executables, in the case of Windows). The PATH is nothing more than a list of folders separated by the colon character : (a semicolon on Windows) that tells the operating system where to look for binaries.
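To inspect what is currently on your PATH, you can translate the colons into newlines so that each folder prints on its own line:

```shell
# List every folder the operating system searches for binaries
echo "$PATH" | tr ':' '\n'
```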

To add something to your PATH (and make it a permanent change), you need to edit either the .bash_profile (macOS) or .bashrc (Linux) file; these are located in the home folder for your user. Thus, to add both the Scala and Maven binaries to the PATH, you can do the following (on macOS):

cp ~/.bash_profile ~/.bash_profile_old   # make a copy just in case
echo 'export SCALA_HOME=/opt/scala' >> ~/.bash_profile
echo 'export MAVEN_HOME=/opt/apache-maven' >> ~/.bash_profile
echo 'export PATH=$SCALA_HOME/bin:$MAVEN_HOME/bin:$PATH' >> ~/.bash_profile

Note the single quotes: they stop the current shell from expanding $SCALA_HOME, $MAVEN_HOME, and $PATH at the time of the echo (the first two are not set yet at that point), so the variable references are written literally and expanded every time the profile is sourced.

On Linux, the equivalent looks as follows:

cp ~/.bashrc ~/.bashrc_old   # make a copy just in case
echo 'export SCALA_HOME=/opt/scala' >> ~/.bashrc
echo 'export MAVEN_HOME=/opt/apache-maven' >> ~/.bashrc
echo 'export PATH=$SCALA_HOME/bin:$MAVEN_HOME/bin:$PATH' >> ~/.bashrc

The preceding commands simply append to the end of either the .bash_profile or .bashrc file using the redirection operator >>.
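If the difference between > and >> is new to you, you can see it with a throwaway file:

```shell
tmpfile="$(mktemp)"
echo "first line"  >  "$tmpfile"   # > truncates the file before writing
echo "second line" >> "$tmpfile"   # >> appends to the end of the file
cat "$tmpfile"
rm "$tmpfile"
```

The cat command prints both lines; had the second echo used >, only second line would remain in the file.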

Once you execute the preceding commands, restart your Terminal and run:

echo $PATH

It should now include paths to both the Scala and Maven binaries.

Installing Spark from sources

Spark is distributed in two ways: either as precompiled binaries or as a source code that gives you the flexibility to choose, for example, whether you need support for Hive or not. In this recipe, we will focus on the latter.

Getting ready

To execute this recipe, you will need a bash Terminal and an internet connection. Also, to follow through with this recipe, you will have to have already checked and/or installed all the required environments we went through in the previous recipe. In addition, you need to have administrative privileges (via the sudo command) which will be necessary to move the compiled binaries to the destination folder. 

By default, the binaries will be moved to the /opt/spark folder, and that is why you need administrative rights. If you are not an administrator on your machine, you can call the script with the -ns (or --nosudo) parameter; the destination folder will then switch to your home directory, and the script will create a spark folder within it.

No other prerequisites are required.

How to do it...

There are five major steps we will undertake to install Spark from sources (check the highlighted portions of the code):

  1. Download the sources from Spark's website
  2. Unpack the archive
  3. Build
  4. Move to the final destination
  5. Create the necessary environment variables

The skeleton for our code looks as follows (see the Chapter01/installFromSource.sh file):

#!/bin/bash

# Shell script for installing Spark from sources
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_spark_source="http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.1/spark-2.3.1.tgz"
_spark_archive=$( echo "$_spark_source" | awk -F '/' '{print $NF}' )
_spark_dir=$( echo "${_spark_archive%.*}" )
_spark_destination="/opt/spark"

...

checkOS
printHeader
downloadThePackage
unpack
build
moveTheBinaries
setSparkEnvironmentVariables
cleanUp

How it works...

First, we specify the location of Spark's source code. The _spark_archive contains the name of the archive; we use awk to extract the last element (here, the $NF built-in variable) from the _spark_source. The _spark_dir contains the name of the directory our archive will unpack into; in our current case, this will be spark-2.3.1. Finally, we specify the destination folder we will move the binaries to: it will either be /opt/spark (the default) or your home directory if you use the -ns (or --nosudo) switch when calling the ./installFromSource.sh script.
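You can check both extractions in isolation before running the full script:

```shell
_spark_source="http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.1/spark-2.3.1.tgz"

# awk's $NF is the last field after splitting the URL on '/'
_spark_archive=$( echo "$_spark_source" | awk -F '/' '{print $NF}' )

# ${var%.*} strips the shortest trailing '.*' match, that is, the .tgz extension
_spark_dir="${_spark_archive%.*}"

echo "$_spark_archive"   # spark-2.3.1.tgz
echo "$_spark_dir"       # spark-2.3.1
```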

Next, we check the OS name we are using:

function checkOS(){
    _uname_out="$(uname -s)"
    case "$_uname_out" in
        Linux*)  _machine="Linux";;
        Darwin*) _machine="Mac";;
        *)       _machine="UNKNOWN:${_uname_out}"
    esac

    if [ "$_machine" = "UNKNOWN:${_uname_out}" ]; then
        echo "Machine $_machine. Stopping."
        exit
    fi
}

First, we get the short name of the operating system using the uname command; the -s switch returns a shortened version of the OS name. As mentioned earlier, we only focus on two operating systems: macOS and Linux, so if you try to run this script on Windows or any other system, it will stop. This portion of the code is necessary to set the _machine flag properly: macOS and Linux use different methods to download the Spark source codes and different bash profile files to set the environment variables.

Next, we print out the header (we will skip the code for this part here, but you are welcome to check the Chapter01/installFromSource.sh script). Following this, we download the necessary source codes:

function downloadThePackage() {
    ...
    if [ -d _temp ]; then
        sudo rm -rf _temp
    fi

    mkdir _temp
    cd _temp

    if [ "$_machine" = "Mac" ]; then
        curl -O $_spark_source
    elif [ "$_machine" = "Linux" ]; then
        wget $_spark_source
    else
        echo "System: $_machine not supported."
        exit
    fi
}

First, we check whether a _temp folder exists and, if it does, we delete it. Next, we recreate an empty _temp folder and download the sources into it; on macOS, we use the curl method while on Linux, we use wget to download the sources.
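As an alternative sketch (not what the script itself does), you could pick a downloader by availability rather than by OS; command -v succeeds silently if a program is on the PATH:

```shell
# Choose whichever download tool is actually installed
if command -v curl >/dev/null 2>&1; then
    _downloader="curl -O"
elif command -v wget >/dev/null 2>&1; then
    _downloader="wget"
else
    _downloader=""
fi
echo "downloader: ${_downloader:-none found}"
```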

Did you notice the ellipsis '...' character in our code? Whenever we use such a character, we omit some less relevant or purely informational portions of the code. They are still present, though, in the sources checked into the GitHub repository.

Once the sources land on our machine, we unpack them using the tar tool, tar -xf $_spark_archive. This happens inside the unpack function.

Finally, we can start building the sources into binaries:

function build(){
    ...
    cd "$_spark_dir"
    ./dev/make-distribution.sh --name pyspark-cookbook -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
}

We use the make-distribution.sh script (distributed with Spark) to create our own Spark distribution, named pyspark-cookbook. The previous command will build the Spark distribution for Hadoop 2.7 and with Hive support. We will also be able to deploy it over YARN. Underneath the hood, the make-distribution.sh script is using Maven to compile the sources.
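Compiling Spark is memory-hungry; the official building-Spark documentation recommends raising Maven's heap and code cache limits before compiling, for example:

```shell
# Give Maven more memory for the Spark build (per the Spark build docs)
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
```

Without this, the build may fail with Java out-of-memory errors part-way through.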

Once the compilation finishes, we need to move the binaries to the _spark_destination folder:

function moveTheBinaries() {
    ...
    if [ -d "$_spark_destination" ]; then
        sudo rm -rf "$_spark_destination"
    fi

    cd ..
    sudo mv $_spark_dir/ $_spark_destination/
}

First, we check if the folder in the destination exists and, if it does, we remove it. Next, we simply move (mv) the $_spark_dir folder to its new home.

This is when you will need to type in the password if you did not use the -ns (or --nosudo) flag when invoking the installFromSource.sh script.

One of the last steps is to add new environment variables to your bash profile file:

function setSparkEnvironmentVariables() {
    ...
    if [ "$_machine" = "Mac" ]; then
        _bash=~/.bash_profile
    else
        _bash=~/.bashrc
    fi
    _today=$( date +%Y-%m-%d )

    # make a copy just in case
    if ! [ -f "$_bash.spark_copy" ]; then
        cp "$_bash" "$_bash.spark_copy"
    fi

    echo >> $_bash
    echo "###################################################" >> $_bash
    echo "# SPARK environment variables" >> $_bash
    echo "#" >> $_bash
    echo "# Script: installFromSource.sh" >> $_bash
    echo "# Added on: $_today" >> $_bash
    echo >> $_bash

    echo "export SPARK_HOME=$_spark_destination" >> $_bash
    echo "export PYSPARK_SUBMIT_ARGS=\"--master local[4]\"" >> $_bash
    echo "export PYSPARK_PYTHON=$(type -p python)" >> $_bash
    echo "export PYSPARK_DRIVER_PYTHON=jupyter" >> $_bash
    echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"notebook --NotebookApp.open_browser=False --NotebookApp.port=6661\"" >> $_bash

    # escape the dollar signs so SPARK_HOME and PATH are expanded when
    # the profile is sourced, not while this script is running
    echo "export PATH=\$SPARK_HOME/bin:\$PATH" >> $_bash
}

First, we check what OS system we're on and select the appropriate bash profile file. We also grab the current date (the _today variable) so that we can include that information in our bash profile file, and create its safe copy (just in case, and if one does not already exist). Next, we start to append new lines to the bash profile file:

  • We first set the SPARK_HOME variable to the _spark_destination; this is going to be either the /opt/spark or the ~/spark location.
  • The PYSPARK_SUBMIT_ARGS variable is used when you invoke pyspark. It instructs Spark to use four cores of your CPU; changing it to --master local[*] will use all the available cores.
  • We specify the PYSPARK_PYTHON variable so that, if multiple Python installations are present on the machine, pyspark will use the one that we checked for in the first recipe.
  • Setting PYSPARK_DRIVER_PYTHON to jupyter will start a Jupyter session (instead of the PySpark interactive shell).
  • The PYSPARK_DRIVER_PYTHON_OPTS variable instructs Jupyter to:
    • Start a notebook
    • Not open the browser by default: this is the --NotebookApp.open_browser=False flag
    • Change the default port (8888) to 6661 (because we are big fans of not leaving things at their defaults, for safety reasons)

Finally, we add the bin folder from SPARK_HOME to the PATH.

The last step is to cleanUp after ourselves; we simply remove the _temp folder with everything in it. 

Now that we have installed Spark, let's test if everything works. First, in order to make all the environment variables accessible in the Terminal's session, we need to refresh the bash session: you can either close and reopen the Terminal, or execute the following command (on macOS):

source ~/.bash_profile

On Linux, execute the following command:

source ~/.bashrc

Next, you should be able to execute the following:

pyspark --version

If all goes well, you should see a response that includes the Spark version banner (in our case, version 2.3.1).

There's more...

Instead of using the make-distribution.sh script from Spark, you can use Maven directly to compile the sources. For instance, if you wanted to build the default version of Spark, you could simply type (from the _spark_dir folder):

./build/mvn clean package

This would default to Hadoop 2.6. If your version of Hadoop is 2.7.2 and it is deployed over YARN, you can run the following:

./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -DskipTests clean package

You can also use Scala to build Spark:

./build/sbt package

Installing Spark from binaries

Installing Spark from already precompiled binaries is even easier than doing the same from the sources. In this recipe, we will show you how to do this by downloading the binaries from the web or by using pip.

Getting ready

To execute this recipe, you will need a bash Terminal and an internet connection. Also, to follow through with this recipe, you will need to have already checked and/or installed all the required environments we went through in the Installing Spark requirements recipe. In addition, you need to have administrative privileges (via the sudo command), as these will be necessary to move the compiled binaries to the destination folder. 

By default, the binaries will be moved to the /opt/spark folder, and that is why you need administrative rights. If you are not an administrator on your machine, you can call the script with the -ns (or --nosudo) parameter; the destination folder will then switch to your home directory, and the script will create a spark folder within it.

No other prerequisites are required.

How to do it...

To install from the binaries, we only need four steps (see the following source code) as we do not need to compile the sources:

  1. Download the precompiled binaries from Spark's website.
  2. Unpack the archive.
  3. Move to the final destination.
  4. Create the necessary environmental variables.

The skeleton for our code looks as follows (see the Chapter01/installFromBinary.sh file):

#!/bin/bash

# Shell script for installing Spark from binaries
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_spark_binary="http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz"
_spark_archive=$( echo "$_spark_binary" | awk -F '/' '{print $NF}' )
_spark_dir=$( echo "${_spark_archive%.*}" )
_spark_destination="/opt/spark"

...

checkOS
printHeader
downloadThePackage
unpack
moveTheBinaries
setSparkEnvironmentVariables
cleanUp

How it works...

The code is almost exactly the same as in the previous recipe, so we will not repeat it here; the only major differences are that this script has no build stage and that the _spark_binary variable points to a precompiled archive rather than the sources. 

As in the previous recipe, we start by specifying the location of Spark's binaries, stored in _spark_binary. The _spark_archive contains the name of the archive; we use awk to extract the last element of the URL. The _spark_dir contains the name of the directory our archive will unpack into; in our current case, this will be spark-2.3.1-bin-hadoop2.7. Finally, we specify the destination folder we will move the binaries to: it will either be /opt/spark (the default) or your home directory if you use the -ns (or --nosudo) switch when calling the ./installFromBinary.sh script.

Next, we check the OS name. Depending on whether you work in a Linux or macOS environment, we will use different tools to download the archive from the internet (check the downloadThePackage function). Also, when setting the environment variables, we will output to different bash profile files: .bash_profile on macOS and .bashrc on Linux (check the setSparkEnvironmentVariables function). 

Following the OS check, we download the package: on macOS, we use curl, and on Linux, we use wget. Once the package is downloaded, we unpack it using the tar tool and then move it to its destination folder. If you are running with sudo privileges (without the -ns or --nosudo parameter), the binaries will be moved to the /opt/spark folder; if not, they will end up in the ~/spark folder.

Finally, we add environment variables to the appropriate bash profile files: check the previous recipe for an explanation of what is being added and for what reason. Also, follow the steps at the end of the previous recipe to test if your environment is working properly.

There's more...

Nowadays, there is an even simpler way to install PySpark on your machine, that is, by using pip.

pip is Python's package manager. If you installed Python 2.7.9+ or Python 3.4+ from http://python.org, then pip is already present on your machine (the same goes for our recommended Python distribution, Anaconda). If you do not have pip, you can easily install it by following the instructions at https://pip.pypa.io/en/stable/installing/.

To install PySpark via pip, just issue the following command in the Terminal:

pip install pyspark

Or, if you use Python 3.4+, you may also try:

pip3 install pyspark

pip will then download the pyspark package along with its dependencies and install it, reporting its progress in the Terminal.

Configuring a local instance of Spark

There is actually not much you need to do to configure a local instance of Spark. The beauty of Spark is that all you need to do to get started is to follow either of the previous two recipes (installing from sources or from binaries) and you can begin using it. In this recipe, however, we will walk you through the most useful SparkSession configuration options.

Getting ready

In order to follow this recipe, a working Spark environment is required. This means that you will have to have gone through the previous three recipes and have successfully installed and tested your environment, or had a working Spark environment already set up.

No other prerequisites are necessary.

How to do it...

To configure your session in a Spark version lower than 2.0, you would normally have to create a SparkConf object, set all your options to the right values, and then build the SparkContext (plus a SQLContext if you wanted to use DataFrames, and a HiveContext if you wanted access to Hive tables). Starting with Spark 2.0, you just need to create a SparkSession, as in the following snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("Your-app-name") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

How it works...

To create a SparkSession, we will use the Builder class (accessed via the .builder property of the SparkSession class). You can specify some basic properties of the SparkSession here:

  • The .master(...) allows you to specify the master URL the session connects to (in our preceding example, we would be running a local session using two cores)
  • The .appName(...) gives you means to specify a friendly name for your app
  • The .config(...) method allows you to refine your session's behavior further; the list of the most important SparkSession parameters is outlined in the following table
  • The .getOrCreate() method returns either a new SparkSession if one has not been created yet, or returns a pointer to an already existing SparkSession

The following table gives an example list of the most useful configuration parameters for a local instance of Spark:

Parameter | Function | Default
spark.app.name | Specifies a friendly name for your application. | (none)
spark.driver.cores | Number of cores for the driver node to use. This is only applicable for app deployments in cluster mode (see the spark.submit.deployMode parameter below). | 1
spark.driver.memory | Specifies the amount of memory for the driver process. If using spark-submit in client mode, you should specify this on the command line using the --driver-memory switch rather than configuring your session with this parameter, as the JVM would have already started at that point. | 1g
spark.executor.cores | Number of cores for an executor to use. Setting this parameter while running locally allows you to use all the available cores on your machine. | 1 in YARN deployments; all available cores on the worker in standalone and Mesos deployments
spark.executor.memory | Specifies the amount of memory for each executor process. | 1g
spark.submit.pyFiles | Comma-separated list of .zip, .egg, or .py files. These will be added to the PYTHONPATH so that they are accessible to Python apps. | (none)
spark.submit.deployMode | Deploy mode of the Spark driver program. Specifying 'client' will launch the driver program locally on the machine (it can be the driver node), while specifying 'cluster' will utilize one of the nodes on a remote cluster. | (none)
spark.pyspark.python | Python binary that should be used by the driver and all the executors. | (none)

Some of these parameters are also applicable if you are working in a cluster environment with multiple worker nodes. In the next recipe, we will explain how to set up and administer a multi-node Spark cluster deployed over YARN.
There are some environment variables that also allow you to further fine-tune your Spark environment. Specifically, we are talking about the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables. We have already covered these in the Installing Spark from sources recipe.

See also

Configuring a multi-node instance of Spark

Setting up a multi-node Spark cluster requires quite a few more steps to get it ready. In this recipe, we will go step-by-step through the script that will help you with this process; the script needs to run on the driver node and all the executors to set up the environment.

Getting ready

In this recipe, we are solely focusing on a Linux environment (we are using Ubuntu Server 16.04 LTS). The following prerequisites are required before you can follow with the rest of the recipe:

  • A clean installation of a Linux distribution; in our case, we have installed Ubuntu Server 16.04 LTS on each machine in our cluster of three Dell R710s.
  • Each machine needs to be connected to the internet and accessible from your local machine. You will need the machines' IPs and their hostnames; on Linux, you can check the IP by issuing the ifconfig command and reading the inet addr. To check your hostname, type cat /etc/hostname.
  • On each server, we added a user group called hadoop. Following this, we have created a user called hduser and added it to the hadoop group. Also, make sure that the hduser has sudo rights. If you do not know how to do this, check the See also section of this recipe.
  • Make sure you can reach your servers via SSH. If you cannot, run sudo apt-get install openssh-server openssh-client on each server to install the necessary packages.
  • If you want to read and write to Hadoop and Hive, you need to have these two environments installed and configured on your cluster. Check https://data-flair.training/blogs/install-hadoop-2-x-on-ubuntu/ for Hadoop installation and configuration and http://www.bogotobogo.com/Hadoop/BigData_hadoop_Hive_Install_On_Ubuntu_16_04.php for Hive.
If you have these two environments set up, some of the steps in our script will be redundant. However, we will present all of the steps below, assuming you only want the Spark environment.

No other prerequisites are required.

For the purpose of automating the deployment of the Spark environment in a cluster setup, you will also have to:

  1. Create a hosts.txt file. Each entry on the list is the IP address of one of the servers followed by two spaces and a hostname. Do not delete the driver: nor executors: lines. Also, note that we only allow one driver in our cluster (some clusters support redundant drivers). An example of the content of this file is as follows:
driver:
192.168.17.160  pathfinder
executors:
192.168.17.161  discovery1
192.168.17.162  discovery2
  2. On your local machine, add the IPs and hostnames to your /etc/hosts file so you can access the servers via hostnames instead of IPs (once again, we are assuming you are running a Unix-like system such as macOS or Linux). For example, the following command will add pathfinder to our /etc/hosts file: echo "192.168.17.160  pathfinder" | sudo tee -a /etc/hosts. Repeat this for every machine in your cluster.
  3. Copy the hosts.txt file to each machine in your cluster; we assume the file will be placed in the home folder of the hduser. You can achieve this easily with the scp hosts.txt hduser@<your-server-name>:~ command, where <your-server-name> is the hostname of the machine.
  4. To run the installOnRemote.sh script (see the Chapter01/installOnRemote.sh file) from your local machine, do the following: ssh -tq hduser@<your-server-name> "echo $(base64 -i installOnRemote.sh) | base64 -d | sudo bash". We will go through the installOnRemote.sh script in detail in the next section.
  5. Follow the prompts on the screen to finalize the installation and configuration steps. Repeat step 4 for each machine in your cluster.
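Before copying hosts.txt to the servers, it can be worth validating its format locally. Below is a small, hypothetical Python helper (not part of the book's scripts; validate_hosts is our own name) that checks that each non-marker entry is an IPv4 address followed by two spaces and a hostname:

```python
import re

# Hypothetical validator for the hosts.txt format described above:
# marker lines ("driver:"/"executors:") or "<IPv4>  <hostname>" rows.
ENTRY = re.compile(r"^\d{1,3}(\.\d{1,3}){3}  \S+$")

def validate_hosts(lines):
    """Return a list of (line_number, line) pairs that are malformed."""
    bad = []
    for i, line in enumerate(lines, start=1):
        if line in ("driver:", "executors:") or not line.strip():
            continue
        if not ENTRY.match(line):
            bad.append((i, line))
    return bad

sample = [
    "driver:",
    "192.168.17.160  pathfinder",
    "executors:",
    "192.168.17.161  discovery1",
    "192.168.17.162 discovery2",   # only one space: malformed
]
print(validate_hosts(sample))      # [(5, '192.168.17.162 discovery2')]
```

Catching a single-space entry early is cheaper than debugging a half-configured cluster later.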

How to do it...

The installOnRemote.sh script for this recipe can be found in the Chapter01 folder in the GitHub repository: http://bit.ly/2ArlBck. Some portions of the script are very similar to the ones we have outlined in the previous recipes, so we will skip those; you can refer to previous recipes for more information (especially the Installing Spark requirements and the Installing Spark from binaries recipes). 

The top-level structure of the script is as follows:

#!/bin/bash
# Shell script for installing Spark from binaries
# on remote servers
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/9/2017
_spark_binary="http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz"
_spark_archive=$( echo "$_spark_binary" | awk -F '/' '{print $NF}' )
_spark_dir=$( echo "${_spark_archive%.*}" )
_spark_destination="/opt/spark"
_java_destination="/usr/lib/jvm/java-8-oracle"

_python_binary="https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh"

_python_archive=$( echo "$_python_binary" | awk -F '/' '{print $NF}' )
_python_destination="/opt/python"
_machine=$(cat /etc/hostname)
_today=$( date +%Y-%m-%d )
_current_dir=$(pwd) # store current working directory
...
printHeader
readIPs
checkJava
installScala
installPython

updateHosts
configureSSH
downloadThePackage
unpack
moveTheBinaries
setSparkEnvironmentVariables
updateSparkConfig
cleanUp

We have highlighted the portions of the script that are more relevant to this recipe in bold font.

How it works...

As with the previous recipes, we will first specify where we are going to download the Spark binaries from and create all the relevant global variables we are going to use later. 

Next, we read in the hosts.txt file:

function readIPs() {
    input="./hosts.txt"

    driver=0
    executors=0
    _executors=""

    IFS=''
    while read line
    do
        # skip empty lines
        if [[ -z "${line}" ]]; then
            continue
        fi

        if [[ "$line" = "driver:" ]]; then
            driver=1
            executors=0
            continue
        fi
        if [[ "$line" = "executors:" ]]; then
            executors=1
            driver=0
            continue
        fi

        if [[ "$driver" = "1" ]]; then
            _driverNode="$line"
            driver=0
        fi
        if [[ "$executors" = "1" ]]; then
            _executors=$_executors"$line\n"
        fi
    done < "$input"
}

We store the path to the file in the input variable. The driver and the executors variables are flags we use to skip the "driver:" and the "executors:" lines from the input file. The _executors empty string will store the list of executors, which are delimited by a newline "\n".

IFS stands for internal field separator. Whenever bash reads a line from a file, it will split it on that character. Here, we will set it to an empty character '' so that we preserve the double spaces between the IP address and the hostname. 

Next, we start reading the file, line by line. Let's see how the logic works inside the loop:

  • If the line we just read equals "driver:" (the if [[ "$line" = "driver:" ]]; conditional), we set the driver flag to 1 so that, when the next line is read, we store it as the _driverNode (this is done inside the if [[ "$driver" = "1" ]]; conditional). Inside that conditional, we also reset the executors flag to 0; the latter is done in case you list the executors first, followed by a single driver, in hosts.txt. Once the line with the driver node information is read, we reset the driver flag to 0.
  • On the other hand, if the line we just read equals "executors:" (the if [[ "$line" = "executors:" ]]; conditional), we set the executors flag to 1 (and reset the driver flag to 0). This guarantees that the lines read next will be appended to the _executors string, separated by the "\n" newline character (this happens inside the if [[ "$executors" = "1" ]]; conditional). Note that we do not reset the executors flag to 0 after each line, as we allow for more than one executor.
  • If we encounter an empty line (which we can check for in bash with the if [[ -z "${line}" ]]; conditional), we skip it.
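To make this flag-based parsing concrete, here is a hypothetical Python equivalent of the readIPs function (read_ips is our own name; the bash script itself does not use Python):

```python
# Hypothetical Python equivalent of the readIPs bash function:
# the same driver/executors flag logic, minus the IFS subtleties.
def read_ips(lines):
    driver_node = None
    executors = []
    mode = None  # plays the role of the driver/executors flags

    for line in lines:
        if not line.strip():          # skip empty lines
            continue
        if line == "driver:":
            mode = "driver"
            continue
        if line == "executors:":
            mode = "executors"
            continue
        if mode == "driver":
            driver_node = line        # store the single driver entry
            mode = None               # reset, like driver=0 in bash
        elif mode == "executors":
            executors.append(line)    # allow more than one executor
    return driver_node, executors

hosts = [
    "driver:",
    "192.168.17.160  pathfinder",
    "executors:",
    "192.168.17.161  discovery1",
    "192.168.17.162  discovery2",
]
driver_node, executor_nodes = read_ips(hosts)
print(driver_node)       # 192.168.17.160  pathfinder
print(executor_nodes)    # two discovery entries
```

The Python version keeps the whole line (IP, two spaces, hostname) intact, just as setting IFS='' does in bash.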

You might notice that we use the "<" redirection operator to read in the data (indicated here by the input variable).

You can read more about I/O redirection here: http://www.tldp.org/LDP/abs/html/io-redirection.html.

Since Spark requires Java and Scala to work, next we have to check if Java is installed, and we will install Scala (as it normally isn't present while Java might be). This is achieved with the following functions:

function checkJava() {
    if type -p java; then
        echo "Java executable found in PATH"
        _java=java
    elif [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
        echo "Found Java executable in JAVA_HOME"
        _java="$JAVA_HOME/bin/java"
    else
        echo "No Java found. Install Java version $_java_required or higher first or specify JAVA_HOME variable that will point to your Java binaries."
        installJava
    fi
}

function installJava() {
    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer
}

function installScala() {
    sudo apt-get install scala
}

function installPython() {
    curl -O "$_python_binary"
    chmod 0755 ./"$_python_archive"
    sudo bash ./"$_python_archive" -b -u -p "$_python_destination"
}

The logic here doesn't differ much from what we presented in the Installing Spark requirements recipe. The only notable difference in the checkJava function is that if we do not find Java on the PATH or inside the JAVA_HOME folder, we do not exit, but run installJava instead.

There are many ways to install Java; we have already presented you with one of them earlier in this book—check the Installing Java section in the Installing Spark requirements recipe. Here, we used the built-in apt-get tool.

The apt-get tool is a convenient, fast, and efficient utility for installing packages on your Linux machine. APT stands for Advanced Package Tool.

First, we install python-software-properties. This set of tools provides an abstraction of the apt repositories in use and enables easy management of distribution as well as independent software vendor software sources. We need it because, in the next line, we call add-apt-repository to add a new repository, as we want the Oracle Java distribution. The sudo apt-get update command refreshes the contents of the repositories and, in our case, fetches the list of packages available in ppa:webupd8team/java. Finally, we install the Java package; just follow the prompts on the screen. We install Scala the same way.

The default location where the package should install is /usr/lib/jvm/java-8-oracle. If this is not the case or you want to install it in a different folder, you will have to alter the _java_destination variable inside the script to reflect the new destination.

The advantage of using this tool is this: if there are already Java and Scala environments installed on a machine, using apt-get will either skip the installation (if the environment is up-to-date with the one available on the server) or ask you to update to the newest version.

We will also install the Anaconda distribution of Python (as mentioned previously, we highly recommend this distribution). To achieve this goal, we must download the Anaconda3-5.0.1-Linux-x86_64.sh script first and then follow the prompts on the screen. The -b parameter runs the installer in batch mode, so it will not update the .bashrc file (we will do that later), the -u switch will update the Python environment in case /opt/python already exists, and -p will force the installation into that folder.

Having passed the required installation steps, we will now update the /etc/hosts files on the remote machines:

function updateHosts() {
    _hostsFile="/etc/hosts"

    # make a copy (if one doesn't already exist)
    if ! [ -f "/etc/hosts.old" ]; then
        sudo cp "$_hostsFile" /etc/hosts.old
    fi

    t="###################################################\n"
    t=$t"#\n"
    t=$t"# IPs of the Spark cluster machines\n"
    t=$t"#\n"
    t=$t"# Script: installOnRemote.sh\n"
    t=$t"# Added on: $_today\n"
    t=$t"#\n"
    t=$t"$_driverNode\n"
    t=$t"$_executors\n"

    printf "$t" | sudo tee -a "$_hostsFile" > /dev/null
}

This is a simple function that, first, creates a copy of the /etc/hosts file, and then appends the IPs and hostnames of the machines in our cluster. Note that the format required by the /etc/hosts file is the same as in the hosts.txt file we use: per row, an IP address of the machine followed by two spaces followed by the hostname.

We use two spaces for readability purposes—one space separating an IP and the hostname would also work.

Also, note that we do not use the echo command here, but printf; the reason behind this is that the printf command prints out a formatted version of the string, properly handling the newline "\n" characters.

Next, we configure the passwordless SSH sessions (check the following See also subsection) to aid communication between the driver node and the executors:

function configureSSH() {
    # check if driver node
    IFS=" "
    read -ra temp <<< "$_driverNode"
    _driver_machine=( ${temp[1]} )
    _all_machines="$_driver_machine\n"

    if [ "$_driver_machine" = "$_machine" ]; then
        # generate key pairs (passwordless)
        sudo -u hduser rm -f ~/.ssh/id_rsa
        sudo -u hduser ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

        IFS="\n"
        read -ra temp <<< "$_executors"
        for executor in ${temp[@]}; do
            # skip if empty line
            if [[ -z "${executor}" ]]; then
                continue
            fi

            # split on space
            IFS=" "
            read -ra temp_inner <<< "$executor"
            echo
            echo "Trying to connect to ${temp_inner[1]}"

            cat ~/.ssh/id_rsa.pub | ssh "hduser"@"${temp_inner[1]}" 'mkdir -p .ssh && cat >> .ssh/authorized_keys'

            _all_machines=$_all_machines"${temp_inner[1]}\n"
        done
    fi

    echo "Finishing up the SSH configuration"
}

Inside this function, we first check if we are on the driver node, as defined in the hosts.txt file, as we only need to perform these tasks on the driver. The read -ra temp <<< "$_driverNode" command reads the _driverNode (in our case, it is 192.168.17.160  pathfinder) and splits it at the space character (remember what IFS stands for?). The -a switch instructs the read builtin to store the split _driverNode string in the temp array, and the -r parameter makes sure that the backslash does not act as an escape character. We store the hostname of the driver in the _driver_machine variable and append it to the _all_machines string (we will use this later).

If we are executing this script on the driver machine, the first thing we must do is remove the old SSH key (using the rm function with the -f, force switch) and create a new one. The sudo -u hduser switch allows us to perform these actions as the hduser (instead of the root user).

When we submit the script to run from our local machine, we start an SSH session as a root on the remote machine. You will see how this is done shortly, so take our word on that for now. 

We will use the ssh-keygen tool to create the SSH key pair. The -t switch allows us to select the encryption algorithm (we are using RSA), the -P switch determines the passphrase to use (we want this passwordless, so we pass ""), and the -f parameter specifies the filename for storing the keys.

Next, we loop through all the executors: we need to append the contents of ~/.ssh/id_rsa.pub to their ~/.ssh/authorized_keys files. We split the _executors at the "\n" character and loop through all of them. To deliver the contents of the id_rsa.pub file to the executors, we use the cat tool to print out the contents of the id_rsa.pub file and then pipe it to the ssh tool. The first parameter we pass to the ssh is the username and the hostname we want to connect to. Next, we pass the commands we want to execute on the remote machine. First, we attempt to create the .ssh folder if one does not exist. This is followed by outputting the id_rsa.pub file to .ssh/authorized_keys.
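If you wanted to drive this key-distribution step from Python instead of bash, it could be sketched roughly as follows (a hypothetical helper built on subprocess; the function names and hostname are our own examples):

```python
import os
import subprocess

# Hypothetical sketch: build (and optionally run) the same command the
# script uses to append the public key to an executor's authorized_keys.
REMOTE_CMD = "mkdir -p .ssh && cat >> .ssh/authorized_keys"

def build_key_copy(user, host):
    """Return the argv list for: ssh user@host '<REMOTE_CMD>'."""
    return ["ssh", f"{user}@{host}", REMOTE_CMD]

def push_key(user, host, pubkey_path="~/.ssh/id_rsa.pub"):
    # Equivalent of: cat id_rsa.pub | ssh user@host '<REMOTE_CMD>'
    # (the public key is fed to the remote cat via stdin)
    with open(os.path.expanduser(pubkey_path), "rb") as key:
        subprocess.run(build_key_copy(user, host), stdin=key, check=True)

print(build_key_copy("hduser", "discovery1"))
# ['ssh', 'hduser@discovery1', 'mkdir -p .ssh && cat >> .ssh/authorized_keys']
```

Either way, the effect is the same: the driver's public key ends up in each executor's authorized_keys file.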

Following the SSH session's configurations on the cluster, we download the Spark binaries, unpack them, and move them to _spark_destination.

We have outlined these steps in the Installing Spark from sources and Installing Spark from binaries sections, so we recommend that you check them out.

Finally, we need to set two Spark configuration files: the spark-env.sh and the slaves files:

function updateSparkConfig() {
    cd $_spark_destination/conf

    sudo -u hduser cp spark-env.sh.template spark-env.sh
    echo "export JAVA_HOME=$_java_destination" >> spark-env.sh
    echo "export SPARK_WORKER_CORES=12" >> spark-env.sh

    sudo -u hduser cp slaves.template slaves
    printf "$_all_machines" >> slaves
}

We need to append the JAVA_HOME variable to spark-env.sh so that Spark can find the necessary libraries. We also set the number of cores per worker to 12 by appending the SPARK_WORKER_CORES variable.

You might want to tune the SPARK_WORKER_CORES value to your needs. Check this spreadsheet for help: http://c2fo.io/img/apache-spark-config-cheatsheet/C2FO-Spark-Config-Cheatsheet.xlsx (which is available from here: http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/).

Next, we have to output the hostnames of all the machines in our cluster to the slaves file.

In order to execute the script on the remote machine, and since we need to run it in an elevated mode (as root, using sudo), we need to encode the script before we send it over the wire. An example of how this is done is as follows (from macOS to remote Linux):

ssh -tq hduser@pathfinder "echo $(base64 -i installOnRemote.sh) | base64 -d | sudo bash"

Or from Linux to remote Linux:

ssh -tq hduser@pathfinder "echo $(base64 -w0 installOnRemote.sh) | base64 -d | sudo bash"

The preceding command uses the base64 tool to encode the installOnRemote.sh script before pushing it over to the remote machine (note that base64 is an encoding, not encryption, so it provides no confidentiality). Once on the remote machine, we use base64 again to decode the script (the -d switch) and run it as root (via sudo). Note that in order to run this type of script, we also pass the -tq switch to the ssh tool; the -t option forces a pseudo-terminal allocation so that we can execute arbitrary screen-based scripts on the remote machine, and the -q option quiets all messages but those from our script.
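The encode/decode round trip is easy to reproduce in Python, which also illustrates that base64 is a reversible encoding rather than encryption:

```python
import base64

# Demonstrate the round trip used by the ssh one-liner above:
# encode a script's bytes, ship the ASCII-safe text, then decode it back.
script = b'#!/bin/bash\necho "hello from the remote"\n'

encoded = base64.b64encode(script)    # safe to embed in a shell string
decoded = base64.b64decode(encoded)   # what "base64 -d" does remotely

print(decoded == script)              # True: nothing lost, nothing hidden
```

Encoding simply avoids quoting and newline headaches when embedding the script inside the ssh command string.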

Assuming all goes well, once the script finishes executing on all your machines, Spark has been successfully installed and configured on your cluster. However, before you can use Spark, you need either to close the connection to your driver and SSH to it again, or type:

source ~/.bashrc

This is so that the newly created environment variables are available, and your PATH is updated.

To start your cluster, you can type:

start-all.sh

All the machines in the cluster should then come to life and be recognized by Spark.

In order to check if everything started properly, type:

jps

And it should return something similar to the following (in our case, we had three machines in our cluster):

40334 Master
41297 Worker
41058 Worker

See also

Here's a list of useful links that might help you to go through with this recipe:

Installing Jupyter

Jupyter provides a convenient means of interacting with your Spark environment. In this recipe, we will guide you through installing Jupyter on your local machine.

Getting ready

We require a working installation of Spark. This means that you will have followed the steps outlined in the first, and either the second or third recipes. In addition, a working Python environment is also required.

No other prerequisites are required.

How to do it...

If you do not have pip installed on your machine, you will need to install it before proceeding.

  1. To do this, open your Terminal and type (on macOS):
curl -O https://bootstrap.pypa.io/get-pip.py

Or the following on Linux:

wget https://bootstrap.pypa.io/get-pip.py
  2. Next, type (applies to both operating systems):
python get-pip.py

This will install pip on your machine.

  3. All you have to do now is install Jupyter with the following command:
pip install jupyter

How it works...

pip is a management tool for installing Python packages from PyPI, the Python Package Index. This service hosts a wide range of Python packages and is the easiest and quickest way to distribute your Python packages.

However, pip install does not only search for packages on PyPI: it can also install from VCS project URLs, local project directories, and local or remote source archives.

Jupyter is one of the most popular interactive shells that supports developing code in a wide variety of environments: Python is not the only one that's supported.

Directly from http://jupyter.org:

"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."

Another way to install Jupyter, if you are using the Anaconda distribution of Python, is to use its package management tool, conda. Here's how:

conda install jupyter

Note that pip install will also work in Anaconda.

There's more...

Now that you have Jupyter on your machine, and assuming you followed the steps of either the Installing Spark from sources or the Installing Spark from binaries recipes, you should be able to start using Jupyter to interact with PySpark.

To refresh your memory: as part of the Spark installation scripts, we appended two environment variables to the bash profile file: PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS. We set the former to jupyter and the latter to start a notebook service.

Now, open your Terminal and type:

pyspark

When you open your browser and navigate to http://localhost:6661, you should see a window not that different from the one in the following screenshot:

See also

Configuring a session in Jupyter

Working in Jupyter is great as it allows you to develop your code interactively, and document and share your notebooks with colleagues. The problem, however, with running Jupyter against a local Spark instance is that the SparkSession gets created automatically and by the time the notebook is running, you cannot change much in that session's configuration.

In this recipe, we will learn how to install Livy, a REST service for interacting with Spark, and sparkmagic, a package that allows us to configure Jupyter sessions interactively:

Source: http://bit.ly/2iO3EwC

Getting ready

We assume that you either have installed Spark via binaries or compiled the sources as we have shown you in the previous recipes. In other words, by now, you should have a working Spark environment. You will also need Jupyter: if you do not have it, follow the steps from the previous recipe to install it. 

No other prerequisites are required.

How to do it...

To install Livy and sparkmagic, we have created a script that will do this automatically with minimal interaction from you. You can find it in the Chapter01/installLivy.sh file. You should be familiar with most of the functions we're going to use here by now, so we will focus only on those that are different (highlighted in bold in the following code). Here is the high-level view of the script's structure:

#!/bin/bash

# Shell script for installing Spark from binaries
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_livy_binary="http://mirrors.ocf.berkeley.edu/apache/incubator/livy/0.4.0-incubating/livy-0.4.0-incubating-bin.zip"
_livy_archive=$( echo "$_livy_binary" | awk -F '/' '{print $NF}' )
_livy_dir=$( echo "${_livy_archive%.*}" )
_livy_destination="/opt/livy"
_hadoop_destination="/opt/hadoop"
...
checkOS
printHeader
createTempDir
downloadThePackage $( echo "${_livy_binary}" )
unpack $( echo "${_livy_archive}" )
moveTheBinaries $( echo "${_livy_dir}" ) $( echo "${_livy_destination}" )

# create log directory inside the folder
mkdir -p "$_livy_destination/logs"

checkHadoop
installJupyterKernels
setSparkEnvironmentVariables
cleanUp

How it works...

As with all other scripts we have presented so far, we will begin by setting some global variables.

If you do not know what these mean, check the Installing Spark from sources recipe.

Livy requires some configuration files from Hadoop. Thus, as part of this script, we allow you to install Hadoop should it not be present on your machine. That is why we now allow you to pass arguments to the downloadThePackage, unpack, and moveTheBinaries functions.

The changes to the functions are fairly self-explanatory, so for the sake of space, we will not be pasting the code here. You are more than welcome, though, to peruse the relevant portions of the installLivy.sh script.

Installing Livy boils down to downloading the package, unpacking it, and moving it to its final destination (in our case, /opt/livy).

Checking if Hadoop is installed is the next thing on our to-do list. To run Livy with local sessions, we require two environment variables: SPARK_HOME and HADOOP_CONF_DIR. SPARK_HOME is definitely set, but if you do not have Hadoop installed, you most likely will not have the latter set:

function checkHadoop() {
    if type -p hadoop; then
        echo "Hadoop executable found in PATH"
        _hadoop=hadoop
    elif [[ -n "$HADOOP_HOME" ]] && [[ -x "$HADOOP_HOME/bin/hadoop" ]]; then
        echo "Found Hadoop executable in HADOOP_HOME"
        _hadoop="$HADOOP_HOME/bin/hadoop"
    else
        echo "No Hadoop found. You should install Hadoop first. You can still continue but some functionality might not be available."
        echo
        echo -n "Do you want to install the latest version of Hadoop? [y/n]: "
        read _install_hadoop

        case "$_install_hadoop" in
            y*) installHadoop ;;
            n*) echo "Will not install Hadoop" ;;
            *)  echo "Will not install Hadoop" ;;
        esac
    fi
}

function installHadoop() {
    _hadoop_binary="http://mirrors.ocf.berkeley.edu/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz"
    _hadoop_archive=$( echo "$_hadoop_binary" | awk -F '/' '{print $NF}' )
    _hadoop_dir=$( echo "${_hadoop_archive%.*}" )
    _hadoop_dir=$( echo "${_hadoop_dir%.*}" )

    downloadThePackage $( echo "${_hadoop_binary}" )
    unpack $( echo "${_hadoop_archive}" )
    moveTheBinaries $( echo "${_hadoop_dir}" ) $( echo "${_hadoop_destination}" )
}

The checkHadoop function first checks if the hadoop binary is present on the PATH; if not, it will check if the HADOOP_HOME variable is set and, if it is, it will check if the hadoop binary can be found inside the $HADOOP_HOME/bin folder. If both attempts fail, the script will ask you if you want to install the latest version of Hadoop; the default answer is n but if you answer y, the installation will begin.
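The lookup order used by checkHadoop (PATH first, then a home-style environment variable) can be sketched in Python with shutil.which; this is a hypothetical illustration of the same logic, not code from the script:

```python
import os
import shutil

def find_binary(name, home_var, env=None, which=shutil.which):
    """Mimic checkHadoop: try PATH first, then $<home_var>/bin/<name>."""
    env = os.environ if env is None else env
    found = which(name)               # analogous to: type -p hadoop
    if found:
        return found
    home = env.get(home_var)          # analogous to: [[ -n "$HADOOP_HOME" ]]
    if home:
        candidate = os.path.join(home, "bin", name)
        if os.access(candidate, os.X_OK):   # analogous to: [[ -x ... ]]
            return candidate
    return None                       # caller may then offer to install

# Deterministic example with a stubbed lookup (no real Hadoop needed):
print(find_binary("hadoop", "HADOOP_HOME", env={}, which=lambda n: "/usr/bin/hadoop"))
# /usr/bin/hadoop (the PATH hit wins)
```

The stubbed which argument exists purely so the example runs without Hadoop installed.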

Once the installation finishes, we will begin installing the additional kernels for the Jupyter Notebooks.

A kernel is a piece of software that translates the commands from the frontend notebook to the backend environment (like Python). For a list of available Jupyter kernels check out the following link: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels. Here are some instructions on how to develop a kernel yourself: http://jupyter-client.readthedocs.io/en/latest/kernels.html.

Here's the function that handles the kernel's installation:

function installJupyterKernels() {
    # install the library
    pip install sparkmagic
    echo

    # enable the notebook extension so ipywidgets work properly
    jupyter nbextension enable --py --sys-prefix widgetsnbextension
    echo

    # install kernels
    # get the location of sparkmagic
    _sparkmagic_location=$(pip show sparkmagic | awk -F ':' '/Location/ {print $2}')

    _temp_dir=$(pwd) # store current working directory

    cd $_sparkmagic_location # move to the sparkmagic folder
    jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
    echo

    # enable the ability to change clusters programmatically
    jupyter serverextension enable --py sparkmagic
    echo

    # install autovizwidget
    pip install autovizwidget

    cd $_temp_dir
}

First, we install the sparkmagic package for Python. Quoting directly from https://github.com/jupyter-incubator/sparkmagic:

"Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter Notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment."

The jupyter nbextension enable command enables the JavaScript extension in Jupyter Notebooks so that ipywidgets can work properly; if you have the Anaconda distribution of Python, this package will be installed automatically.

Following this, we install the kernels. We need to switch to the folder where sparkmagic was installed. The pip show <package> command displays all the relevant information about an installed package; from its output, we extract only the Location using awk.
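The awk one-liner that extracts the Location field can be mirrored in Python; the following is a hypothetical equivalent (the pip show output shown is illustrative, not a real capture):

```python
# Hypothetical Python equivalent of:
#   pip show sparkmagic | awk -F ':' '/Location/ {print $2}'
def extract_location(pip_show_output):
    for line in pip_show_output.splitlines():
        if line.startswith("Location:"):
            # split on the first ':' only, then strip surrounding spaces
            # (awk's $2 keeps a leading space; we strip it here)
            return line.split(":", 1)[1].strip()
    return None

sample = """Name: sparkmagic
Location: /opt/python/lib/python3.6/site-packages
Requires: hdijupyterutils, autovizwidget
"""
print(extract_location(sample))   # /opt/python/lib/python3.6/site-packages
```

Knowing this location matters because the kernel specs live under the installed sparkmagic package folder.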

To install the kernels, we use the jupyter-kernelspec install <kernel> command. For example, the following command will install the sparkmagic kernel for the Scala API of Spark:

jupyter-kernelspec install sparkmagic/kernels/sparkkernel 

Once all the kernels are installed, we enable Jupyter to use sparkmagic so that we can change clusters programmatically. Finally, we install autovizwidget, an auto-visualization library for pandas DataFrames.

This concludes the Livy and sparkmagic installation part.

There's more...

Now that we have everything in place, let's see what this can do. 

First, start Jupyter (note that we do not use the pyspark command):

jupyter notebook

You should now be able to see the following options if you want to add a new notebook:

If you click on PySpark, it will open a notebook and connect to a kernel. 

There are a number of available magics to interact with the notebooks; type %%help to list them all. Here's the list of the most important ones:

Magic | Example | Explanation
info | %%info | Outputs session information from Livy.
cleanup | %%cleanup -f | Deletes all sessions running on the current Livy endpoint. The -f switch forces the cleanup.
delete | %%delete -f -s 0 | Deletes the session specified by the -s switch; the -f switch forces the deletion.
configure | %%configure -f {"executorMemory": "1000M", "executorCores": 4} | Arguably the most useful magic. Allows you to configure your session. Check http://bit.ly/2kSKlXr for the full list of available configuration parameters.
sql | %%sql -o tables -q SHOW TABLES | Executes a SQL query against the current SparkSession.
local | %%local a=1 | All the code in the notebook cell with this magic will be executed locally against the Python environment.

Once you have configured your session, you will get information back from Livy about the active sessions that are currently running:

Let's try to create a simple data frame using the following code:

from pyspark.sql.types import *

# Generate our data
ListRDD = sc.parallelize([
    (123, 'Skye', 19, 'brown'),
    (223, 'Rachel', 22, 'green'),
    (333, 'Albert', 23, 'blue')
])

# The schema is encoded using StructType
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

# Apply the schema to the RDD and create a DataFrame
drivers = spark.createDataFrame(ListRDD, schema)

# Create a temporary view using the DataFrame
drivers.createOrReplaceTempView("drivers")

Once you execute the preceding code in a cell inside the notebook, only then will the SparkSession be created:

If you execute the %%sql magic, you will get the following:
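When you run %%sql with the -o switch (for example, %%sql -o drivers_df -q followed by SELECT * FROM drivers), sparkmagic also copies the query result back to the local Python environment as a pandas DataFrame, which you can then manipulate in a %%local cell. The following Spark-free sketch shows the kind of frame you would get back; the drivers_df name is hypothetical, and its rows simply mirror the drivers view created above:

```python
import pandas as pd

# Hypothetical local copy of the `drivers` view, as `%%sql -o drivers_df`
# would hand it back to the notebook's Python environment.
drivers_df = pd.DataFrame(
    [(123, 'Skye', 19, 'brown'),
     (223, 'Rachel', 22, 'green'),
     (333, 'Albert', 23, 'blue')],
    columns=['id', 'name', 'age', 'eyeColor'],
)

# Once local, it behaves like any other pandas DataFrame
adults = drivers_df[drivers_df['age'] > 20]
print(adults['name'].tolist())  # → ['Rachel', 'Albert']
```

This is handy for quick, cluster-free inspection of small query results.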

See also

Working with Cloudera Spark images

Cloudera is a company that was founded in 2008 by ex-employees of Google, Yahoo!, Oracle, and Facebook. It was an early adopter of open source technologies such as Apache Hadoop when they were still fresh out of the oven; as a matter of fact, the author of Hadoop itself joined the company shortly thereafter. Today, Cloudera sells licenses for a broad array of open source products, mostly from the Apache Software Foundation, and also provides consulting services.

In this recipe, we will look at a free virtual image from Cloudera that we can use to learn how to use the newest technologies supported by the company. 

Getting ready

To go through this recipe, you will need a working installation of VirtualBox, a free virtualization tool from Oracle.

To run the VMs, you will need:

  • A 64-bit host; Windows 10, macOS, and most Linux distributions are 64-bit systems
  • A minimum of 4 GB of RAM dedicated to the VM; thus, a system with at least 8 GB of RAM is required

No other prerequisites are required.

How to do it...

To download the Cloudera QuickStart VM:

  1. Go to https://www.cloudera.com/downloads/quickstart_vms/5-12.html.
  2. Select VirtualBox as your platform from the dropdown on the right, and click on the Get it now button.
  3. A window to register will show up; fill it in as appropriate and follow the instructions on the screen:

Note that it is a 6 GB+ download, so it may take a while.

  4. Once downloaded, open VirtualBox.
  5. Go to File | Import appliance, click on the button next to the path selection, and find the .ovf file (it should be accompanied by the .vmdk file appropriate for the version you just downloaded).

On macOS, the image is automatically decompressed upon downloading. On Windows and Linux, you might need to unzip the archive first.

You should see a progress bar that is similar to this one:

Once imported, you should see a window like this:

  6. If you now click on Start, you should see a new window pop up, and the Cloudera VM (which is built on CentOS) should start booting up. Once done, a window similar to the following one should show up on your screen:

How it works...

There is really not much to configure: the Cloudera QuickStart VM has everything you need to get going. As a matter of fact, this is a much simpler solution for Windows users than installing all the necessary environments manually. However, at the time of writing, it only comes with Spark 1.6.0:

Nothing, however, can stop you from upgrading to Spark 2.3.1 by following either the Installing Spark from sources or Installing Spark from binaries recipes we presented earlier in this book. 

 


Key benefits

  • Perform effective data processing, machine learning, and analytics using PySpark
  • Overcome challenges in developing and deploying Spark solutions using Python
  • Explore recipes for efficiently combining Python and Apache Spark to process data

Description

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

Who is this book for?

The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.

What you will learn

  • Configure a local instance of PySpark in a virtual environment
  • Install and configure Jupyter in local and multi-node environments
  • Create DataFrames from JSON and a dictionary using pyspark.sql
  • Explore regression and clustering models available in the ML module
  • Use DataFrames to transform data used for modeling
  • Connect to PubNub and perform aggregations on streams

Product Details

Publication date: Jun 29, 2018
Length: 330 pages
Edition: 1st
Language: English
ISBN-13: 9781788835367



Table of Contents

8 Chapters
Installing and Configuring Spark
Abstracting Data with RDDs
Abstracting Data with DataFrames
Preparing Data for Modeling
Machine Learning with MLlib
Machine Learning with the ML Module
Structured Streaming with PySpark
GraphFrames – Graph Theory with PySpark

Customer reviews

Rating distribution: 1.7 out of 5 (3 Ratings)
5 star: 0%; 4 star: 0%; 3 star: 0%; 2 star: 66.7%; 1 star: 33.3%
Dimitri Shvorob, Oct 02, 2020 (2 out of 5 stars)
Wishing to learn Spark, I signed up for the Databricks Associate Spark Developer certification exam (Python flavor) and ordered off Amazon a number of Spark books, avoiding Scala-based titles and older titles pre-dating the DataFrame API. I ended up with the following list:
  • "Learning PySpark" by Drabas and Lee, published by Packt in 2017
  • "Frank Kane's Taming Big Data with Apache Spark and Python" by (no surprise) Kane, Packt, 2017
  • "Data Analytics with Spark Using Python" by Aven, Addison Wesley, 2018
  • "PySpark Cookbook" by (once again) Drabas and Lee, Packt, 2018
  • "Developing Spark Applications with Python" by Morera and Campos, self-published in 2019
  • "PySpark Recipes" by Mishra, Apress, 2017
  • "Learning Spark" by Damjil et al., O'Reilly, 2020
  • "Beginning Apache Spark Using Azure Databricks" by Ilijason, Apress, 2020
  • "Spark: The Definitive Guide" by Chambers and Zaharia, O'Reilly, 2018
Databricks themselves point to "Learning Spark" and "Spark: The Definitive Guide" as preparation aids, so I started with these, skimming both books (and strongly preferring "The Definitive Guide") and then took a look at the others. "PySpark Cookbook" is an easy "pass". It is not as low-quality as the books by Mishra or by Morera and Campos, but it is still a low-quality, low-value-added affair of the type routinely churned out by Packt. Much of the page count is spent on setup matters, where directions may be out of date; then, when we get to Spark, a lot of space is taken up by the old RDD interface. Strikingly, Spark SQL gets all of 3 pages (pp. 117-119). Chapter 4 has some more interesting content (several non-trivial data-manipulation tasks that actually merit the "recipe" label) but with that, "core" Spark content ends, and the authors get into streaming, ML, and graphs.
It's important to remember that Packt pages have less text than pages of books from other publishers: here, 300 "Packt pages" translate to maybe 150 "normal" pages, and that is not a lot. Skip this book, and consider the Databricks-based introduction by Ilijason and the comprehensive but very accessible reference by Chambers and Zaharia.
Amazon verified review
mmays, Apr 17, 2022 (2 out of 5 stars)
Pretty good text, and I like the approach the author takes, but the Kindle version is really awful for the illegible graphics. I've tried them on a Kindle reader, Kindle cloud in a browser, copy and paste, no joy, they are just too small and illegible if magnified.
Amazon verified review
Victor Tkachenko, Jul 06, 2018 (1 out of 5 stars)
This is a plagiary. Guys simply copied all info from the Wiki and trying to make money on it. Shame. No explanation of the code as far as I concern. Don't buy it. You can get more info from Googling...
Amazon verified review
