Packt+ | Advance your knowledge in tech

You're reading from PySpark Cookbook

Product type Book

Published in Jun 2018

Publisher Packt

ISBN-13 9781788835367

Pages 330 pages

Edition 1st Edition

Languages

Python

Concepts

Data Processing

Authors (2):

Denny Lee

Tomasz Drabas

View More author details

Table of Contents (13) Chapters

Title Page

Packt Upsell

Contributors

Preface

1. Installing and Configuring Spark

2. Abstracting Data with RDDs

3. Abstracting Data with DataFrames

4. Preparing Data for Modeling

5. Machine Learning with MLlib

6. Machine Learning with the ML Module

7. Structured Streaming with PySpark

8. GraphFrames – Graph Theory with PySpark

Index

Installing Spark from sources

Spark is distributed in two ways: either as precompiled binaries or as a source code that gives you the flexibility to choose, for example, whether you need support for Hive or not. In this recipe, we will focus on the latter.

Getting ready

To execute this recipe, you will need a bash Terminal and an internet connection. Also, to follow through with this recipe, you will have to have already checked and/or installed all the required environments we went through in the previous recipe. In addition, you need to have administrative privileges (via the sudo command) which will be necessary to move the compiled binaries to the destination folder.

Note

If you are not an administrator on your machine, you can call the script with the -ns (or --nosudo) parameter. The destination folder will then switch to your home directory and will create a spark folder within it. By default, the binaries will be moved to the /opt/spark folder and that's why you need administrative rights.

No other prerequisites are required.

How to do it...

There are five major steps we will undertake to install Spark from sources (check the highlighted portions of the code):

Download the sources from Spark's website
Unpack the archive

Build
Move to the final destination
Create the necessary environmental variables

The skeleton for our code looks as follows (see the Chapter01/installFromSource.sh file):

#!/bin/bash

# Shell script for installing Spark from sources
#
# PySpark Cookbook
# Author: Tomasz Drabas, Denny Lee
# Version: 0.1
# Date: 12/2/2017

_spark_source="http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.1/spark-2.3.1.tgz"
_spark_archive=$( echo "$_spark_source" | awk -F '/' '{print $NF}' )
_spark_dir=$( echo "${_spark_archive%.*}" )
_spark_destination="/opt/spark"

...

checkOS
printHeader
downloadThePackage
unpack
build
moveTheBinaries
setSparkEnvironmentVariables
cleanUp

How it works...

First, we specify the location of Spark's source code. The _spark_archive contains the name of the archive; we use awk to extract the last element (here, it is specified by the $NF flag) from the _spark_source. The _spark_dir contains the name of the directory our archive will unpack into; in our current case, this will be spark-2.3.1. Finally, we specify our destination folder where we will be going to move the binaries to: it will either be /opt/spark (default) or your home directory if you use the -ns (or --nosudo) switch when calling the ./installFromSource.sh script.

Next, we check the OS name we are using:

function checkOS(){
 _uname_out="$(uname -s)"
 case "$_uname_out" in
   Linux*) _machine="Linux";;
   Darwin*) _machine="Mac";;
   *) _machine="UNKNOWN:${_uname_out}"
 esac

 if [ "$_machine" = "UNKNOWN:${_uname_out}" ]; then
   echo "Machine $_machine. Stopping."
   exit
 fi
}

First, we get the short name of the operating system using the uname command; the -s switch returns a shortened version of the OS name. As mentioned earlier, we only focus on two operating systems: macOS and Linux, so if you try to run this script on Windows or any other system, it will stop. This portion of the code is necessary to set the _machine flag properly: macOS and Linux use different methods to download the Spark source codes and different bash profile files to set the environment variables.

Next, we print out the header (we will skip the code for this part here, but you are welcome to check the Chapter01/installFromSource.sh script). Following this, we download the necessary source codes:

function downloadThePackage() {
 ...
 if [ -d _temp ]; then
    sudo rm -rf _temp
 fi

 mkdir _temp 
 cd _temp

 if [ "$_machine" = "Mac" ]; then
    curl -O $_spark_source
 elif [ "$_machine" = "Linux"]; then
    wget $_spark_source
 else
    echo "System: $_machine not supported."
    exit
 fi

First, we check whether a _temp folder exists and, if it does, we delete it. Next, we recreate an empty _temp folder and download the sources into it; on macOS, we use the curl method while on Linux, we use wget to download the sources.

Note

Did you notice the ellipsis '...' character in our code? Whenever we use such a character, we omit some less relevant or purely informational portions of the code. They are still present, though, in the sources checked into the GitHub repository.

Once the sources land on our machine, we unpack them using the tar tool, tar -xf $_spark_archive. This happens inside the unpack function.

Finally, we can start building the sources into binaries:

function build(){
 ...

 cd "$_spark_dir"
 ./dev/make-distribution.sh --name pyspark-cookbook -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn

We use the make-distribution.sh script (distributed with Spark) to create our own Spark distribution, named pyspark-cookbook. The previous command will build the Spark distribution for Hadoop 2.7 and with Hive support. We will also be able to deploy it over YARN. Underneath the hood, the make-distribution.sh script is using Maven to compile the sources.

Once the compilation finishes, we need to move the binaries to the _spark_destination folder:

function moveTheBinaries() {

 ...
 if [ -d "$_spark_destination" ]; then 
    sudo rm -rf "$_spark_destination"
 fi

 cd ..
 sudo mv $_spark_dir/ $_spark_destination/

First, we check if the folder in the destination exists and, if it does, we remove it. Next, we simply move (mv) the $_spark_dir folder to its new home.

Note

This is when you will need to type in the password if you did not use the -ns (or --nosudo) flag when invoking the installFromSource.sh script.

One of the last steps is to add new environment variables to your bash profile file:

function setSparkEnvironmentVariables() {
 ...

 if [ "$_machine" = "Mac" ]; then
    _bash=~/.bash_profile
 else
    _bash=~/.bashrc
 fi
 _today=$( date +%Y-%m-%d )

 # make a copy just in case 
 if ! [ -f "$_bash.spark_copy" ]; then
        cp "$_bash" "$_bash.spark_copy"
 fi

 echo >> $_bash 
 echo "###################################################" >> $_bash
 echo "# SPARK environment variables" >> $_bash
 echo "#" >> $_bash
 echo "# Script: installFromSource.sh" >> $_bash
 echo "# Added on: $_today" >>$_bash
 echo >> $_bash

 echo "export SPARK_HOME=$_spark_destination" >> $_bash
 echo "export PYSPARK_SUBMIT_ARGS=\"--master local[4]\"" >> $_bash
 echo "export PYSPARK_PYTHON=$(type -p python)" >> $_bash
 echo "export PYSPARK_DRIVER_PYTHON=jupyter" >> $_bash

 echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"notebook --NotebookApp.open_browser=False --NotebookApp.port=6661\"" >> $_bash

 echo "export PATH=$SPARK_HOME/bin:\$PATH" >> $_bash
}

First, we check what OS system we're on and select the appropriate bash profile file. We also grab the current date (the _today variable) so that we can include that information in our bash profile file, and create its safe copy (just in case, and if one does not already exist). Next, we start to append new lines to the bash profile file:

We first set the SPARK_HOME variable to the _spark_destination; this is either going to be the /opt/spark or ~/spark location.
The PYSPARK_SUBMIT_ARGS variable is used when you invoke pyspark. It instructs Spark to use four cores of your CPU; changing it to --master local[*] will use all the available cores.
We specify the PYSPARK_PYTHON variable so, in case of multiple Python installations present on the machine, pyspark will use the one that we checked for in the first recipe.
Setting the PYSPARK_DRIVER_PYTHON to jupyter will start a Jupyter session (instead of the PySpark interactive shell).
The PYSPARK_DRIVER_PYTHON_OPS instructs Jupyter to:
- Start a notebook
- Do not open the browser by default: use the --NotebookApp.open_browser=False flag
- Change the default port (8888) to 6661 (because we are big fans of not having things at default for safety reasons)

Finally, we add the bin folder from SPARK_HOME to the PATH.

The last step is to cleanUp after ourselves; we simply remove the _temp folder with everything in it.

Now that we have installed Spark, let's test if everything works. First, in order to make all the environment variables accessible in the Terminal's session, we need to refresh the bash session: you can either close and reopen the Terminal, or execute the following command (on macOS):

source ~/.bash_profile

On Linux, execute the following command:

source ~/.bashrc

Next, you should be able to execute the following:

pyspark --version

If all goes well, you should see a response similar to the one shown in the following screenshot:

There's more...

Instead of using the make-distribution.sh script from Spark, you can use Maven directly to compile the sources. For instance, if you wanted to build the default version of Spark, you could simply type (from the _spark_dir folder):

./build/mvn clean package

This would default to Hadoop 2.6. If your version of Hadoop was 2.7.2 and was deployed over YARN, you can do the following:

./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -DskipTests clean package

You can also use Scala to build Spark:

./build/sbt package

You're reading from PySpark Cookbook

Table of Contents (13) Chapters

Installing Spark from sources

Getting ready

Note

How to do it...

How it works...

Note

Note

There's more...

See also

Authors (2)

Other recommended products

Personalised recommendations for you

You're reading from PySpark Cookbook

Table of Contents (13) Chapters close

Installing Spark from sources

Getting ready

Note

How to do it...

How it works...

Note

Note

There's more...

See also

Authors (2)

Other recommended products

Personalised recommendations for you

Table of Contents (13) Chapters