
Apache Spark Deep Learning Cookbook: Over 80 best practice recipes for the distributed training and deployment of neural networks using Keras and TensorFlow

By Ahmed Sherif and Ravindra
Jul 2018, 474 pages, 1st Edition

Setting Up Spark for Deep Learning Development

In this chapter, the following recipes will be covered:

  • Downloading an Ubuntu Desktop image
  • Installing and configuring Ubuntu with VMWare Fusion on macOS
  • Installing and configuring Ubuntu with Oracle VirtualBox on Windows
  • Installing and configuring Ubuntu Desktop for Google Cloud Platform
  • Installing and configuring Spark and prerequisites on Ubuntu Desktop
  • Integrating Jupyter notebooks with Spark
  • Starting and configuring a Spark cluster
  • Stopping a Spark cluster

Introduction

Deep learning is the branch of machine learning that uses neural networks as its main method of learning, and it has risen to prominence within the last few years. Microsoft, Google, Facebook, Amazon, Apple, Tesla, and many other companies use deep learning models in their apps, websites, and products. Over the same period, Spark, an in-memory compute engine that runs on top of big data sources, has made it possible to process large volumes of information quickly and with relative ease. Spark has become a leading big data development tool for data engineers, machine learning engineers, and data scientists.

Since deep learning models perform better with more data, the synergy between Spark and deep learning allowed for a perfect marriage. Almost as important as the code used to execute deep learning algorithms is the work environment that enables optimal development. Many talented minds are eager to develop neural networks to help answer important questions in their research. Unfortunately, one of the greatest barriers to the development of deep learning models is access to the necessary technical resources required to learn on big data. The purpose of this chapter is to create an ideal virtual development environment for deep learning on Spark.

Downloading an Ubuntu Desktop image

Spark can be set up for all types of operating systems, whether they reside on-premise or in the cloud. For our purposes, Spark will be installed on a Linux-based virtual machine with Ubuntu as the operating system. There are several advantages to using Ubuntu as the go-to virtual machine, not least of which is cost. Since they are based on open source software, Ubuntu operating systems are free to use and do not require licensing. Cost is always a consideration and one of the main goals of this publication is to minimize the financial footprint required to get started with deep learning on top of a Spark framework.

Getting ready

The following minimum specifications are recommended for downloading and running the image file:

  • Minimum of 2 GHz dual-core processor
  • Minimum of 2 GB system memory
  • Minimum of 25 GB of free hard drive space

How to do it...

Follow the steps in the recipe to download an Ubuntu Desktop image:

  1. In order to create a virtual machine of Ubuntu Desktop, it is necessary to first download the file from the official website: https://www.ubuntu.com/download/desktop.
  2. As of this writing, Ubuntu Desktop 16.04.3 is the most recent available version for download.
  3. Access the following file in a .iso format once the download is complete:

    ubuntu-16.04.3-desktop-amd64.iso
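
Before building a virtual machine from the downloaded file, it can be worth confirming that the download is intact. The following is a minimal sketch, not part of the original recipe, that computes the SHA-256 digest of the .iso so it can be compared with the value Ubuntu publishes in the SHA256SUMS file next to the image. The function names are hypothetical, and the expected digest in the usage comment is a placeholder that must be copied from the official release page.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that a multi-gigabyte .iso never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Compare the computed digest with a published one,
    ignoring surrounding whitespace and letter case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (placeholder digest; copy the real one from SHA256SUMS):
# ok = matches_published_checksum(
#     "ubuntu-16.04.3-desktop-amd64.iso", "<published sha256 value>")
```

If the comparison returns False, the safest course is to download the image again rather than attempt an installation from a corrupted file.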

How it works...

Virtual environments provide an optimal development workspace by isolating the work from the physical, or host, machine. Developers may be using all types of machines for their host environments, such as a MacBook running macOS, a Microsoft Surface running Windows, or even a virtual machine in the cloud with Microsoft Azure or AWS. To ensure consistency in the output of the code executed, a virtual environment running Ubuntu Desktop will be deployed that can be used and shared among a wide variety of host platforms.

There's more...

There are several options for desktop virtualization software, depending on whether the host environment runs Windows or macOS. Two common virtualization applications for macOS are:

  • VMWare Fusion
  • Parallels

See also

Installing and configuring Ubuntu with VMWare Fusion on macOS

This section will focus on building a virtual machine using an Ubuntu operating system with VMWare Fusion.

Getting ready

How to do it...

Follow the steps in the recipe to configure Ubuntu with VMWare Fusion on macOS:

  1. Once VMWare Fusion is up and running, click on the + button on the upper-left-hand side to begin the configuration process and select New..., as seen in the following screenshot:
  2. Once the selection has been made, select the option to Install from Disk or Image, as seen in the following screenshot:
  3. Select the operating system's iso file that was downloaded from the Ubuntu Desktop website, as seen in the following screenshot:
  4. The next step will ask whether you want to choose Linux Easy Install. It is recommended to do so, as well as to incorporate a Display Name/Password combination for the Ubuntu environment, as seen in the following screenshot:
  5. The configuration process is almost complete. A Virtual Machine Summary is displayed with the option to Customize Settings to increase the Memory and Hard Disk, as seen in the following screenshot:
  6. Anywhere from 20 to 40 GB hard disk space is sufficient for the virtual machine; however, bumping up the memory to either 2 GB or even 4 GB will assist with the performance of the virtual machine when executing Spark code in later chapters. Update the memory by selecting Processors and Memory under the Settings of the virtual machine and increasing the Memory to the desired amount, as seen in the following screenshot:

How it works...

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on VMWare Fusion. The memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.

There's more...

All that is remaining is to fire up the virtual machine for the first time, which initiates the installation process of the system onto the virtual machine. Once all the setup is complete and the user has logged in, the Ubuntu virtual machine should be available for development, as seen in the following screenshot:

See also

Aside from VMWare Fusion, there is also another product that offers similar functionality on a Mac. It is called Parallels Desktop for Mac. To learn more about VMWare and Parallels, and decide which program is a better fit for your development, visit the following websites:

Installing and configuring Ubuntu with Oracle VirtualBox on Windows

Unlike with macOS, there are several options to virtualize systems within Windows. This mainly has to do with the fact that virtualization on Windows is very common as most developers are using Windows as their host environment and need virtual environments for testing purposes without affecting any of the dependencies that rely on Windows.

Getting ready

VirtualBox from Oracle is a common virtualization product and is free to use. Oracle VirtualBox provides a straightforward process to get an Ubuntu Desktop virtual machine up and running on top of a Windows environment.

How to do it...

Follow the steps in this recipe to configure Ubuntu with VirtualBox on Windows:

  1. Initiate an Oracle VM VirtualBox Manager. Next, create a new virtual machine by selecting the New icon and specify the Name, Type, and Version of the machine, as seen in the following screenshot:
  2. Select Expert Mode as several of the configuration steps will get consolidated, as seen in the following screenshot:

Ideal memory size should be set to at least 2048 MB, or preferably 4096 MB, depending on the resources available on the host machine.

  3. Additionally, set an optimal hard disk size for an Ubuntu virtual machine performing deep learning algorithms to at least 20 GB, if not more, as seen in the following screenshot:
  4. Point the virtual machine manager to the start-up disk location where the Ubuntu iso file was downloaded to and then Start the creation process, as seen in the following screenshot:
  5. After allotting some time for the installation, select the Start icon to complete the virtual machine and get it ready for development, as seen in the following screenshot:

How it works...

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on Oracle VirtualBox. As was the case with VMWare Fusion, the memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.

There's more...

Please note that some machines running Microsoft Windows do not have hardware virtualization enabled by default, and users may receive an initial error indicating that VT-x is not enabled. This can be resolved by enabling virtualization in the BIOS during a reboot.

See also

To learn more about Oracle VirtualBox and decide whether or not it is a good fit, visit the following website and select Windows hosts to begin the download process: https://www.virtualbox.org/wiki/Downloads.

Installing and configuring Ubuntu Desktop for Google Cloud Platform

Previously, we saw how Ubuntu Desktop could be set up locally using VMWare Fusion. In this section, we will learn how to do the same on Google Cloud Platform.

Getting ready

The only requirement is a Google account username. Begin by logging in to your Google Cloud Platform using your Google account. Google provides a free 12-month subscription with $300 credited to your account. The setup will ask for your bank details; however, Google will not charge you for anything without explicitly letting you know first. Go ahead and verify your bank account and you are good to go.

How to do it...

Follow the steps in the recipe to configure Ubuntu Desktop for Google Cloud Platform:

  1. Once logged in to your Google Cloud Platform, access a dashboard that looks like the one in the following screenshot:
[Screenshot: Google Cloud Platform Dashboard]
  2. First, click on the product services button in the top-left-hand corner of your screen. In the drop-down menu, under Compute, click on VM instances, as shown in the following screenshot:
  3. Create a new instance and name it. We are naming it ubuntuvm1 in our case. Google Cloud automatically creates a project while launching an instance and the instance will be launched under a project ID. The project may be renamed if required.
  4. After clicking on Create Instance, select the zone/area you are located in.
  5. Select Ubuntu 16.04 LTS under the boot disk as this is the operating system that will be installed in the cloud. Please note that LTS stands for long-term support, meaning that this version will be supported by Ubuntu's developers for an extended period.
  6. Next, under the boot disk options, select SSD persistent disk and increase the size to 50 GB for some added storage space for the instance, as shown in the following screenshot:
  7. Next, set Access scopes to Allow full access to all Cloud APIs.
  8. Under firewall, please check to allow HTTP traffic as well as allow HTTPS traffic, as shown in the following screenshot:
[Screenshot: Selecting the options Allow HTTP traffic and Allow HTTPS traffic]
  9. Once the instance is configured as shown in this section, go ahead and create the instance by clicking on the Create button.
After clicking on the Create button, you will notice that the instance gets created with a unique internal as well as external IP address. We will require this at a later stage. SSH stands for Secure Shell, which is an encrypted way of communicating in client-server architectures. Think of it as data going to and from your laptop, as well as going to and from Google's cloud servers, through an encrypted tunnel.
  10. Click on the newly created instance. From the drop-down menu, click on open in browser window, as shown in the following screenshot:
  11. You will see that Google opens up a shell/terminal in a new window, as shown in the following screenshot:
  12. Once the shell is open, you should have a window that looks like the following screenshot:
  13. Type the following commands in the Google cloud shell:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install gnome-shell
$ sudo apt-get install ubuntu-gnome-desktop
$ sudo apt-get install autocutsel
$ sudo apt-get install gnome-core
$ sudo apt-get install gnome-panel
$ sudo apt-get install gnome-themes-standard
  14. When presented with a prompt to continue or not, type y and press ENTER, as shown in the following screenshot:
  15. Once done with the preceding steps, type the following commands to set up the vncserver and allow connections to the local shell:
$ sudo apt-get install tightvncserver
$ touch ~/.Xresources
  16. Next, launch the server by typing the following command:
$ tightvncserver
  17. This will prompt you to enter a password, which will later be used to log in to the Ubuntu Desktop virtual machine. This password is limited to eight characters and needs to be set and verified, as shown in the following screenshot:
  18. A startup script is automatically generated by the shell, as shown in the following screenshot. This startup script can be accessed and edited by copying and pasting its path in the following manner:
  19. In our case, the command to view and edit the script is:
$ vim /home/amrith2kmeanmachine/.vnc/xstartup

This path will differ from case to case, so make sure you use your own. The vim command opens the script in the Vim text editor inside the shell.

The shell generated a startup script as well as a log file. The startup script needs to be opened and edited in a text editor, which will be discussed next.
  20. After typing the vim command, the screen with the startup script should look something like this screenshot:
  21. Type i to enter INSERT mode. Next, delete all the text in the startup script. It should then look like the following screenshot:
  22. Copy and paste the following code into the startup script:
#!/bin/sh
autocutsel -fork
xrdb $HOME/.Xresources
xsetroot -solid grey
export XKL_XMODMAP_DISABLE=1
export XDG_CURRENT_DESKTOP="GNOME-Flashback:Unity"
export XDG_MENU_PREFIX="gnome-flashback-"
unset DBUS_SESSION_BUS_ADDRESS
gnome-session --session=gnome-flashback-metacity --disable-acceleration-check --debug &
  23. The script should appear in the editor, as seen in the following screenshot:
  24. Press Esc to exit out of INSERT mode and type :wq to write and quit the file.
  25. Once the startup script has been configured, type the following command in the Google shell to kill the server so that the changes take effect on the next start:
$ vncserver -kill :1
  26. This command should produce a process ID that looks like the one in the following screenshot:
  27. Start the server again by typing the following command:
$ vncserver -geometry 1024x640

The next series of steps will focus on setting up a secure shell tunnel into the Google Cloud instance from the local host. Before typing anything on the local shell/terminal, ensure that the Google Cloud SDK is installed. If it is not already installed, do so by following the instructions in this quick-start guide located at the following website:

https://cloud.google.com/sdk/docs/quickstart-mac-os-x

  28. Once the Google Cloud SDK is installed, open up the terminal on your machine and type the following commands to connect to the Google Cloud compute instance:
$ gcloud compute ssh \
YOUR_INSTANCE_NAME \
--project YOUR_PROJECT_ID \
--zone YOUR_ZONE \
--ssh-flag "-L 5901:localhost:5901"
  29. Ensure that the instance name, project ID, and zone are specified correctly in the preceding commands. On pressing ENTER, the output on the local shell changes to what is shown in the following screenshot:
  30. Once you see the name of your instance followed by ":~$", it means that a connection has successfully been established between the local host/laptop and the Google Cloud instance. After successfully SSHing into the instance, we require software called VNC Viewer to view and interact with the Ubuntu Desktop that has now been successfully set up on the Google Cloud Compute engine. The following few steps will discuss how this is achieved.
  31. VNC Viewer may be downloaded using the following link:

https://www.realvnc.com/en/connect/download/viewer/

  32. Once installed, click to open VNC Viewer and in the search bar, type in localhost::5901, as shown in the following screenshot:
  33. Next, click on continue when prompted with the following screen:
  34. This will prompt you to enter your password for the virtual machine. Enter the password that you set earlier while launching the tightvncserver command for the first time, as shown in the following screenshot:
  35. You will finally be taken into the desktop of your Ubuntu virtual machine on Google Cloud Compute. Your Ubuntu Desktop screen should now look something like the following screenshot when viewed on VNC Viewer:

How it works...

You have now successfully set up VNC Viewer for interactions with the Ubuntu virtual machine/desktop. Anytime the Google Cloud instance is not in use, it is recommended to suspend or shut down the instance so that additional costs are not being incurred. The cloud approach is optimal for developers who may not have access to physical resources with high memory and storage.
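
The tunnel used in this recipe forwards local port 5901 to port 5901 on the instance, which is where VNC display :1 listens (a VNC display numbered N uses TCP port 5900 + N). The following is a minimal sketch, not part of the original recipe, that assembles the same gcloud command for any display number. The function names are hypothetical, and the project ID and zone in the usage lines are placeholder values.

```python
import shlex

VNC_BASE_PORT = 5900  # VNC display :N listens on TCP port 5900 + N


def vnc_port(display):
    """Return the TCP port used by the given VNC display number."""
    return VNC_BASE_PORT + display


def build_tunnel_command(instance, project, zone, display=1):
    """Build the gcloud argument list that opens an SSH session and
    forwards the local VNC port to the same port on the instance."""
    port = vnc_port(display)
    return [
        "gcloud", "compute", "ssh", instance,
        "--project", project,
        "--zone", zone,
        "--ssh-flag", "-L {0}:localhost:{0}".format(port),
    ]


# Placeholder project ID and zone; substitute your own values.
command = build_tunnel_command("ubuntuvm1", "my-project-id", "us-east1-b")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

Keeping the arguments in a list rather than a single string makes it straightforward to pass them to subprocess without worrying about shell quoting.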

There's more...

While we discussed Google Cloud as a cloud option for Spark, it is possible to leverage Spark on the following cloud platforms as well:

  • Microsoft Azure
  • Amazon Web Services

See also

In order to learn more about Google Cloud Platform and sign up for a free subscription, visit the following website:

https://cloud.google.com/

Installing and configuring Spark and prerequisites on Ubuntu Desktop

Before Spark can get up and running, there are some necessary prerequisites that need to be installed on a newly minted Ubuntu Desktop. This section will focus on installing and configuring the following on Ubuntu Desktop:

  • Java 8 or higher
  • Anaconda
  • Spark

Getting ready

The only requirement for this section is having administrative rights to install applications onto the Ubuntu Desktop.

How to do it...

This section walks through the steps in the recipe to install Java, Anaconda (which provides Python 3), and Spark on Ubuntu Desktop:

  1. Java is installed on Ubuntu through the terminal application, which can be found by searching for the app and can then be locked to the launcher on the left-hand side, as seen in the following screenshot:
  2. Perform an initial test for Java on the virtual machine by executing the following command at the terminal:
$ java -version
  3. Execute the following four commands at the terminal to install Java:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
  4. After accepting the necessary license agreements for Oracle, perform a secondary test of Java on the virtual machine by executing java -version once again in the terminal. A successful installation for Java will display the following outcome in the terminal:
$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
  5. Next, install the most recent version of Anaconda. Current versions of Ubuntu Desktop come preinstalled with Python. While it is convenient that Python comes preinstalled with Ubuntu, the installed version is Python 2.7, as seen in the following output:
$ python --version
Python 2.7.12
  6. Download the Anaconda installer for Linux. At the time of writing, the current version of Anaconda is v4.4 and the current version of Python 3 is v3.6. Once downloaded, view the Anaconda installation file by accessing the Downloads folder using the following commands:
$ cd Downloads/
~/Downloads$ ls
Anaconda3-4.4.0-Linux-x86_64.sh
  7. Once in the Downloads folder, initiate the installation for Anaconda by executing the following command:
~/Downloads$ bash Anaconda3-4.4.0-Linux-x86_64.sh
Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue

Please note that the version of Anaconda, as well as any other software installed, may differ as newer updates are released to the public. The version of Anaconda that we are using in this chapter and in this book can be downloaded from https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
  8. Once the Anaconda installation is complete, restart the Terminal application to confirm that Python 3 is now the default Python environment through Anaconda by executing python --version in the terminal:
$ python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
  9. The Python 2 version is still available under Linux, but will require an explicit call when executing a script, as seen in the following command:
~$ python2 --version
Python 2.7.12
  10. Visit the following website to begin the Spark download and installation process:

https://spark.apache.org/downloads.html

  11. Select the download link. The following file will be downloaded to the Downloads folder in Ubuntu:

spark-2.2.0-bin-hadoop2.7.tgz

  12. View the file at the terminal level by executing the following commands:
$ cd Downloads/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
  13. Extract the tgz file by executing the following command:
~/Downloads$ tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
  14. Another look at the Downloads directory using ls shows both the tgz file and the extracted folder:
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7 spark-2.2.0-bin-hadoop2.7.tgz
  15. Move the extracted folder from the Downloads folder to the Home folder by executing the following commands:
~/Downloads$ mv spark-2.2.0-bin-hadoop2.7 ~/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
~/Downloads$ cd
~$ ls
anaconda3 Downloads Pictures Templates
Desktop examples.desktop Public Videos
Documents Music spark-2.2.0-bin-hadoop2.7
  16. Now, the spark-2.2.0-bin-hadoop2.7 folder has been moved to the Home folder, which can be viewed when selecting the Files icon on the left-hand side toolbar, as seen in the following screenshot:
  17. Spark is now installed. Initiate Spark from the terminal by executing the following commands:
~$ cd ~/spark-2.2.0-bin-hadoop2.7/
~/spark-2.2.0-bin-hadoop2.7$ ./bin/pyspark
  18. Perform a final test at the PySpark prompt by executing the following command to confirm that the SparkContext is driving the cluster in the local environment:
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
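
The java -version check used earlier in this recipe can also be scripted. The following is a minimal sketch, not part of the original recipe, that extracts the major Java version from that command's output. The function names are hypothetical. It relies on two facts: java -version writes to standard error rather than standard output, and releases before Java 9 report themselves as 1.x (so 1.8.0_144 is Java 8).

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.
    Returns None when no version string is found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Releases before Java 9 use the 1.x scheme, for example 1.8.0_144.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and parse its output.
    Returns None when no java executable is on the PATH."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the version banner goes to stderr
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    return java_major_version(result.stdout)


# Usage:
# major = installed_java_major()
# print("Java 8 or higher found" if major and major >= 8 else "Install Java first")
```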

How it works...

This section explains the reasoning behind the installation process for Java, Anaconda, and Spark.

  1. Because Spark runs on the Java virtual machine (JVM), the Java Software Development Kit (SDK) is a prerequisite installation for Spark to run on an Ubuntu virtual machine.
In order for Spark 2.2.0 to run on a local machine or in a cluster, Java 8 or higher is required.
  2. Ubuntu recommends the sudo apt install method for Java as it ensures that packages downloaded are up to date.
  3. Please note that if Java is not currently installed, the output in the terminal will show the following message:
    The program 'java' can be found in the following packages:
    * default-jre
    * gcj-5-jre-headless
    * openjdk-8-jre-headless
    * gcj-4.8-jre-headless
    * gcj-4.9-jre-headless
    * openjdk-9-jre-headless
    Try: sudo apt install <selected package>
  4. While Python 2 is fine, it is considered legacy Python. Python 2 is facing an end of life date in 2020; therefore, it is recommended that all new Python development be performed with Python 3, as will be the case in this publication. Up until recently, Spark was only available with Python 2. That is no longer the case. Spark works with both Python 2 and 3. A convenient way to install Python 3, as well as many dependencies and libraries, is through Anaconda. Anaconda is a free and open source distribution of Python, as well as R. Anaconda manages the installation and maintenance of many of the most common packages used in Python for data science-related tasks.
  5. During the installation process for Anaconda, it is important to confirm the following conditions:

    • Anaconda is installed in the /home/username/anaconda3 location
    • The Anaconda installer prepends the anaconda3 install location to PATH in /home/username/.bashrc
  6. After Anaconda has been installed, download Spark. Unlike Python, Spark does not come preinstalled on Ubuntu and therefore will need to be downloaded and installed.
  7. For the purposes of development with deep learning, the following preferences will be selected for Spark:

    • Spark release: 2.2.0 (Jul 11 2017)
    • Package type: Prebuilt for Apache Hadoop 2.7 and later
    • Download type: Direct download
  8. Once Spark has been successfully installed, the output from executing Spark at the command line should look something similar to that shown in the following screenshot:
  9. Two important features to note when initializing Spark are that it runs under Python 3.6.1 | Anaconda 4.4.0 (64-bit) and that the Spark logo shows version 2.2.0.
  10. Congratulations! Spark is successfully installed on the local Ubuntu virtual machine. But not everything is complete. Spark development is best when Spark code can be executed within a Jupyter notebook, especially for deep learning. Thankfully, Jupyter has been installed with the Anaconda distribution performed earlier in this section.
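
One way to confirm the Anaconda conditions listed above is to check which directory the python command resolves to. The following is a minimal sketch, not part of the original recipe, using shutil.which from the standard library. The function name is hypothetical, and /home/username/anaconda3/bin in the usage comment is a placeholder for your own install location.

```python
import os
import shutil


def resolves_inside(command, directory, path_value=None):
    """Return True if `command` resolves to an executable inside `directory`.

    `path_value` is a PATH-style string; when it is None, the PATH of the
    current environment is searched, as a shell would do.
    """
    found = shutil.which(command, path=path_value)
    if found is None:
        return False
    # Compare real paths so that symbolic links do not cause false negatives.
    return os.path.dirname(os.path.realpath(found)) == os.path.realpath(directory)


# Usage (placeholder location; substitute your own home directory):
# print(resolves_inside("python", "/home/username/anaconda3/bin"))
```

Because the installer prepends its directory to PATH, the Anaconda interpreter is found before the system one. If the check returns False after installation, the export line in .bashrc is the first thing to inspect.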

There's more...

You may be asking why we did not just use pip install pyspark to use Spark in Python. Previous versions of Spark required going through the installation process that we followed in this section. Versions of Spark starting with 2.2.0 allow installation directly through the pip approach. We used the full installation method in this section to ensure that you will be able to get Spark installed and fully integrated, in case you are using an earlier version of Spark.

See also

Integrating Jupyter notebooks with Spark

When learning Python for the first time, it is useful to use Jupyter notebooks as an interactive development environment. This is one of the main reasons why Anaconda is so powerful: it fully integrates all of the dependencies between Python and Jupyter notebooks. The same can be done with PySpark and Jupyter notebooks. While Spark is written in Scala, PySpark exposes Spark's API in Python, so Spark code can be written in Python instead.

Getting ready

Most of the work in this section will just require accessing the .bashrc script from the terminal.

How to do it...

PySpark is not configured to work within Jupyter notebooks by default, but a slight tweak of the .bashrc script can remedy this issue. We will walk through these steps in this section:

  1. Access the .bashrc script by executing the following command:
$ nano ~/.bashrc
  2. Scrolling all the way to the end of the script should reveal the last command modified, which should be the PATH set by Anaconda during the installation in the previous section. The PATH should appear as seen in the following:
# added by Anaconda3 4.4.0 installer
export PATH="/home/asherif844/anaconda3/bin:$PATH"
  3. Underneath the PATH added by the Anaconda installer, add a custom function that connects the Spark installation with the Jupyter notebook installation from Anaconda3. For the purposes of this chapter and the remaining chapters, we will name that function sparknotebook. The configuration should appear as the following for sparknotebook():
function sparknotebook()
{
  # Location of the Spark installation from the previous recipe
  export SPARK_HOME=/home/asherif844/spark-2.2.0-bin-hadoop2.7
  # Use Python 3 for PySpark
  export PYSPARK_PYTHON=python3
  # Launch the PySpark driver through Jupyter, opening a notebook
  export PYSPARK_DRIVER_PYTHON=jupyter
  export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
  $SPARK_HOME/bin/pyspark
}
  4. The updated .bashrc script should look like the following once saved:
  5. Save and exit from the .bashrc file. Apply the updated .bashrc file to the current session by executing the following command, or restart the terminal application:
$ source ~/.bashrc

How it works...

Our goal in this section is to integrate Spark directly into a Jupyter notebook so that we are not doing our development at the terminal and instead utilizing the benefits of developing within a notebook. This section explains how the Spark integration within a Jupyter notebook takes place.

  1. We will create a command function, sparknotebook, that we can call from the terminal to open up a Spark session through Jupyter notebooks from the Anaconda installation. This requires two settings in the .bashrc file:
    1. The Python interpreter for PySpark is set to Python 3
    2. The PySpark driver for Python is set to Jupyter
  2. The sparknotebook function can now be accessed directly from the terminal by executing the following command:
$ sparknotebook
  3. The function should then initiate a brand new Jupyter notebook session through the default web browser. A new Python notebook with a .ipynb extension can be created by clicking on the New button on the right-hand side and by selecting Python 3 under Notebook, as seen in the following screenshot:
  4. Once again, just as was done at the terminal level for Spark, a simple script of sc will be executed within the notebook to confirm that Spark is up and running through Jupyter:
  5. Ideally, the Version, Master, and AppName should be identical to the earlier output when sc was executed at the terminal. If this is the case, then PySpark has been successfully installed and configured to work with Jupyter notebooks.
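
What the sparknotebook function does can also be expressed in Python, which may make the mechanism easier to see. The following is a minimal sketch, not part of the original recipe: it builds the same environment variables and the path to the pyspark launcher. The function name is hypothetical and the Spark location in the usage comment is a placeholder. It rests on the documented behavior that the pyspark launcher starts the program named in PYSPARK_DRIVER_PYTHON with the options in PYSPARK_DRIVER_PYTHON_OPTS as the driver.

```python
import os


def sparknotebook_launch(spark_home, base_env=None):
    """Return (command, env) equivalent to the sparknotebook shell function.

    `base_env` defaults to a copy of the current environment so that PATH
    and other existing settings are preserved.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "SPARK_HOME": spark_home,
        "PYSPARK_PYTHON": "python3",             # interpreter used by PySpark
        "PYSPARK_DRIVER_PYTHON": "jupyter",      # driver program to start
        "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",  # arguments passed to it
    })
    command = [os.path.join(spark_home, "bin", "pyspark")]
    return command, env


# Usage (placeholder location; substitute your own Spark folder):
# import subprocess
# command, env = sparknotebook_launch("/home/username/spark-2.2.0-bin-hadoop2.7")
# subprocess.call(command, env=env)
```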

There's more...

It is important to note that if we were to call a Jupyter notebook through the terminal without specifying sparknotebook, our Spark session will never be initiated and we will receive an error when executing the SparkContext script.

We can access a traditional Jupyter notebook by executing the following at the terminal:

$ jupyter-notebook

Once we start the notebook, we can try to execute the same script for sc.master as we did previously, but this time we will receive the following error:

See also

Starting and configuring a Spark cluster

    For most chapters, one of the first things that we will do is to initialize and configure our Spark cluster.

    Getting ready

    Import the following before initializing the cluster:

    • from pyspark.sql import SparkSession

    How to do it...

    This section walks through the steps to initialize and configure a Spark cluster.

    1. Import SparkSession using the following script:
    from pyspark.sql import SparkSession
    2. Configure SparkSession with a variable named spark using the following script:
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("GenericAppName") \
        .config("spark.executor.memory", "6gb") \
        .getOrCreate()

    How it works...

    This section explains how the SparkSession works as an entry point to develop within Spark.

    1. Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext explicitly to begin development in Spark. Building a SparkSession handles that initialization. Additionally, it is important to note that SparkSession is part of the sql module from pyspark.
    2. We can assign properties to our SparkSession:
      1. master: sets the Spark master URL; local[*] runs on our local machine with the maximum available number of cores
      2. appName: assigns a name to the application
      3. config: assigns 6gb to spark.executor.memory
      4. getOrCreate: creates a SparkSession if one is not available and retrieves the existing one if it is
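The properties listed above can be read back from a running session through its SparkContext. The sketch below is illustrative only: summarize_session is a hypothetical helper, and the stand-in objects exist so the example runs without Spark installed. With a real session, call summarize_session(spark) instead.

```python
from types import SimpleNamespace


def summarize_session(spark):
    """Collect the properties set on the builder from a session object."""
    context = spark.sparkContext
    return {
        "master": context.master,
        "appName": context.appName,
        # getConf() returns the SparkConf; get() looks up a single key.
        "executor_memory": context.getConf().get("spark.executor.memory"),
    }


# Stand-ins that mimic the attributes used above (not real Spark objects).
stub_context = SimpleNamespace(
    master="local[*]",
    appName="GenericAppName",
    getConf=lambda: {"spark.executor.memory": "6gb"},
)
stub_session = SimpleNamespace(sparkContext=stub_context)

print(summarize_session(stub_session))
```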

    There's more...

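    For development purposes, while we are building an application on smaller datasets, we can just use master("local"), which runs Spark with a single worker thread. When performance matters, we would want to specify master("local[*]") so that Spark uses as many worker threads as there are cores on the machine. Both settings still run on a single machine; a deployment on a real cluster would instead pass the URL of its cluster manager to master.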
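To make the difference between the local master settings concrete, the following sketch maps a master string to the number of worker threads it implies. The function local_worker_threads is a hypothetical helper written for illustration, not part of PySpark, and it only handles the plain local, local[N], and local[*] forms.

```python
import os
import re


def local_worker_threads(master):
    """Return the worker-thread count implied by a local master URL, or None."""
    if master == "local":
        return 1  # a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None  # not a plain local master, for example a cluster URL
    token = match.group(1)
    # "*" means one thread per logical core on this machine.
    return os.cpu_count() if token == "*" else int(token)


for master in ("local", "local[4]", "local[*]", "yarn"):
    print(master, "->", local_worker_threads(master))
```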

    See also

    Stopping a Spark cluster

    Once we are done developing on our cluster, it is good practice to shut it down to free up resources.

    How to do it...

    This section walks through the steps to stop the SparkSession.

    1. Execute the following script:

    spark.stop()

    2. Confirm that the session has closed by executing the following script:

    sc.master

    How it works...

    This section explains how to confirm that a Spark cluster has been shut down.

    1. If the cluster has been shut down, you will receive the error message seen in the following screenshot when executing another Spark command in the notebook:

    There's more...

    Shutting down Spark clusters may not be as critical when working in a local environment; however, it will prove costly when Spark is deployed in a cloud environment where you are charged for compute power.
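One way to make sure a session is always stopped, even when a cell raises an exception, is to wrap it in a context manager. This is a minimal sketch under stated assumptions: stopping is a hypothetical helper, and FakeSession is a stand-in used so the example runs without Spark. A real SparkSession could be passed in its place, since the helper only relies on the stop() method used in this recipe.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and call its stop() method on exit, even on error."""
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a SparkSession so the sketch runs without Spark."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeSession()
with stopping(fake) as spark:
    pass  # Spark work would go here

print(fake.stopped)  # True
```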

     


    Key benefits

    • Train distributed complex neural networks on Apache Spark
    • Use TensorFlow and Keras to train and deploy deep learning models
    • Explore practical tips to enhance performance

    Description

    Organizations these days need to integrate popular big data tools such as Apache Spark with highly efficient deep learning libraries if they’re looking to gain faster and more powerful insights from their data. With this book, you’ll discover over 80 recipes to help you train fast, enterprise-grade, deep learning models on Apache Spark. Each recipe addresses a specific problem, and offers a proven, best-practice solution to difficulties encountered while implementing various deep learning algorithms in a distributed environment. The book follows a systematic approach, featuring a balance of theory and tips with best practice solutions to assist you with training different types of neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). You’ll also have access to code written in TensorFlow and Keras that you can run on Spark to solve a variety of deep learning problems in computer vision and natural language processing (NLP), or tweak to tackle other problems encountered in deep learning. By the end of this book, you'll have the skills you need to train and deploy state-of-the-art deep learning models on Apache Spark.

    Who is this book for?

    If you’re looking for a practical resource for implementing efficiently distributed deep learning models with Apache Spark, then this book is for you. Knowledge of core machine learning concepts and a basic understanding of the Apache Spark framework is required to get the most out of this book. Some knowledge of Python programming will also be useful.

    What you will learn

    • Set up a fully functional Spark environment
    • Understand practical machine learning and deep learning concepts
    • Employ built-in machine learning libraries within Spark
    • Discover libraries that are compatible with TensorFlow and Keras
    • Explore NLP models such as word2vec and TF-IDF on Spark
    • Organize DataFrames for deep learning evaluation
    • Apply testing and training modeling to ensure accuracy
    • Access readily available code that can be reused

    Product Details

    Publication date: Jul 13, 2018
    Length: 474 pages
    Edition: 1st
    Language: English
    ISBN-13: 9781788471558


    Table of Contents

    14 Chapters
    Setting Up Spark for Deep Learning Development
    Creating a Neural Network in Spark
    Pain Points of Convolutional Neural Networks
    Pain Points of Recurrent Neural Networks
    Predicting Fire Department Calls with Spark ML
    Using LSTMs in Generative Networks
    Natural Language Processing with TF-IDF
    Real Estate Value Prediction Using XGBoost
    Predicting Apple Stock Market Cost with LSTM
    Face Recognition Using Deep Convolutional Networks
    Creating and Visualizing Word Vectors Using Word2Vec
    Creating a Movie Recommendation Engine with Keras
    Image Classification with TensorFlow on Spark
    Other Books You May Enjoy

    Customer reviews

    Rating: 1.7 out of 5 (6 Ratings)
    5 star 16.7%
    4 star 0%
    3 star 0%
    2 star 0%
    1 star 83.3%

    Adnan Masood, PhD Aug 16, 2018
    Rating: 5 out of 5
    The tremendous impact of Artificial Intelligence and Machine learning, and the uncanny effectiveness of deep neural networks are hard to escape in both academia and industry. Meanwhile implementation grade material outlining the deep learning using Spark are not always easy to find. The manuscript you are holding in your hands (or in your e-reader) is an problem-solution oriented approach which not only shows Spark’s capabilities but also the art of possible around various machine learning and deep learning problems. Full disclosure, I am the technical reviewer of the book and wrote the foreword. It was a pleasure reading and reviewing the Ahmed Sherif and Amrith Ravindra’s work which I hope you as a reader will also find very compelling. The authors begin with helping to set up Spark for Deep Learning development by providing clear and concisely written recipes. The initial setup is naturally followed by creating a neural network, elaborating on pain points of Convolutional Neural Networks, and Recurrent Neural Networks. Later on, authors provided practical (yet simplified) use cases of predicting fire department calls with SparkML, real estate value prediction using XGBoost, predicting the stock market cost of Apple with LSTM, and creating a movie recommendation Engine with Keras. The book covers pertinent and highly relevant technologies with operational use cases like LSTMs in Generative Networks, natural language processing with TF-IDF, face recognition using deep convolutional networks, creating and visualizing word vectors using Word2Vec and image classification with TensorFlow on Spark. Beside crisp and focused writing, this wide array of highly relevant machine learning and deep learning topics give the book its core strength. I hope this book will help you leverage Apache Spark to tackle business and technology problems. Highly recommended reading.
    Amazon Verified review
    Todd Sep 06, 2023
    Rating: 1 out of 5
    $55 bucks… not worth it. There’s this website called google
    Amazon Verified review
    Bethany Sep 06, 2023
    Rating: 1 out of 5
    An absolutely disgrace from cover to cover. I’d rather eat the book than use the recipes.
    Amazon Verified review
    Amanda Sep 02, 2023
    Rating: 1 out of 5
    Wouldn’t recommend
    Amazon Verified review
    Patrick Sullivan Sep 06, 2023
    Rating: 1 out of 5
    Total Garbage. Not worth it
    Amazon Verified review
