Apache Spark 2.x Cookbook: Over 70 cloud-ready recipes for distributed Big Data processing and analytics



Apache Spark 2.x Cookbook

Developing Applications with Spark

In this chapter, we will cover the following recipes:

  • Exploring the Spark shell
  • Developing a Spark application in Eclipse with Maven
  • Developing a Spark application in Eclipse with SBT
  • Developing a Spark application in IntelliJ IDEA with Maven
  • Developing a Spark application in IntelliJ IDEA with SBT
  • Developing applications using the Zeppelin notebook
  • Setting up Kerberos to do authentication
  • Enabling Kerberos authentication for Spark


Developing a Spark application in IntelliJ IDEA with Maven

IntelliJ IDEA comes bundled with support for Maven. We will see how to create a new Maven project in this recipe.

How to do it...

Perform the following steps to develop a Spark application on IntelliJ IDEA with Maven:

  1. Select Maven in the new project window and click on Next.
  2. Enter the three dimensions of the project (the Maven coordinates: groupId, artifactId, and version).
  3. Enter the project name and location.
  4. Click on Finish and the Maven project will be ready.
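For orientation, the three dimensions entered in step 2 end up in the generated pom.xml. A hypothetical set of values (the names are illustrative, not from the book) would look like this:

```xml
<!-- Maven coordinates of the new project; values are illustrative -->
<groupId>com.example.spark</groupId>
<artifactId>wordcount</artifactId>
<version>1.0-SNAPSHOT</version>
```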

Introduction


Before we start this chapter, it is important that we discuss some trends that directly affect how we develop applications. 

Big data applications can be divided into the following three categories:

  • Batch
  • Interactive
  • Streaming or continuous applications

When Hadoop was designed, the primary focus was to provide cost-effective storage for large amounts of data. This remained the main show until it was upended by S3 and other cheaper, more reliable cloud storage alternatives. Compute on these large amounts of data in the Hadoop environment came primarily in the form of MapReduce jobs. Since Spark took the ball from Hadoop (OK! Snatched!) and started running with it, Spark also reflected a batch-oriented focus in its initial phase, but it did a better job than Hadoop of exploiting in-memory storage. 

Note

The most compelling factor of the success of Hadoop was that the cost of storage was hundreds of times lower than traditional data warehouse technologies, such as Teradata...

Exploring the Spark shell


Spark comes bundled with a read–eval–print loop (REPL) shell, which is a wrapper around the Scala shell. Though the Spark shell looks like a command line for simple things, in reality, a lot of complex queries can also be executed using it. Often, the Spark shell is used in the initial development phase; once the code has stabilized, it is written as a class file and bundled as a jar to be run using the spark-submit command. This chapter explores different development environments in which Spark applications can be developed.

How to do it...

Hadoop MapReduce's word count, which takes at least three class files and one configuration file, namely the project object model (POM), becomes very simple with the Spark shell. In this recipe, we are going to create a simple one-line text file, upload it to the Hadoop Distributed File System (HDFS), and use Spark to count the occurrences of words. Let's see how:

  1. Create the words directory using the following command:
$ mkdir words...
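The recipe is cut off here; as a rough sketch of where it leads (the file path is illustrative, not from the book), the word count itself takes only a few lines in the Spark shell:

```scala
// Inside spark-shell; `spark` (a SparkSession) is pre-created by the shell.
// The path below is illustrative -- substitute your own HDFS or local path.
val lines  = spark.read.textFile("words/words.txt")
val words  = lines.flatMap(_.split("\\s+"))       // split each line into words
val counts = words.groupByKey(identity).count()   // Dataset[(String, Long)]
counts.show()
```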

Developing a Spark application in Eclipse with Maven


Maven as a build tool has become the de facto standard over the years. This is not surprising if we look a little deeper into what Maven brings. Maven has two primary features:

  • Convention over configuration: Tools built prior to Maven gave developers the freedom to choose where to put source files, test files, compiled files, and so on. Maven takes away this freedom. Because of this, all the confusion about locations also disappears. In Maven, there is a specific directory structure for everything. The following table shows a few of the most common locations:

/src/main/scala       Source code in Scala
/src/main/java        Source code in Java
/src/main/resources   Resources to be used by the source code, such as configuration files
/src/test/scala       Test code in Scala
/src/test/java        Test code in Java
/src/test/resources   Resources to be used by the test code, such as configuration files

  • Declarative dependency management: In Maven, every library is defined...
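As a sketch of what a declarative dependency looks like (the version shown is one of the Spark 2.x releases; pick the one matching your cluster):

```xml
<!-- A Spark core dependency declared in pom.xml; Maven resolves it,
     along with its transitive dependencies, from the central repository. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.1.0</version>
</dependency>
```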

Developing a Spark application in Eclipse with SBT


SBT is a build tool made especially for Scala-based development. SBT follows Maven-based naming conventions and declarative dependency management.

SBT provides the following enhancements over Maven:

  • Dependencies are in the form of key-value pairs in the build.sbt file, as opposed to the pom.xml file in Maven
  • It provides a shell that makes it very handy to perform build operations
  • For simple projects without dependencies, you do not even need the build.sbt file

In the build.sbt file, the first line is the project definition:

lazy val root = (project in file("."))

Each project has an immutable map of key-value pairs. This map is changed by the settings in SBT, as follows:

lazy val root = (project in file(".")). 
  settings( 
    name := "wordcount" 
  ) 

Every change in the settings field leads to a new map, as it's an immutable map.
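Putting this together, a minimal build.sbt for a Spark project might look as follows (the Scala and Spark versions are illustrative):

```scala
// build.sbt -- each settings call produces a new immutable map
lazy val root = (project in file(".")).
  settings(
    name := "wordcount",
    scalaVersion := "2.11.8",
    // %% appends the Scala binary version to the artifact name (spark-core_2.11)
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
  )
```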

How to do it...

Here's how we go about adding the sbteclipse plugin:

  1. Add this to the global plugin file:
$ mkdir /home...
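The command above is cut off; for orientation, sbt global plugins conventionally live under ~/.sbt/&lt;sbt-version&gt;/plugins/, and the sbteclipse plugin is added with a single line (the version shown is illustrative):

```scala
// ~/.sbt/0.13/plugins/plugins.sbt (the path varies with your sbt version)
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.1.0")
```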

Key benefits

  • Contains quick solutions to even the most complex Big Data processing problems using Apache Spark
  • Leverage the power of Apache Spark as a unified compute engine and perform streaming analytics, machine learning and graph processing with ease
  • From installing and setting up Spark to fine-tuning its performance, this practical guide is all you need to become a master in using Apache Spark

Description

While Apache Spark 1.x gained a lot of traction and adoption in its early years, Spark 2.x delivers notable improvements in the areas of API, schema awareness, performance, Structured Streaming, and simplified building blocks for building better, faster, smarter, and more accessible big data applications. This book uncovers all these features in the form of structured recipes to analyze and master large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will learn to set up development environments. Further on, you will be introduced to working with RDDs, DataFrames, and Datasets to operate on schema-aware data, and to real-time streaming with various sources such as the Twitter stream and Apache Kafka. You will also work through recipes on machine learning, including supervised learning, unsupervised learning, and recommendation engines in Spark. Last but not least, the final few chapters delve deeper into the concepts of graph processing using GraphX, securing your implementations, cluster optimization, and troubleshooting.

Who is this book for?

This book is for data engineers, data scientists, and Big Data professionals who want to leverage the power of Apache Spark 2.x for real-time Big Data processing. If you’re looking for quick solutions to common problems while using Spark 2.x effectively, this book will also help you. The book assumes you have a basic knowledge of Scala as a programming language.

What you will learn

  • Install and configure Apache Spark with various cluster managers and on AWS
  • Set up a development environment for Apache Spark, including the Databricks Cloud notebook
  • Find out how to operate on data in Spark with schemas
  • Get to grips with real-time streaming analytics using Spark Streaming and Structured Streaming
  • Master supervised learning and unsupervised learning using MLlib
  • Build a recommendation engine using MLlib
  • Process graphs using the GraphX and GraphFrames libraries
  • Develop a set of common applications or project types, and solutions that solve complex big data problems

Product Details

Publication date : May 31, 2017
Length: 294 pages
Edition : 1st
Language : English
ISBN-13 : 9781787127517
Vendor : Apache





Table of Contents

12 Chapters
Getting Started with Apache Spark
Developing Applications with Spark
Spark SQL
Working with External Data Sources
Spark Streaming
Getting Started with Machine Learning
Supervised Learning with MLlib — Regression
Supervised Learning with MLlib — Classification
Unsupervised Learning
Recommendations Using Collaborative Filtering
Graph Processing Using GraphX and GraphFrames
Optimizations and Performance Tuning

Customer reviews

Rating distribution
3.3 out of 5
(3 Ratings)
5 star 33.3%
4 star 33.3%
3 star 0%
2 star 0%
1 star 33.3%
Delayen Weeden Jun 13, 2017
Rated 5 out of 5
I wanted to get a better understanding of graph processing and I was directed to this book. It answered alot of questions that I had, and it informed me on other areas that I needed to know. Its a great blueprint for people who work in this field. Very informative
Amazon Verified review Amazon
S. Jamal Sep 22, 2017
Rated 4 out of 5
Not bad, but needs editing. I have the print version - there are duplicate sentences in paragraphs that obviously should have been removed. Other than that, this is a decent reference and a good introductory book. It is not deep as regards Data Science (but perhaps it is not meant to be).
Amazon Verified review Amazon
TC Dec 11, 2018
Rated 1 out of 5
Compared with other "Cookbook": No depth. More like an introduction concept book.
Amazon Verified review Amazon

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, then clicking on the link will download and open the PDF file directly. If you don't, save the PDF file on your machine and download Adobe Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you the reader with our need to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment can be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in the Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower priced than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.