Spark for Data Science: Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

By Duvvuri and Singhal
Paperback Sep 2016 344 pages 1st Edition
Chapter 2. The Spark Programming Model

Large-scale data processing using thousands of nodes with built-in fault tolerance has become widespread due to the availability of open source frameworks, with Hadoop being a popular choice. These frameworks are quite successful in executing specific tasks such as Extract, Transform, and Load (ETL) and storage applications that deal with web-scale data. However, developers were left with a myriad of tools to work with, along with the well-established Hadoop ecosystem. There was a need for a single, general-purpose development platform that caters to batch, streaming, interactive, and iterative requirements. This was the motivation behind Spark.

The previous chapter outlined the big data analytics challenges and how Spark addressed most of them at a very high level. In this chapter, we will examine the design goals and choices involved in the making of Spark to get a clearer understanding of its suitability as a data science platform for big...

The programming paradigm

For Spark to address the big data challenges and serve as a platform for data science and other scalable applications, it was built with well-thought-out design considerations and language support.

Spark offers APIs that let different kinds of application developers build Spark-based applications through standard interfaces. It provides APIs for the Scala, Java, R, and Python programming languages, as explained in the following sections.

Supported programming languages

With built-in support for so many languages, Spark can be used interactively through a shell, also known as a Read-Evaluate-Print Loop (REPL), in a way that will feel familiar to developers of any language. Developers can use the language of their choice, leverage existing libraries, and interact seamlessly with Spark and its ecosystem. Let us look at the languages Spark supports and how they fit into the Spark ecosystem.

Scala

Spark itself is written in Scala, a Java Virtual Machine (JVM...

The Spark engine

To program with Spark, a basic understanding of Spark components is needed. In this section, some of the important Spark components along with their execution mechanism will be explained so that developers and data scientists can write programs and build applications.

Before getting into the details, we suggest you take a look at the following diagram so that the descriptions of the Spark gears are more comprehensible as you read further:

[Figure: The Spark engine]

Driver program

The Spark shell is an example of a driver program. A driver program is a process that runs in the JVM and executes the user's main function. It holds a SparkContext object, which is a connection to the underlying cluster manager. A Spark application starts when the driver starts and completes when the driver stops. Through its SparkContext instance, the driver coordinates all processes within a Spark application.

Primarily, an RDD lineage Directed Acyclic Graph (DAG) is built on the driver side with data sources...

The RDD API

The RDD is a read-only, partitioned, fault-tolerant collection of records. From a design perspective, there was a need for a single data structure abstraction that hides the complexity of dealing with a wide variety of data sources, be it HDFS, filesystems, RDBMS, NoSQL data stores, or any other data source. The user should be able to define an RDD from any of these sources. The goal was to support a wide array of operations and let users compose them in any order.

RDD basics

Each dataset is represented in Spark's programming interface as an object called an RDD. Spark provides two ways to create RDDs. One way is to parallelize an existing collection. The other is to reference a dataset in an external storage system such as a filesystem.
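To make the idea of parallelizing an existing collection concrete, here is a minimal pure-Python sketch. It does not use Spark; `toy_parallelize` is a hypothetical helper invented for illustration, and the slicing scheme is an assumption rather than a description of Spark's internals. It only shows what it means to split one local collection into several partitions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def toy_parallelize(data: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Split a local collection into contiguous, roughly equal partitions.

    Hypothetical helper for illustration only; this is not the Spark API.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(data)
    # Partition i covers the half-open index range [i*n/p, (i+1)*n/p).
    return [
        list(data[i * n // num_partitions:(i + 1) * n // num_partitions])
        for i in range(num_partitions)
    ]


partitions = toy_parallelize(list(range(10)), 3)
print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

In a real Spark shell, the predefined SparkContext plays this role, and each partition can be processed on a different node of the cluster.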

An RDD is composed of one or more data sources, possibly after a series of transformations involving several operators. Every RDD or RDD partition knows how to recreate itself in case of failure. It has the log of transformations...
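The idea that a dataset can recreate itself from a record of its transformations can be sketched in plain Python. The `ToyRDD` class below is a hypothetical toy model, not Spark's implementation or API: each object is read-only, remembers its parent and the step that derives it, and recomputes its contents on demand by replaying that chain from the source.

```python
class ToyRDD:
    """Toy stand-in for an RDD (illustrative only, not the Spark API)."""

    def __init__(self, source=None, parent=None, step=None, name="source"):
        self._source = source  # set only on the root dataset
        self._parent = parent  # the dataset this one is derived from
        self._step = step      # function turning parent rows into our rows
        self.name = name

    def map(self, fn):
        # Returns a new object; the existing one is never modified.
        return ToyRDD(parent=self,
                      step=lambda rows: [fn(r) for r in rows],
                      name="map")

    def filter(self, pred):
        return ToyRDD(parent=self,
                      step=lambda rows: [r for r in rows if pred(r)],
                      name="filter")

    def lineage(self):
        # The ordered log of steps from the source to this dataset.
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self.name]

    def compute(self):
        # Nothing is cached: contents are rebuilt by replaying the lineage,
        # which is how a lost result could be recreated after a failure.
        if self._parent is None:
            return list(self._source)
        return self._step(self._parent.compute())


base = ToyRDD(source=[1, 2, 3, 4, 5])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(evens_squared.lineage())  # ['source', 'map', 'filter']
print(evens_squared.compute())  # [4, 16]
```

Because every derived dataset keeps only a pointer to its parent and a function, losing a computed result costs nothing but the time needed to replay the chain.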

RDD operations

Spark programming usually starts with choosing an interface that you are comfortable with. If you intend to do interactive data analysis, a shell prompt is the obvious choice. Whether you pick the Python shell (PySpark) or the Scala shell (Spark-Shell) depends to some extent on your proficiency with these languages. If you are building a full-blown scalable application, proficiency matters a great deal, so you should develop the application in your language of choice among Scala, Java, and Python, and submit it to Spark. We will discuss this aspect in more detail later in the book.

Creating RDDs

In this section, we will use both a Python shell (PySpark) and a Scala shell (Spark-Shell) to create an RDD. Both of these shells have a predefined, interpreter-aware SparkContext that is assigned to a variable sc.

Let us get started with some simple code examples. Note that the code assumes the current working directory is Spark's home directory. The following...

Key benefits

  • Perform data analysis and build predictive models on huge datasets that leverage Apache Spark
  • Learn to integrate data science algorithms and techniques with the fast and scalable computing features of Spark to address big data challenges
  • Work through practical examples on real-world problems with sample code snippets

Description

This is the era of Big Data. The term ‘Big Data’ implies big innovation and promises a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it comes equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner to Big Data analytics, this book will provide you with the skills necessary to perform statistical data analysis, data visualization, and predictive modeling, and to build scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Who is this book for?

This book is for anyone who wants to leverage Apache Spark for data science and machine learning. If you are a technologist who wants to expand your knowledge to perform data science operations in Spark, or a data scientist who wants to understand how algorithms are implemented in Spark, or a newbie with minimal development experience who wants to learn about Big Data Analytics, this book is for you!

What you will learn

  • Consolidate, clean, and transform your data acquired from various data sources
  • Perform statistical analysis of data to find hidden insights
  • Explore graphical techniques to see what your data looks like
  • Use machine learning techniques to build predictive models
  • Build scalable data products and solutions
  • Start programming using the RDD, DataFrame and Dataset APIs
  • Become an expert by improving your data analytical skills

Product Details

Publication date: Sep 30, 2016
Length: 344 pages
Edition: 1st
Language: English
ISBN-13: 9781785885655

Table of Contents

11 Chapters
1. Big Data and Data Science – An Introduction
2. The Spark Programming Model
3. Introduction to DataFrames
4. Unified Data Access
5. Data Analysis on Spark
6. Machine Learning
7. Extending Spark with SparkR
8. Analyzing Unstructured Data
9. Visualizing Big Data
10. Putting It All Together
11. Building Data Science Applications
