As with the previous recipes, we will first specify where we are going to download the Spark binaries from and create all the relevant global variables we are going to use later.
Next, we read in the hosts.txt file:
function readIPs() {
  input="./hosts.txt"

  driver=0
  executors=0
  _executors=""

  IFS=''
  while read line
  do
    # skip empty lines
    if [[ -z "${line}" ]]; then
      continue
    fi
    if [[ "$driver" = "1" ]]; then
      _driverNode="$line"
      driver=0
    fi
    if [[ "$executors" = "1" ]]; then
      _executors=$_executors"$line\n"
    fi
    if [[ "$line" = "driver:" ]]; then
      driver=1
      executors=0
    fi
    if [[ "$line" = "executors:" ]]; then
      executors=1
      driver=0
    fi
  done < "$input"
}
We store the path to the file in the input variable. The driver and executors variables are flags we use to skip the "driver:" and "executors:" lines in the input file. The _executors empty string will accumulate the list of executors, delimited by the newline "\n".
IFS stands for Internal Field Separator. Whenever bash reads a line from a file, it splits it on that character. Here, we set it to an empty string '' so that we preserve the double spaces between the IP address and the hostname.
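To make the effect of IFS concrete, here is a standalone sketch (assuming bash) comparing reading a line with the default space separator against the empty IFS used in the script:

```shell
line="  192.168.1.160  pathfinder"

# with IFS set to a space, read trims leading/trailing whitespace
IFS=' ' read -r trimmed <<< "$line"

# with IFS set to the empty string, the line is preserved verbatim
IFS='' read -r verbatim <<< "$line"

echo "[$trimmed]"
echo "[$verbatim]"
```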
Next, we start reading the file, line-by-line. Let's see how the logic works inside the loop; we'll start a bit out of order so that the logic is easier to understand:
- If the line we just read equals "driver:" (the if [[ "$line" = "driver:" ]]; conditional), we set the driver flag to 1 so that when the next line is read, we store it as the _driverNode (this happens inside the if [[ "$driver" = "1" ]]; conditional). Inside that conditional, we also reset the executors flag to 0; the latter matters in case you list the executors first, followed by a single driver, in hosts.txt. Once the line with the driver node's information has been read, we reset the driver flag to 0.
- On the other hand, if the line we just read equals "executors:" (the if [[ "$line" = "executors:" ]]; conditional), we set the executors flag to 1 (and reset the driver flag to 0). This guarantees that each subsequent line read will be appended to the _executors string, separated by the "\n" newline character (this happens inside the if [[ "$executors" = "1" ]]; conditional). Note that we do not reset the executors flag to 0, as we allow for more than one executor.
- If we encounter an empty line (which we can check for in bash with the if [[ -z "${line}" ]]; conditional), we skip it.
You might notice that we use the "<" redirection operator (not a pipe) to read in the data from the file indicated by the input variable.
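For reference, a hosts.txt file in the format this function expects could look as follows (the driver's entry matches the example used in this recipe; the executor IPs and hostnames are placeholders for your own machines):

```
driver:
192.168.1.160  pathfinder

executors:
192.168.1.161  executor1
192.168.1.162  executor2
```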
Since Spark requires Java and Scala to work, next we have to check if Java is installed, and we will install Scala (as it normally isn't present while Java might be). This is achieved with the following functions:
function checkJava() {
  if type -p java; then
    echo "Java executable found in PATH"
    _java=java
  elif [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
    echo "Found Java executable in JAVA_HOME"
    _java="$JAVA_HOME/bin/java"
  else
    echo "No Java found. Install Java version $_java_required or higher first or specify JAVA_HOME variable that will point to your Java binaries."
    installJava
  fi
}

function installJava() {
  sudo apt-get install python-software-properties
  sudo add-apt-repository ppa:webupd8team/java
  sudo apt-get update
  sudo apt-get install oracle-java8-installer
}

function installScala() {
  sudo apt-get install scala
}

function installPython() {
  curl -O "$_python_binary"
  chmod 0755 ./"$_python_archive"
  sudo bash ./"$_python_archive" -b -u -p "$_python_destination"
}
The logic here doesn't differ much from what we presented in the Installing Spark requirements recipe. The only notable difference, in the checkJava function, is that if we do not find Java on the PATH or inside the JAVA_HOME folder, we do not exit, but run installJava instead.
There are many ways to install Java; we have already presented you with one of them earlier in this book (check the Installing Java section in the Installing Spark requirements recipe). Here, we use the built-in apt-get tool.
Note
The apt-get tool is a convenient, fast, and efficient utility for installing packages on your Linux machine. APT stands for Advanced Package Tool.
First, we install python-software-properties. This set of tools provides an abstraction over the apt repositories in use and makes it easy to manage both distribution and independent software vendor sources. We need it because, in the next line, we call add-apt-repository to add a new repository: we want the Oracle Java distribution. The sudo apt-get update command refreshes the contents of the repositories and, in our case, fetches all the packages available in ppa:webupd8team/java. Finally, we install the Java package: just follow the prompts on the screen. We install Scala the same way.
Note
The default location where the package installs is /usr/lib/jvm/java-8-oracle. If this is not the case, or if you want to install it in a different folder, you will have to alter the _java_destination variable inside the script to reflect the new destination.
The advantage of using apt-get is that if a Java or Scala environment is already installed on a machine, the tool will either skip the installation (if the environment is up to date with the version available on the server) or ask you to update to the newest version.
We will also install the Anaconda distribution of Python (as mentioned many times previously, we highly recommend this distribution). To achieve this goal, we must first download the Anaconda3-5.0.1-Linux-x86_64.sh script and then follow the prompts on the screen. The -b parameter runs the installer in batch mode, so it will not update the .bashrc file (we will do that later), the -u switch updates the Python environment in case /usr/local/python already exists, and -p forces the installation into that folder.
Having passed the required installation steps, we will now update the /etc/hosts files on the remote machines:
function updateHosts() {
  _hostsFile="/etc/hosts"

  # make a copy (if one doesn't already exist)
  if ! [ -f "/etc/hosts.old" ]; then
    sudo cp "$_hostsFile" /etc/hosts.old
  fi

  t="###################################################\n"
  t=$t"#\n"
  t=$t"# IPs of the Spark cluster machines\n"
  t=$t"#\n"
  t=$t"# Script: installOnRemote.sh\n"
  t=$t"# Added on: $_today\n"
  t=$t"#\n"
  t=$t"$_driverNode\n"
  t=$t"$_executors\n"

  sudo printf "$t" >> $_hostsFile
}
This is a simple function that first creates a copy of the /etc/hosts file and then appends the IPs and hostnames of the machines in our cluster. Note that the format required by the /etc/hosts file is the same as in the hosts.txt file we use: one machine per row, with the machine's IP address followed by two spaces and its hostname.
Note
We use two spaces for readability purposes; a single space separating the IP and the hostname would also work.
Also, note that we do not use the echo command here, but printf; the reason is that printf prints a formatted version of the string, properly interpreting the "\n" sequences as newline characters (bash's builtin echo, without the -e switch, would print them literally).
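A quick standalone illustration of the difference (this assumes bash, whose builtin echo does not interpret escape sequences unless given -e):

```shell
s="line1\nline2"

# bash's builtin echo (without -e) prints the \n literally
echo "$s"

# printf interprets the \n, producing a real line break
printf "$s\n"
```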
Next, we configure the passwordless SSH sessions (check the following See also subsection) to aid communication between the driver node and the executors:
function configureSSH() {
  # check if driver node
  IFS=" "
  read -ra temp <<< "$_driverNode"
  _driver_machine=( ${temp[1]} )
  _all_machines="$_driver_machine\n"

  if [ "$_driver_machine" = "$_machine" ]; then
    # generate key pairs (passwordless)
    sudo -u hduser rm -f ~/.ssh/id_rsa
    sudo -u hduser ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    IFS="\n"
    read -ra temp <<< "$_executors"

    for executor in ${temp[@]}; do
      # skip if empty line
      if [[ -z "${executor}" ]]; then
        continue
      fi

      # split on space
      IFS=" "
      read -ra temp_inner <<< "$executor"

      echo
      echo "Trying to connect to ${temp_inner[1]}"

      cat ~/.ssh/id_rsa.pub | ssh "hduser"@"${temp_inner[1]}" 'mkdir -p .ssh && cat >> .ssh/authorized_keys'

      _all_machines=$_all_machines"${temp_inner[1]}\n"
    done
  fi

  echo "Finishing up the SSH configuration"
}
Inside this function, we first check if we are on the driver node, as defined in the hosts.txt file, since we only need to perform these tasks on the driver. The read -ra temp <<< "$_driverNode" command reads the _driverNode (in our case, it is 192.168.1.160  pathfinder) and splits it at the space character (remember what IFS stands for?). The -a switch instructs read to store the split _driverNode string in the temp array, and the -r parameter makes sure that the backslash does not act as an escape character. We store the name of the driver in the _driver_machine variable and append it to the _all_machines string (we will use this later).
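As a standalone sketch of that splitting step (using the example values from the text):

```shell
_driverNode="192.168.1.160  pathfinder"

# split the "IP  hostname" pair on spaces into an array;
# consecutive spaces collapse into a single delimiter,
# so the double space still yields exactly two fields
IFS=" " read -ra temp <<< "$_driverNode"

echo "IP:       ${temp[0]}"
echo "hostname: ${temp[1]}"
```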
If we are executing this script on the driver machine, the first thing we do is remove the old SSH key (using rm with the -f, force, switch) and create a new one. The sudo -u hduser prefix allows us to perform these actions as hduser (instead of the root user).
Note
When we submit the script to run from our local machine, we start an SSH session as a root on the remote machine. You will see how this is done shortly, so take our word on that for now.
We use the ssh-keygen tool to create the SSH key pair. The -t switch selects the key type (we are using RSA keys), the -P switch sets the passphrase (we want this passwordless, so we pass ""), and the -f parameter specifies the filename for storing the keys.
Next, we loop through all the executors: we need to append the contents of ~/.ssh/id_rsa.pub to their ~/.ssh/authorized_keys files. We split _executors at the "\n" character and loop through the results. To deliver the contents of the id_rsa.pub file to an executor, we use the cat tool to print out the file and pipe it to the ssh tool. The first parameter we pass to ssh is the username and hostname we want to connect to; next, we pass the commands we want to execute on the remote machine. First, we attempt to create the .ssh folder if one does not exist; this is followed by appending the id_rsa.pub contents to .ssh/authorized_keys.
Following the SSH session's configuration on the cluster, we download the Spark binaries, unpack them, and move them to _spark_destination.
Note
We have outlined these steps in the Installing Spark from sources and Installing Spark from binaries sections, so we recommend that you check them out.
Finally, we need to set up two Spark configuration files: the spark-env.sh and slaves files:
function updateSparkConfig() {
  cd $_spark_destination/conf

  sudo -u hduser cp spark-env.sh.template spark-env.sh
  echo "export JAVA_HOME=$_java_destination" >> spark-env.sh
  echo "export SPARK_WORKER_CORES=12" >> spark-env.sh

  sudo -u hduser cp slaves.template slaves
  printf "$_all_machines" >> slaves
}
We need to append the JAVA_HOME variable to spark-env.sh so that Spark can find the necessary libraries. We also specify the number of cores per worker to be 12; this is attained by setting the SPARK_WORKER_CORES variable.
Next, we output the hostnames of all the machines in our cluster to the slaves file.
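Since _all_machines starts with the driver's hostname followed by the executors', the resulting slaves file simply lists one hostname per line (note that this means the driver machine will also run a worker). With the example driver from this recipe and two hypothetical executors, it would look like this:

```
pathfinder
executor1
executor2
```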
In order to execute the script on the remote machine, and since we need to run it in an elevated mode (as root, using sudo), we need to encode the script before we send it over the wire. An example of how this is done is as follows (from macOS to remote Linux):
ssh -tq hduser@pathfinder "echo $(base64 -i installOnRemote.sh) | base64 -d | sudo bash"
Or from Linux to remote Linux:
ssh -tq hduser@pathfinder "echo $(base64 -w0 installOnRemote.sh) | base64 -d | sudo bash"
The preceding command uses the base64 tool to encode (not encrypt; base64 offers no secrecy) the installOnRemote.sh script before pushing it over to the remote machine. Once there, we use base64 again to decode the script (the -d switch) and run it as root (via sudo). Note that in order to run this type of script, we also pass the -tq switch to the ssh tool; the -t option forces pseudo-terminal allocation so that we can execute arbitrary screen-based scripts on the remote machine, and the -q option suppresses all messages but those from our script.
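The encode-decode round trip can be sketched locally like this (assuming GNU base64, as found on most Linux distributions; the demo script path is just a placeholder):

```shell
# create a tiny stand-in for installOnRemote.sh
printf 'echo "hello from the remote"\n' > /tmp/demo.sh

# encode the script into a single base64 line ...
encoded=$(base64 -w0 /tmp/demo.sh)

# ... then decode and execute it, mirroring what happens
# on the remote end of the ssh command
echo "$encoded" | base64 -d | bash
```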
Assuming all goes well, once the script finishes executing on all your machines, Spark has been successfully installed and configured on your cluster. However, before you can use Spark, you need to either close the connection to your driver and SSH to it again, or type:
source ~/.bashrc
This is so that the newly created environment variables are available and your PATH is updated.
To start your cluster, you can type:
start-all.sh
And all the machines in the cluster should come to life and be recognized by Spark.
In order to check if everything started properly, type:
jps
And it should return something similar to the following (in our case, we had three machines in our cluster):
40334 Master
41297 Worker
41058 Worker