Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Hands-On Big Data Analytics with PySpark

You're reading from   Hands-On Big Data Analytics with PySpark Analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

Arrow left icon
Product type Paperback
Published in Mar 2019
Publisher Packt
ISBN-13 9781838644130
Length 182 pages
Edition 1st Edition
Languages
Tools
Concepts
Arrow right icon
Authors (3):
Arrow left icon
James Cross James Cross
Author Profile Icon James Cross
James Cross
Bartłomiej Potaczek Bartłomiej Potaczek
Author Profile Icon Bartłomiej Potaczek
Bartłomiej Potaczek
Rudy Lai Rudy Lai
Author Profile Icon Rudy Lai
Rudy Lai
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Installing Pyspark and Setting up Your Development Environment FREE CHAPTER 2. Getting Your Big Data into the Spark Environment Using RDDs 3. Big Data Cleaning and Wrangling with Spark Notebooks 4. Aggregating and Summarizing Data into Useful Reports 5. Powerful Exploratory Data Analysis with MLlib 6. Putting Structure on Your Big Data with SparkSQL 7. Transformations and Actions 8. Immutable Design 9. Avoiding Shuffle and Reducing Operational Expenses 10. Saving Data in the Correct Format 11. Working with the Spark Key/Value API 12. Testing Apache Spark Jobs 13. Leveraging the Spark GraphX API 14. Other Books You May Enjoy

What this book covers

Chapter 1, Installing Pyspark and Setting up Your Development Environment, covers the installation of PySpark and learning about core concepts in Spark, including resilient distributed datasets (RDDs), SparkContext, and Spark tools, such as SparkConf and SparkShell.

Chapter 2, Getting Your Big Data into the Spark Environment Using RDDs, explains how to get your big data into the Spark environment using RDDs using a wide array of tools to interact and modify this data so that useful insights can be extracted.

Chapter 3, Big Data Cleaning and Wrangling with Spark Notebooks, covers how to use Spark in notebook applications, thereby facilitating the effective use of RDDs.

Chapter 4, Aggregating and Summarizing Data into Useful Reports, describes how to calculate averages with the map and reduce function, perform faster average computation, and use a pivot table with key/value pair data points.

Chapter 5, Powerful Exploratory Data Analysis with MLlib, examines Spark's ability to perform regression tasks with models including linear regression and SVMs.

Chapter 6, Putting Structure on Your Big Data with SparkSQL, explains how to manipulate DataFrames with Spark SQL schemas, and use the Spark DSL to build queries for structured data operations.

Chapter 7, Transformations and Actions, looks at Spark transformations to defer computations and then considers transformations that should be avoided. We will also use the reduce and reduceByKey methods to carry out calculations from a dataset.

Chapter 8, Immutable Design, explains how to use DataFrame operations for transformations with a view to discussing immutability in a highly concurrent environment.

Chapter 9, Avoid Shuffle and Reduce Operational Expenses, covers shuffling and the operations of Spark API that should be used. We will then test operations that cause a shuffle in Apache Spark to know which operations should be avoided.

Chapter 10, Saving Data in the Correct Format, explains how to save data in the correct format and also save data in plain text using Spark's standard API.

Chapter 11, Working with the Spark Key/Value API, discusses the transformations available on key/value pairs. We will look at actions on key/value pairs and look at the available partitioners on key/value data.

Chapter 12, Testing Apache Spark Jobs, goes into further detail about testing Apache Spark jobs in different versions of Spark.

Chapter 13, Leveraging the Spark GraphX API, covers how to leverage Spark GraphX API. We will carry out experiments with the Edge API and Vertex API.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image