You're reading from Hands-On Big Data Analytics with PySpark: Analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

Product type Paperback

Published in Mar 2019

Publisher Packt

ISBN-13 9781838644130

Length 182 pages

Edition 1st Edition

Languages

Python

Tools

PySpark

Concepts

Big Data

Authors (3):

James Cross

Bartłomiej Potaczek

Rudy Lai

Chapter 1, Installing PySpark and Setting up Your Development Environment, covers the installation of PySpark and introduces core Spark concepts, including resilient distributed datasets (RDDs), SparkContext, and Spark tools such as SparkConf and the Spark shell.

Chapter 2, Getting Your Big Data into the Spark Environment Using RDDs, explains how to load big data into the Spark environment using RDDs, and introduces a wide array of tools for interacting with and modifying this data so that useful insights can be extracted.

Chapter 3, Big Data Cleaning and Wrangling with Spark Notebooks, covers how to use Spark in notebook applications, thereby facilitating the effective use of RDDs.

Chapter 4, Aggregating and Summarizing Data into Useful Reports, describes how to calculate averages with the map and reduce functions, perform faster average computation, and use a pivot table with key/value pair data points.
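As a rough illustration of the map-and-reduce average mentioned for Chapter 4, the sketch below uses plain Python's `map` and `functools.reduce` as stand-ins for the RDD methods of the same names. The `average` helper and the sample values are hypothetical and are not taken from the book; on an RDD, the same pattern would be written with `rdd.map(...)` followed by `rdd.reduce(...)`.

```python
from functools import reduce


def average(values):
    """Average via map (value -> (value, 1)) and reduce (pairwise sums)."""
    # Map step: pair every value with a count of 1.
    pairs = map(lambda v: (v, 1), values)
    # Reduce step: add up the values and the counts in a single pass.
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
    return total / count


sample = [2.0, 4.0, 6.0, 8.0]  # hypothetical data, for illustration only
result = average(sample)
print(result)
```

Carrying the count alongside the running sum means the data is traversed only once, which is the usual motivation for this pattern on distributed collections.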

Chapter 5, Powerful Exploratory Data Analysis with MLlib, examines Spark's ability to perform regression tasks with models including linear regression and SVMs.

Chapter 6, Putting Structure on Your Big Data with SparkSQL, explains how to manipulate DataFrames with Spark SQL schemas, and use the Spark DSL to build queries for structured data operations.

Chapter 7, Transformations and Actions, looks at how Spark transformations defer computation and then considers transformations that should be avoided. We will also use the reduce and reduceByKey methods to carry out calculations on a dataset.
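To make the difference between the two methods named for Chapter 7 concrete, here is a minimal plain-Python sketch of their semantics. It does not use Spark: `reduce_by_key` is a hypothetical helper that imitates what `reduceByKey` computes on a pair RDD, and the `sales` data is invented for illustration.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Merge the values of each key with func, imitating reduceByKey."""
    merged = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are merged.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


sales = [("a", 3), ("b", 5), ("a", 4)]  # hypothetical (key, value) pairs

# reduce collapses the whole dataset into a single value.
total = reduce(lambda x, y: x + y, (value for _, value in sales))

# The reduceByKey-style helper produces one merged value per key.
per_key = reduce_by_key(sales, lambda x, y: x + y)

print(total)
print(per_key)
```

In Spark itself, `reduce` is an action that returns a value to the driver, while `reduceByKey` is a transformation that yields a new RDD of key/value pairs.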

Chapter 8, Immutable Design, explains how to use DataFrame operations for transformations and discusses immutability in a highly concurrent environment.

Chapter 9, Avoid Shuffle and Reduce Operational Expenses, covers shuffling and the Spark API operations that should be used. We will then test operations that cause a shuffle in Apache Spark to learn which operations should be avoided.

Chapter 10, Saving Data in the Correct Format, explains how to save data in the correct format and how to save data in plain text using Spark's standard API.

Chapter 11, Working with the Spark Key/Value API, discusses the transformations available on key/value pairs. We will also look at actions on key/value pairs and at the available partitioners for key/value data.

Chapter 12, Testing Apache Spark Jobs, goes into further detail about testing Apache Spark jobs in different versions of Spark.

Chapter 13, Leveraging the Spark GraphX API, covers how to leverage the Spark GraphX API. We will carry out experiments with the Edge API and Vertex API.