How do Spark applications work?
A Spark application runs on a Spark cluster, a connected group of nodes that can be virtual machines (VMs) or bare-metal servers. Architecturally, a cluster has one driver node and one or more executors. The driver controls the executors, sending them the instructions defined in your Spark application; generally, the driver never actually touches the data you are processing. The executors are where the data is manipulated, following the instructions they receive from the driver. This is depicted in the following diagram:
Figure 3.1 – Spark driver and executor architecture
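To make this division of labor concrete, here is a minimal PySpark sketch. The file path and column names are illustrative, not from a real dataset. The key point is that the driver only records the transformations; the executors do the work when an action runs:

```python
from pyspark.sql import SparkSession

# This code runs on the driver: it builds the session and defines
# the transformations, but it does not itself touch the data.
spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

# Illustrative path and columns; substitute your own dataset.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

# Transformations are only recorded by the driver. The executors perform
# the actual filtering and counting once an action triggers execution.
result = df.filter(df["status"] == "ERROR").groupBy("status").count()

result.show()  # action: the driver schedules tasks on the executors
```

When `show()` is called, the driver breaks the job into tasks and distributes them across the executors, each of which processes its own partition of the data.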
Note that the following calculations assume linear scalability, which is not always the case. The actual gain from distributing the work across many nodes depends on the nature of the data and the transformations applied to it.
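As a hypothetical illustration of what linear scalability would mean, consider the following sketch; the runtimes are invented for the example:

```python
# Assumed runtime on a single executor, in minutes (illustrative value).
single_node_minutes = 80
executors = 8

# Under perfectly linear scaling, runtime divides evenly by executor count.
ideal_minutes = single_node_minutes / executors
print(f"Ideal runtime with {executors} executors: {ideal_minutes} minutes")

# In practice, shuffles, skewed partitions, and coordination overhead
# usually make the real runtime higher than this ideal figure.
```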
On open source Spark, you can configure the number of executors...
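As a minimal sketch of one way to do this, the standard Spark properties `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` can be set when building the session (the values below are illustrative, and `spark.executor.instances` applies on resource managers such as YARN or Kubernetes):

```python
from pyspark.sql import SparkSession

# Illustrative settings; the right values depend on your cluster size
# and workload.
spark = (
    SparkSession.builder
    .appName("configured-executors")
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.memory", "4g")    # memory per executor
    .getOrCreate()
)
```

The same properties can also be passed on the command line via `spark-submit --conf`, which keeps the application code free of cluster-specific settings.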