The end-to-end pipeline
Before we write any code, we need to consider the following:
- Data is loaded daily, and our process runs after that load completes. Do we schedule our processing, or trigger it?
- We will need a way to specify the date we'll be processing (see the entry-point sketch after this list)
- We need to consider the possibility of having to reprocess a date if there is an issue with the original dataset
- Do we need to write a mechanism to process more than one day?
- Do we need to write a mechanism to reprocess the whole dataset?
- What data quality rules should be put in place?
- How and where are we going to transform our data?
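The following is a minimal sketch of what a date-driven entry point could look like, assuming the job is packaged as an object called `DailyPipeline` and launched with `spark-submit`. The object name, the argument names (`--date`, `--start-date`, `--end-date`), and the `processDate` placeholder are illustrative choices, not a fixed contract; the point is that a single date, an ad hoc reprocess, and a full backfill can all share one code path:

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object DailyPipeline {

  def main(args: Array[String]): Unit = {
    // Parse "--key value" pairs into a simple map (illustrative, not a full CLI parser).
    val params = args.sliding(2, 2).collect {
      case Array(key, value) => key.stripPrefix("--") -> value
    }.toMap

    // Either a single --date, or a --start-date/--end-date pair for reprocessing a range.
    val dates: Seq[LocalDate] =
      (params.get("date"), params.get("start-date"), params.get("end-date")) match {
        case (Some(d), _, _) =>
          Seq(LocalDate.parse(d))
        case (None, Some(s), Some(e)) =>
          val start = LocalDate.parse(s)
          val end   = LocalDate.parse(e)
          Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end)).toSeq
        case _ =>
          sys.error("Expected --date yyyy-MM-dd, or --start-date and --end-date")
      }

    val spark = SparkSession.builder()
      .appName("daily-pipeline")
      .getOrCreate()

    try {
      // Processing one date at a time keeps the daily run, a single-day
      // reprocess, and a whole-dataset backfill on the same code path.
      dates.foreach(processDate(spark, _))
    } finally {
      spark.stop()
    }
  }

  // Placeholder for the read-transform-write logic we build out later.
  def processDate(spark: SparkSession, date: LocalDate): Unit = {
    println(s"Processing partition for $date")
  }
}
```

A normal daily run would pass something like `--date 2023-06-01`, while a reprocess of a bad week or a full-history backfill would pass `--start-date` and `--end-date` instead, without any change to the processing logic.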
Here’s a high-level overview of how we can structure our Spark/Scala application to meet these requirements:
- Scheduling or triggering: We can use a scheduling tool such as ADF, Argo, Apache Airflow, or cron jobs to trigger our Spark application daily after data loading is complete. We’ll show an example in Argo in the Orchestrating our...