Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala

Product type Paperback

Published in Jan 2024

Publisher Packt

ISBN-13 9781804612583

Length 300 pages

Edition 1st Edition

Languages

Scala

Concepts

Data Engineering

Authors (3):

Eric Tome

Rupam Bhattacharjee

David Radford

Table of Contents (21 Chapters)

Preface

1. Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup

2. Chapter 1: Scala Essentials for Data Engineers

3. Chapter 2: Environment Setup

4. Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark

5. Chapter 3: An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

6. Chapter 4: Working with Databases

7. Chapter 5: Object Stores and Data Lakes

8. Chapter 6: Understanding Data Transformation

9. Chapter 7: Data Profiling and Data Quality

10. Part 3 – Software Engineering Best Practices for Data Engineering in Scala

11. Chapter 8: Test-Driven Development, Code Health, and Maintainability

12. Chapter 9: CI/CD with GitHub

13. Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning

14. Chapter 10: Data Pipeline Orchestration

15. Chapter 11: Performance Tuning

16. Part 5 – End-to-End Data Pipelines

17. Chapter 12: Building Batch Pipelines Using Spark and Scala

18. Chapter 13: Building Streaming Pipelines Using Spark and Scala

19. Index

20. Other Books You May Enjoy

Introducing TDD

Test-driven development (TDD) is a broad topic that deserves its own book. However, we will cover the basics so that you can apply TDD to your Scala data engineering projects.

One essential aspect of TDD in data engineering is testing the data transformations within the pipelines you create. This involves writing unit tests that verify the correctness of transformations, aggregations, filters, and other data manipulation operations. Unit tests also help ensure that the code you create or change doesn’t break existing processes built by you or anyone else on your team.
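As a minimal sketch of such unit tests, the following uses plain Scala collections in place of Spark DataFrames so that it runs without any dependencies. The `Order` type and the functions `validOrders` and `totalByCustomer` are hypothetical names introduced only for illustration. In a TDD workflow, the assertions in `main` would be written first and the functions implemented until they pass; a real project would typically use a test framework instead of bare `assert` calls.

```scala
// Hypothetical record type, used only for illustration.
final case class Order(customer: String, amount: Double)

object OrderTransformations {
  // Filter: keep only orders with a positive amount.
  def validOrders(orders: Seq[Order]): Seq[Order] =
    orders.filter(_.amount > 0)

  // Aggregation: total amount per customer.
  def totalByCustomer(orders: Seq[Order]): Map[String, Double] =
    orders.groupBy(_.customer).map { case (customer, group) =>
      customer -> group.map(_.amount).sum
    }
}

object OrderTransformationsCheck {
  import OrderTransformations._

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Order("a", 10.0),
      Order("a", 5.0),
      Order("b", -1.0), // invalid: should be filtered out
      Order("b", 2.5)
    )

    // The filter removes only the non-positive order and preserves input order.
    assert(validOrders(input) == Seq(Order("a", 10.0), Order("a", 5.0), Order("b", 2.5)))

    // The aggregation sums the remaining amounts per customer.
    assert(totalByCustomer(validOrders(input)) == Map("a" -> 15.0, "b" -> 2.5))

    // Edge case: empty input yields empty output.
    assert(validOrders(Seq.empty).isEmpty)
  }
}
```

Each assertion checks one behavior (a filter, an aggregation, an edge case), so a failing check points directly at the transformation that regressed.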

To accomplish this, it is important to write code that is easily testable. You can do this by creating functions that each perform a single action and then composing those functions to build your applications. Small, single-purpose functions are easy to test and refactor in isolation, which helps keep the code base healthy and maintainable.
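The idea of composing single-purpose functions can be sketched as follows. This is an illustrative example on plain Scala collections, not code from the book; the names `trimAll`, `dropEmpty`, `lowerCase`, and `cleanNames` are hypothetical.

```scala
object ComposedPipeline {
  // Each function performs exactly one action on a sequence of strings.
  val trimAll: Seq[String] => Seq[String] = _.map(_.trim)
  val dropEmpty: Seq[String] => Seq[String] = _.filter(_.nonEmpty)
  val lowerCase: Seq[String] => Seq[String] = _.map(_.toLowerCase)

  // The "application" is just the composition of the small steps,
  // applied left to right: trim, then drop empties, then lowercase.
  val cleanNames: Seq[String] => Seq[String] =
    trimAll.andThen(dropEmpty).andThen(lowerCase)

  def main(args: Array[String]): Unit = {
    // Each step can be tested on its own...
    assert(trimAll(Seq(" a ")) == Seq("a"))
    assert(dropEmpty(Seq("a", "")) == Seq("a"))
    assert(lowerCase(Seq("AbC")) == Seq("abc"))

    // ...and the composed pipeline can be tested end to end.
    assert(cleanNames(Seq("  Alice ", "", " BOB")) == Seq("alice", "bob"))
  }
}
```

Because every step has the same input and output type, steps can be reordered, replaced, or removed without touching the others. With Spark, the same pattern is commonly expressed by writing functions from one Dataset to another and chaining them with `Dataset.transform`.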

...