Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala

Product type Paperback

Published in Jan 2024

Publisher Packt

ISBN-13 9781804612583

Length 300 pages

Edition 1st Edition

Languages

Scala

Concepts

Data Engineering

Authors (3):

Rupam Bhattacharjee

David Radford

Eric Tome

Ingesting the data

Before we move on to our ingestion code, you need to set up a source to ingest from. There are multiple ways to set up a Kafka service as a source for our process; we chose Azure Event Hubs, but you should use whichever service is most convenient for you. If you decide to use Azure Event Hubs, set up the service by following the instructions at https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quickstart-kafka-enabled-event-hubs.
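To make the Event Hubs option more concrete, here is a minimal sketch of the connection settings a Kafka client typically needs for an Event Hubs Kafka endpoint: the namespace host on port 9093, SASL_SSL with the PLAIN mechanism, and the literal username `$ConnectionString` with the connection string as the password. The object name `EventHubsKafkaOptions`, the `build` helper, and its parameters are hypothetical names chosen for illustration; they are not taken from the book's repository, and the option keys assume Spark's Kafka source.

```scala
// Hypothetical helper (not from the book's repository): builds the options
// that Spark's Kafka source would need to read from an Event Hubs namespace.
object EventHubsKafkaOptions {

  def build(
      namespace: String,        // assumed: your Event Hubs namespace name
      topic: String,            // assumed: the event hub (Kafka topic) name
      connectionString: String  // assumed: a connection string for the namespace
  ): Map[String, String] = {
    // Event Hubs expects the literal user name "$ConnectionString";
    // the password is the connection string itself.
    val jaasConfig =
      "org.apache.kafka.common.security.plain.PlainLoginModule required " +
        "username=\"$ConnectionString\" " +
        "password=\"" + connectionString + "\";"

    Map(
      // The Kafka endpoint of an Event Hubs namespace listens on port 9093.
      "kafka.bootstrap.servers" -> s"$namespace.servicebus.windows.net:9093",
      "kafka.security.protocol" -> "SASL_SSL",
      "kafka.sasl.mechanism"    -> "PLAIN",
      "kafka.sasl.jaas.config"  -> jaasConfig,
      "subscribe"               -> topic
    )
  }
}
```

Assuming a `SparkSession` named `spark` and the `spark-sql-kafka-0-10` connector on the classpath, a map like this could be passed to `spark.readStream.format("kafka").options(...).load()`. If you use a different Kafka service, the bootstrap servers and authentication settings will differ. Avoid hardcoding the connection string; read it from a secret store or an environment variable instead.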

Once you have created an Event Hubs/Kafka service, create the namespace and topic that will be used in this example. You’ll also have to load data into that topic so that our ingestion process has something to consume. The code for this is located in our GitHub repository, in the Chapter 13 data_generator folder.

The code will be excluded from the text of this book as it’s written in Python. The main reason...