You're reading from Data Engineering with Scala and Spark Build streaming and batch pipelines that process massive amounts of data using Scala

Product type Paperback

Published in Jan 2024

Publisher Packt

ISBN-13 9781804612583

Length 300 pages

Edition 1st Edition

Languages

Scala

Concepts

Data Engineering

Authors (3):

Rupam Bhattacharjee

David Radford

Eric Tome


Table of Contents (21) Chapters

Preface

1. Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup

2. Chapter 1: Scala Essentials for Data Engineers

3. Chapter 2: Environment Setup

4. Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark

5. Chapter 3: An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

6. Chapter 4: Working with Databases

7. Chapter 5: Object Stores and Data Lakes

8. Chapter 6: Understanding Data Transformation

9. Chapter 7: Data Profiling and Data Quality

10. Part 3 – Software Engineering Best Practices for Data Engineering in Scala

11. Chapter 8: Test-Driven Development, Code Health, and Maintainability

12. Chapter 9: CI/CD with GitHub

13. Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning

14. Chapter 10: Data Pipeline Orchestration

15. Chapter 11: Performance Tuning

16. Part 5 – End-to-End Data Pipelines

17. Chapter 12: Building Batch Pipelines Using Spark and Scala

18. Chapter 13: Building Streaming Pipelines Using Spark and Scala

19. Index

Why subscribe?

20. Other Books You May Enjoy

What’s our IoT use case?

We work for a telecommunications company that has various types of devices installed at customers’ homes and businesses. All of these devices are attached to our network, and every minute, we receive a status update from each one. Each update reports one of the following statuses:

Activation
Deactivation
Plan change
Telecoms activity
Internet activity
Device error
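One way to make the six statuses above explicit in code is a small sealed hierarchy. The following is a minimal sketch, not the book's implementation: the `DeviceStatus` name, the `parse` helper, and the assumption that statuses arrive as the plain-text labels shown in the list are all hypothetical.

```scala
// Hypothetical model of the six device statuses listed above.
sealed trait DeviceStatus

object DeviceStatus {
  case object Activation extends DeviceStatus
  case object Deactivation extends DeviceStatus
  case object PlanChange extends DeviceStatus
  case object TelecomsActivity extends DeviceStatus
  case object InternetActivity extends DeviceStatus
  case object DeviceError extends DeviceStatus

  // Assumption: the raw value is the label from the list, compared
  // case-insensitively after trimming. Unknown labels yield None.
  def parse(raw: String): Option[DeviceStatus] =
    raw.trim.toLowerCase match {
      case "activation"        => Some(Activation)
      case "deactivation"      => Some(Deactivation)
      case "plan change"       => Some(PlanChange)
      case "telecoms activity" => Some(TelecomsActivity)
      case "internet activity" => Some(InternetActivity)
      case "device error"      => Some(DeviceError)
      case _                   => None
    }
}
```

Because the trait is sealed, the compiler can warn when a pattern match over `DeviceStatus` misses a case, which helps if a new status is added later.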

Our operations team will collect this data from our devices and load each status update as an event into Azure Event Hubs. They have enabled the Apache Kafka-compatible endpoint on Azure Event Hubs so that we can use Spark’s Kafka connector to read that data into our data platform for analytics.
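To illustrate what reading from Event Hubs through its Kafka endpoint involves, here is a hedged sketch of the options Spark's Kafka source would need. The `EventHubsKafka` object, the `kafkaOptions` helper, and all parameter values are hypothetical placeholders; the settings themselves (port 9093, `SASL_SSL`, `PLAIN`, and the literal `$ConnectionString` username) follow the usual Event Hubs Kafka configuration.

```scala
// Hypothetical helper that builds Spark Kafka-source options for an
// Event Hubs namespace. Nothing here opens a connection.
object EventHubsKafka {
  def kafkaOptions(
      namespace: String,        // Event Hubs namespace name (placeholder)
      eventHub: String,         // event hub name, used as the Kafka topic
      connectionString: String  // namespace or hub connection string (secret)
  ): Map[String, String] = {
    // Event Hubs expects the literal user name "$ConnectionString" and the
    // connection string itself as the password.
    val jaas =
      "org.apache.kafka.common.security.plain.PlainLoginModule required " +
        "username=\"$ConnectionString\" " +
        "password=\"" + connectionString + "\";"

    Map(
      "kafka.bootstrap.servers" -> s"${namespace}.servicebus.windows.net:9093",
      "kafka.security.protocol" -> "SASL_SSL",
      "kafka.sasl.mechanism"    -> "PLAIN",
      "kafka.sasl.jaas.config"  -> jaas,
      "subscribe"               -> eventHub,
      "startingOffsets"         -> "latest" // assumption: only new events are needed
    )
  }
}

// Usage with an existing SparkSession named `spark` (requires the
// spark-sql-kafka connector on the classpath):
//   val raw = spark.readStream
//     .format("kafka")
//     .options(EventHubsKafka.kafkaOptions("my-namespace", "device-events", secret))
//     .load()
```

In practice the connection string would come from a secret store rather than source code, and some managed runtimes require a shaded class name for the login module.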

The data will be used in three different ways. The first is ad hoc analysis against our Silver layer data, which is structured and deduplicated. The second is to identify the total number of device states for each day. The third is to identify the current state of any...
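The deduplication and daily-count ideas above can be sketched with plain Scala collections as a stand-in for the Spark logic. This is an illustration under stated assumptions, not the chapter's pipeline: the `DeviceEvent` shape, the definition of a duplicate (identical device, status, and timestamp), and the use of UTC days are all hypothetical.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Hypothetical event shape; the real schema is not shown in this excerpt.
final case class DeviceEvent(deviceId: String, status: String, eventTime: Instant)

object DeviceStateCounts {
  // Assumption: two events are duplicates when every field is identical.
  def dedupe(events: Seq[DeviceEvent]): Seq[DeviceEvent] = events.distinct

  // Count deduplicated events per (UTC calendar day, status).
  def dailyCounts(events: Seq[DeviceEvent]): Map[(LocalDate, String), Int] =
    dedupe(events)
      .groupBy(e => (e.eventTime.atOffset(ZoneOffset.UTC).toLocalDate, e.status))
      .map { case (key, group) => key -> group.size }
}
```

In Spark, the same shape would typically map to a `dropDuplicates` step followed by a `groupBy` on a date column and the status column.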
