Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark
In this part, Chapter 3 introduces Apache Spark as a scalable data processing framework, covering its basics, Scala application development, and the Dataset/DataFrame APIs. Chapter 4 explores the role of relational databases in data pipelines, highlighting Spark’s JDBC API. Chapter 5 discusses the rise of data lakes and lakehouses, while Chapter 6 delves into advanced data transformation with Spark. Chapter 7 focuses on data quality, using the Deequ library for checks and metrics.
This part has the following chapters:
- Chapter 3, An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL
- Chapter 4, Working with Databases
- Chapter 5, Object Stores and Data Lakes
- Chapter 6, Understanding Data Transformation
- Chapter 7, Data Profiling and Data Quality
...