You're reading from Azure Data Engineer Associate Certification Guide A hands-on reference guide to developing your data engineering skills and preparing for the DP-203 exam

Product type Paperback

Published in Feb 2022

Publisher Packt

ISBN-13 9781801816069

Length 574 pages

Edition 1st Edition

Author (1):

Alex

Transforming data by using Apache Spark

Apache Spark supports transformations through three Application Programming Interfaces (APIs): Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. This chapter covers RDD and DataFrame transformations. Datasets extend DataFrames with two additional features: type safety (the compiler checks data types at compile time) and an object-oriented (OO) interface. Because type safety relies on compile-time checks, the typed Dataset API is available only in Scala and Java.

The information in this section applies to all flavors of Spark available on Azure: Synapse Spark, Azure Databricks Spark, and HDInsight Spark.

What are RDDs?

An RDD is an immutable, fault-tolerant collection of data objects that Spark can operate on in parallel. RDDs are the most fundamental data structure in Spark. They can be created from data in a wide variety of formats, such as JSON, comma-separated values (CSV), and Parquet.
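The three properties named above (immutable, fault-tolerant, parallel) can be illustrated with a small plain-Python toy model. This is only a conceptual sketch and not Spark's API: the class ToyRDD and its methods are hypothetical names invented for this illustration, and it runs on a single machine with threads rather than on a cluster. It assumes that fault tolerance comes from remembering the steps (the lineage) used to build each partition, so that a lost partition can be recomputed from its source.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not part of Spark)."""

    def __init__(self, source_partitions, lineage=()):
        # Tuples keep the source data and the recorded steps immutable.
        self._source = tuple(tuple(p) for p in source_partitions)
        self._lineage = tuple(lineage)

    def map(self, fn):
        # A transformation returns a new ToyRDD; the original is untouched.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def _compute_partition(self, index):
        # Rebuild one partition by replaying the lineage on its source rows.
        # Replaying the same steps is how a lost partition could be recovered.
        rows = self._source[index]
        for kind, fn in self._lineage:
            if kind == "map":
                rows = tuple(fn(r) for r in rows)
            else:
                rows = tuple(r for r in rows if fn(r))
        return rows

    def collect(self):
        # Partitions are independent of each other, so they can be
        # computed in parallel and then concatenated in partition order.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self._compute_partition,
                                  range(len(self._source))))
        return [row for part in parts for row in part]


# Two partitions of sample data (made up for this illustration).
numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

print(evens_squared.collect())  # [4, 16, 36]
print(numbers.collect())        # [1, 2, 3, 4, 5, 6] (original unchanged)
```

Nothing is computed until collect() is called, and each transformation produces a new object rather than modifying the existing one. Real Spark RDDs follow the same general idea but distribute partitions across the executors of a cluster.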