What This Book Covers
This book is aligned with the revised syllabus of Exam DP-203: Azure Data Engineer Associate Certification and comprises the following chapters:
Chapter 1, Introducing Azure Basics, introduces you to Azure and explains its capabilities. This is a refresher chapter designed to renew your knowledge of some of the core Azure concepts, including VMs, data storage, compute options, the Azure portal, accounts, and subscriptions. You will build on these technologies in later chapters.
Chapter 2, Implementing a Partition Strategy, will explore the implementation of partition strategies for efficient data management. You will delve into strategies for optimizing analytical workloads through data partitioning and discuss approaches to improve performance for streaming workloads. Additionally, you will examine the utilization of partitioning within Azure Synapse Analytics for enhanced data processing, and identify scenarios where partitioning is necessary in ADLS Gen2 for improved data organization and processing.
Chapter 3, Designing and Implementing the Data Exploration Layer, will focus on creating and executing queries using SQL Serverless and Spark cluster technologies. You will also review database templates in Azure Synapse Analytics and their implementation as part of this exploration. Additionally, you will learn to push new or updated data lineage to Microsoft Purview and explore the importance of searching and browsing metadata in the Microsoft Purview data catalog for effective data management.
Chapter 4, Ingesting and Transforming Data, will focus on designing and implementing incremental loads for efficient data ingestion. You will utilize Apache Spark, Transact-SQL (T-SQL) in Azure Synapse Analytics, Stream Analytics, and ADF for data transformations. You will also look into the various aspects of data pipelines, such as cleansing data, parsing data, encoding, and decoding data, and normalizing and denormalizing values. Additionally, you will focus on configuring error handling for transformations, including handling duplicate, missing, and late-arriving data. Finally, you will delve into performing exploratory analysis for effective data analysis.
Chapter 5, Developing a Batch Processing Solution, will utilize a combination of Azure Data Lake Storage, ADB, Azure Synapse Analytics, and ADF. You will use PolyBase to load data into an SQL pool and implement Azure Synapse Link for efficient data loading. Additionally, you will learn how to create and test data pipelines, integrate notebooks, and configure batch retention as part of your data pipeline development. Error handling is examined as well, including managing upserted data, reverting data to a previous state, and configuring exception handling for robust data processing.
Chapter 6, Developing a Stream Processing Solution, will focus on creating solutions using Stream Analytics and Azure Event Hubs for real-time data processing. You will use Spark Structured Streaming for data processing. Additionally, you will address schema management, including handling schema drift and managing time series data effectively. Finally, you will learn about pipeline optimization techniques, such as configuring checkpoints, watermarking, and optimizing pipelines for analytical and transactional purposes.
Chapter 7, Managing Batches and Pipelines, will cover triggering batch loads and handling failed batch loads to ensure data integrity. For pipeline management, you will focus on managing and scheduling data pipelines using ADF and Azure Synapse Pipelines. Additionally, you will learn how to implement version control for pipeline artifacts to track changes effectively and explore how to manage Spark jobs within a pipeline.
Chapter 8, Implementing Data Security, will explore strategies for data masking and encryption to ensure data protection and will focus on how to design and implement data encryption (both at rest and in transit), data auditing, data masking, and data retention. You will implement security controls such as row-level security, column-level security, and Azure RBAC to restrict access effectively. Additionally, you will cover access management, including managing POSIX-like Access Control Lists (ACLs) for Data Lake Storage Gen2 and securing endpoints to control data access. Finally, you will address sensitive data management, including handling sensitive information within DataFrames and managing encrypted data for enhanced security.
Chapter 9, Monitoring Data Storage and Data Processing, covers the implementation of logging used by Azure Monitor, focusing on setting up and utilizing its features to track the activities and health of Azure services effectively. You will explore the performance of data movement processes within Azure services and monitor and update statistics about data across a system to reflect its current state accurately. You will delve into monitoring data pipeline performance, identifying bottlenecks and ensuring smooth data flow, and you will learn how to interpret Azure Monitor metrics and logs to make informed decisions. Finally, you will implement a pipeline alert strategy for prompt responses to potential issues.
Chapter 10, Optimizing and Troubleshooting Data Storage and Data Processing, will explore strategies for compacting small files to improve processing efficiency and system performance. You will review techniques for handling skew in data distribution to mitigate processing delays, explore ways to manage data spillage and optimize resource management to maximize performance, use indexers to reduce data search times, and use caching to speed up query execution. Additionally, you will learn how to troubleshoot failed Spark jobs by diagnosing and resolving the issues that cause them to fail, and how to troubleshoot failed pipeline runs (including activities executed in external services), with insights on identifying and fixing problems to ensure smooth pipeline execution.
Minimum Hardware Requirements
For an optimal experience, the following hardware configuration is recommended:
- Processor: Dual-core or better
- Memory: 4 GB RAM
- Storage: 10 GB available space
Minimum Software Requirements
You must have the following software installed:
| Chapter | Software Required | OS Required |
| --- | --- | --- |
| 1–10 | Azure account (free or paid) | Windows, macOS, and Linux |
| 1–10 | Azure Command-Line Interface (CLI) | Windows, macOS, and Linux |
| 1–10 | Visual Studio Code (VS Code) | Windows, macOS, and Linux |
Note
You can find the Azure CLI installation link on GitHub as part of Chapter 1, Introducing Azure Basics, at https://packt.link/muMNE.