You're reading from Data Engineering with Python Work with massive datasets to design data models and automate data pipelines using Python

Product type Paperback

Published in Oct 2020

Publisher Packt

ISBN-13 9781839214189

Length 356 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Paul Crickard

Table of Contents (21) Chapters

Preface

1. Section 1: Building Data Pipelines – Extract, Transform, and Load

2. Chapter 1: What is Data Engineering?

3. Chapter 2: Building Our Data Engineering Infrastructure

4. Chapter 3: Reading and Writing Files

5. Chapter 4: Working with Databases

6. Chapter 5: Cleaning, Transforming, and Enriching Data

7. Chapter 6: Building a 311 Data Pipeline

8. Section 2: Deploying Data Pipelines in Production

9. Chapter 7: Features of a Production Pipeline

10. Chapter 8: Version Control with the NiFi Registry

11. Chapter 9: Monitoring Data Pipelines

12. Chapter 10: Deploying Data Pipelines

13. Chapter 11: Building a Production Data Pipeline

14. Section 3: Beyond Batch – Building Real-Time Data Pipelines

15. Chapter 12: Building a Kafka Cluster

16. Chapter 13: Streaming Data with Apache Kafka

17. Chapter 14: Data Processing with Apache Spark

18. Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark

19. Other Books You May Enjoy

Appendix

Building data pipelines in Apache Airflow

In the previous chapter, you built your first Airflow data pipeline using a Bash and Python operator. This time, you will combine two Python operators to extract data from PostgreSQL, save it as a CSV file, then read it in and write it to an Elasticsearch index. The complete pipeline is shown in the following screenshot:

Figure 4.6 – Airflow DAG

The preceding Directed Acyclic Graph (DAG) looks very simple: it has only two tasks, and you could combine them into a single function. This is not a good idea. In Section 2, Deploying Data Pipelines in Production, you will learn about modifying your data pipelines for production. A key tenet of production pipelines is that each task should be atomic; that is, each task should be able to stand on its own. If a single function both read from the database and inserted the results, then when it failed, you would have to track down whether the query or the insert was at fault. As...
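
The idea of atomic tasks can be sketched outside of Airflow with two plain Python functions. This is only an illustration under stated assumptions, not the pipeline built in this chapter: the standard library's sqlite3 stands in for PostgreSQL, an in-memory list stands in for the Elasticsearch index, and the users table, its columns, and the function names query_to_csv and csv_to_index are all hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def query_to_csv(conn, csv_path):
    """Task 1: read every row from the source table and save it as a CSV file."""
    cursor = conn.execute("SELECT name, city FROM users ORDER BY name")
    header = [col[0] for col in cursor.description]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cursor)


def csv_to_index(csv_path, index):
    """Task 2: read the CSV file and add each row, as a dict, to the index."""
    added = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            index.append(row)
            added += 1
    return added


# Assumption: sqlite3 stands in for PostgreSQL, with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York")],
)

# Assumption: a plain list stands in for the Elasticsearch index.
index = []
csv_path = os.path.join(tempfile.mkdtemp(), "users.csv")

# The CSV file is the only thing the two tasks share, so each one
# can be run, retried, or debugged without the other.
query_to_csv(conn, csv_path)
loaded = csv_to_index(csv_path, index)
conn.close()
```

Because the CSV file is the only hand-off between the two functions, an exception raised in query_to_csv points at the query, while one raised in csv_to_index points at the load step. The second function can also be rerun against the saved file without querying the database again.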