Data Ingestion with Python Cookbook
Applying data governance in ingestion

Data governance is a set of methodologies that ensure that data is secure, available, well-stored, documented, private, and accurate.

Getting ready

Data ingestion is the first stage of the data pipeline, but that doesn't mean data governance is applied any less heavily here. The governance status of the final pipeline output depends on how governance was implemented during ingestion.

The following diagram shows how data ingestion is commonly conducted:

Figure 1.18 – The data ingestion process

Let’s analyze the steps in the diagram:

  1. Getting data from the source: The first step is to define the type of data, its periodicity, where we will gather it from, and why we need it.
  2. Writing the scripts to ingest data: Based on the answers from the previous step, we can start planning how our code will behave and outline its basic steps.
  3. Storing data in a temporary database or other types of storage: Between the ingestion and the transformation phases, data is typically stored in a temporary database or repository. A minimal code sketch of these three steps follows this list.
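The following is a minimal, hypothetical sketch of these three steps. The REST endpoint, the daily periodicity, and the local staging directory are assumptions made for illustration, not a recipe from this book:

# Minimal sketch: define the source (1), ingest it with a script (2),
# and land the raw data in temporary storage (3). All names are illustrative.
import json
from datetime import date
from pathlib import Path

import requests

SOURCE_URL = "https://api.example.com/v1/orders"  # assumed source and daily periodicity
STAGING_DIR = Path("/tmp/staging/orders")         # assumed temporary storage location


def ingest_daily_orders() -> Path:
    """Fetch raw data from the source and store it in the staging area."""
    response = requests.get(SOURCE_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    target = STAGING_DIR / f"orders_{date.today().isoformat()}.json"
    target.write_text(json.dumps(records))
    return target


if __name__ == "__main__":
    print(f"Raw data staged at {ingest_daily_orders()}")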
Figure 1.19 – Data governance pillars

How to do it…

Step by step, let’s attribute the pillars in Figure 1.19 to the ingestion phase:

  1. A concern for accessibility needs to be applied at the data source level, defining the individuals who are allowed to see or retrieve data (a minimal sketch of such a check appears after Figure 1.20).
  2. Next, it is necessary to catalog our data to understand it better. Since we are only covering data ingestion here, it is most relevant to catalog the data sources.
  3. The quality pillar will be applied to the ingestion and staging areas, where we control the data and keep its quality aligned with the source.
  4. Then, let's define ownership. We know the data source belongs to a business area or a company. However, once we ingest the data and put it into temporary or staging storage, it becomes our responsibility to maintain it.
  5. The last pillar involves keeping data secure throughout the whole pipeline. Security is vital in all steps, since we may be handling private or sensitive information.
Figure 1.20 – Adding to data ingestion
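To make the accessibility pillar from step 1 concrete, here is a minimal sketch that gates a source read behind an allow-list of roles. The role names and the placeholder retrieval function are hypothetical; in practice, accessibility is usually enforced by the source system itself (database grants, IAM policies) rather than re-implemented in the ingestion code:

# Illustrative only: the roles and the data source are made up for this example.
ALLOWED_ROLES = {"data_engineer", "analytics_service"}


def read_from_source(query: str, role: str) -> list[dict]:
    """Only roles on the allow-list may retrieve data from the source."""
    if role not in ALLOWED_ROLES:
        raise PermissionError(f"Role '{role}' is not allowed to access this source")
    # Placeholder for the actual retrieval (e.g., a database query or API call).
    return [{"query": query, "rows": []}]


# Example usage:
# read_from_source("SELECT * FROM orders", role="data_engineer")  # permitted
# read_from_source("SELECT * FROM orders", role="intern")         # raises PermissionError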

How it works…

While some articles define “pillars” for establishing good governance practices, the best way to understand how to apply them is to understand what they comprise. As you saw in the previous How to do it… section, we attributed these items to our pipeline, and now we can understand how they are connected to the following topics:

  • Data accessibility: Data accessibility is how people from a group, organization, or project can see and use data. The information needs to be readily available for use, but only to the people involved in the process. For example, access to sensitive data should be restricted to specific people or programs. In the diagram we built, we applied this pillar to our data sources, since we need to understand and retrieve the data. For the same reason, it can be applied to temporary storage as well.
  • Data catalog: Cataloging and documenting data are essential for business and engineering teams. When we know what types of information reside in our databases or data lakes and have quick access to their documentation, the time needed to solve a problem becomes much shorter.

Likewise, documenting our data sources makes the ingestion process quicker, since we avoid having to run a discovery exercise every time we need to ingest data.

  • Data quality: Quality is a constant concern when ingesting, processing, and loading data. It is essential to track and monitor the data’s expected input and output volumes according to its periodicity. For example, if we expect to ingest 300 GB of data per day and it suddenly drops to 1 GB, something is very wrong and will affect the quality of our final output (a minimal volume-check sketch follows this list). Other quality parameters can be the number of columns, partitioning, and so on, which we will explore later in this book.
  • Ownership: Who is responsible for the data? This definition is crucial so that we can contact the owner when there are problems, or attribute the responsibility for keeping and maintaining the data.
  • Security: Data security is a pressing topic nowadays. With so many regulations about data privacy, it has become an obligation for data engineers and scientists to know at least the basics of encryption, sensitive data handling, and how to avoid data leaks (a masking sketch also follows this list). Even the languages and libraries we use for work need to be evaluated. That’s why this item is attributed to the three steps in Figure 1.19.
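The volume check described in the data quality bullet could look like the following minimal sketch. The 300 GB expectation comes from the example above; the 10% tolerance and the staging directory are assumptions made for illustration:

# Minimal volume check for the quality pillar (illustrative only).
from pathlib import Path

EXPECTED_BYTES = 300 * 1024**3   # ~300 GB expected per daily batch (from the example above)
TOLERANCE = 0.10                 # assumed: accept a 10% deviation before alerting


def check_batch_volume(staging_dir: str) -> None:
    """Fail loudly if today's ingested volume deviates too much from expectations."""
    actual = sum(f.stat().st_size for f in Path(staging_dir).rglob("*") if f.is_file())
    deviation = abs(actual - EXPECTED_BYTES) / EXPECTED_BYTES
    if deviation > TOLERANCE:
        raise ValueError(
            f"Ingested {actual / 1024**3:.1f} GB, expected ~300 GB "
            f"(deviation {deviation:.0%}); investigate before loading downstream"
        )


# check_batch_volume("/tmp/staging/orders")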
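For the security pillar, one common precaution is to mask or hash sensitive fields as soon as data is ingested. The following is a rough sketch; the field names are hypothetical, and which fields count as sensitive depends on your data and on the regulations that apply to it:

# Illustrative masking of sensitive fields during ingestion.
# The field names are hypothetical; real pipelines should follow the
# organization's data classification and applicable privacy regulations.
import hashlib

SENSITIVE_FIELDS = {"email", "phone_number"}


def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields replaced by SHA-256 hashes."""
    masked = dict(record)
    for field in SENSITIVE_FIELDS & record.keys():
        value = str(record[field]).encode("utf-8")
        masked[field] = hashlib.sha256(value).hexdigest()
    return masked


# Example usage:
# mask_record({"order_id": 1, "email": "user@example.com"})
# -> {'order_id': 1, 'email': '<64-character hex digest>'}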

In addition to the topics we explored, a global data governance project has a vital role called the data steward, who is responsible for managing an organization’s data assets and ensuring that data is accurate, consistent, and secure. In summary, data stewardship is the management and oversight of an organization’s data assets.

See also

You can read more about a recent vulnerability found in one of the most used tools for data engineering here: https://www.ncsc.gov.uk/information/log4j-vulnerability-what-everyone-needs-to-know.
