Data Ingestion with Python Cookbook
Applying data governance in ingestion

Data governance is a set of methodologies that ensure that data is secure, available, well-stored, documented, private, and accurate.

Getting ready

Data ingestion is the first stage of the data pipeline, but that doesn't mean data governance is applied any less heavily here. The governance status of the final pipeline output depends on how governance was implemented during ingestion.

The following diagram shows how data ingestion is commonly conducted:

Figure 1.18 – The data ingestion process

Let’s analyze the steps in the diagram:

  1. Getting data from the source: The first step is to define the type of data, its periodicity, where we will gather it from, and why we need it.
  2. Writing the scripts to ingest data: Based on the answers from the previous step, we can start planning how our code will behave and outline its basic steps.
  3. Storing data in a temporary database or other types of storage: Between the ingestion and the transformation phases, data is typically stored in a temporary database or repository. A minimal code sketch of these three steps follows this list.
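The following is a minimal, hypothetical sketch of these three steps. The REST endpoint, the daily periodicity, and the local staging directory are assumptions made for illustration, not a recipe from this book:

# Minimal sketch: define the source (1), ingest it with a script (2),
# and land the raw data in temporary storage (3). All names are illustrative.
import json
from datetime import date
from pathlib import Path

import requests

SOURCE_URL = "https://api.example.com/v1/orders"  # assumed source and daily periodicity
STAGING_DIR = Path("/tmp/staging/orders")         # assumed temporary storage location


def ingest_daily_orders() -> Path:
    """Fetch raw data from the source and store it in the staging area."""
    response = requests.get(SOURCE_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    target = STAGING_DIR / f"orders_{date.today().isoformat()}.json"
    target.write_text(json.dumps(records))
    return target


if __name__ == "__main__":
    print(f"Raw data staged at {ingest_daily_orders()}")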
Figure 1.19 – Data governance pillars

How to do it…

Step by step, let’s attribute the pillars in Figure 1.19 to the ingestion phase:

  1. A concern for accessibility needs to be applied at the data source level, defining the individuals who are allowed to see or retrieve data (a minimal sketch of such a check appears after Figure 1.20).
  2. Next, it is necessary to catalog our data to understand it better. Since we are only covering data ingestion here, it is most relevant to catalog the data sources.
  3. The quality pillar will be applied to the ingestion and staging areas, where we control the data and keep its quality aligned with the source.
  4. Then, let's define ownership. We know the data source belongs to a business area or a company. However, once we ingest the data and put it into temporary or staging storage, it becomes our responsibility to maintain it.
  5. The last pillar involves keeping data secure throughout the whole pipeline. Security is vital in all steps, since we may be handling private or sensitive information.
Figure 1.20 – Adding to data ingestion
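To make the accessibility pillar from step 1 concrete, here is a minimal sketch that gates a source read behind an allow-list of roles. The role names and the placeholder retrieval function are hypothetical; in practice, accessibility is usually enforced by the source system itself (database grants, IAM policies) rather than re-implemented in the ingestion code:

# Illustrative only: the roles and the data source are made up for this example.
ALLOWED_ROLES = {"data_engineer", "analytics_service"}


def read_from_source(query: str, role: str) -> list[dict]:
    """Only roles on the allow-list may retrieve data from the source."""
    if role not in ALLOWED_ROLES:
        raise PermissionError(f"Role '{role}' is not allowed to access this source")
    # Placeholder for the actual retrieval (e.g., a database query or API call).
    return [{"query": query, "rows": []}]


# Example usage:
# read_from_source("SELECT * FROM orders", role="data_engineer")  # permitted
# read_from_source("SELECT * FROM orders", role="intern")         # raises PermissionError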

How it works…

While some articles define “pillars” for establishing good governance practices, the best way to understand how to apply them is to understand what they comprise. As you saw in the previous How to do it… section, we attributed these items to our pipeline, and now we can understand how they are connected to the following topics:

  • Data accessibility: Data accessibility is how people from a group, organization, or project can see and use data. The information needs to be readily available for use, but only to the people involved in the process. For example, access to sensitive data should be restricted to specific people or programs. In the diagram we built, we applied this pillar to our data sources, since we need to understand and retrieve the data. For the same reason, it can be applied to temporary storage as well.
  • Data catalog: Cataloging and documenting data are essential for business and engineering teams. When we know what types of information reside in our databases or data lakes and have quick access to their documentation, the time needed to solve a problem becomes much shorter.

Likewise, documenting our data sources makes the ingestion process quicker, since we avoid having to run a discovery exercise every time we need to ingest data.

  • Data quality: Quality is a constant concern when ingesting, processing, and loading data. It is essential to track and monitor the data’s expected input and output volumes according to its periodicity. For example, if we expect to ingest 300 GB of data per day and it suddenly drops to 1 GB, something is very wrong and will affect the quality of our final output (a minimal volume-check sketch follows this list). Other quality parameters can be the number of columns, partitioning, and so on, which we will explore later in this book.
  • Ownership: Who is responsible for the data? This definition is crucial so that we can contact the owner when there are problems, or attribute the responsibility for keeping and maintaining the data.
  • Security: Data security is a pressing topic nowadays. With so many regulations about data privacy, it has become an obligation for data engineers and scientists to know at least the basics of encryption, sensitive data handling, and how to avoid data leaks (a masking sketch also follows this list). Even the languages and libraries we use for work need to be evaluated. That’s why this item is attributed to the three steps in Figure 1.19.
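The volume check described in the data quality bullet could look like the following minimal sketch. The 300 GB expectation comes from the example above; the 10% tolerance and the staging directory are assumptions made for illustration:

# Minimal volume check for the quality pillar (illustrative only).
from pathlib import Path

EXPECTED_BYTES = 300 * 1024**3   # ~300 GB expected per daily batch (from the example above)
TOLERANCE = 0.10                 # assumed: accept a 10% deviation before alerting


def check_batch_volume(staging_dir: str) -> None:
    """Fail loudly if today's ingested volume deviates too much from expectations."""
    actual = sum(f.stat().st_size for f in Path(staging_dir).rglob("*") if f.is_file())
    deviation = abs(actual - EXPECTED_BYTES) / EXPECTED_BYTES
    if deviation > TOLERANCE:
        raise ValueError(
            f"Ingested {actual / 1024**3:.1f} GB, expected ~300 GB "
            f"(deviation {deviation:.0%}); investigate before loading downstream"
        )


# check_batch_volume("/tmp/staging/orders")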
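For the security pillar, one common precaution is to mask or hash sensitive fields as soon as data is ingested. The following is a rough sketch; the field names are hypothetical, and which fields count as sensitive depends on your data and on the regulations that apply to it:

# Illustrative masking of sensitive fields during ingestion.
# The field names are hypothetical; real pipelines should follow the
# organization's data classification and applicable privacy regulations.
import hashlib

SENSITIVE_FIELDS = {"email", "phone_number"}


def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields replaced by SHA-256 hashes."""
    masked = dict(record)
    for field in SENSITIVE_FIELDS & record.keys():
        value = str(record[field]).encode("utf-8")
        masked[field] = hashlib.sha256(value).hexdigest()
    return masked


# Example usage:
# mask_record({"order_id": 1, "email": "user@example.com"})
# -> {'order_id': 1, 'email': '<64-character hex digest>'}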

In addition to the topics we explored, a global data governance project has a vital role called the data steward, who is responsible for managing an organization’s data assets and ensuring that data is accurate, consistent, and secure. In summary, data stewardship is the management and oversight of an organization’s data assets.

See also

You can read more about a recent vulnerability found in one of the most used tools for data engineering here: https://www.ncsc.gov.uk/information/log4j-vulnerability-what-everyone-needs-to-know.
