Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Ingestion with Python Cookbook

You're reading from   Data Ingestion with Python Cookbook A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process

Arrow left icon
Product type Paperback
Published in May 2023
Publisher Packt
ISBN-13 9781837632602
Length 414 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Gláucia Esppenchutz Gláucia Esppenchutz
Author Profile Icon Gláucia Esppenchutz
Gláucia Esppenchutz
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Part 1: Fundamentals of Data Ingestion
2. Chapter 1: Introduction to Data Ingestion FREE CHAPTER 3. Chapter 2: Principals of Data Access – Accessing Your Data 4. Chapter 3: Data Discovery – Understanding Our Data before Ingesting It 5. Chapter 4: Reading CSV and JSON Files and Solving Problems 6. Chapter 5: Ingesting Data from Structured and Unstructured Databases 7. Chapter 6: Using PySpark with Defined and Non-Defined Schemas 8. Chapter 7: Ingesting Analytical Data 9. Part 2: Structuring the Ingestion Pipeline
10. Chapter 8: Designing Monitored Data Workflows 11. Chapter 9: Putting Everything Together with Airflow 12. Chapter 10: Logging and Monitoring Your Data Ingest in Airflow 13. Chapter 11: Automating Your Data Ingestion Pipelines 14. Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime 15. Index 16. Other Books You May Enjoy

Implementing data replication

Data replication is a process applied in data environments to create multiple copies of data and store them on different locations, servers, or sites. This technique is commonly implemented to create better availability and avoid data loss if there is downtime, or even a natural disaster that affects a data center.

Getting ready

You will find across papers and articles different types (or even names) on the best way for data replication decision. In this recipe, you will learn how to decide which kind of replication better suits your application or software.

How to do it…

Let’s begin to build our fundamental pillars to implement data replication:

  1. First, we need to decide the size of our replication, and it can be done using a portion or all the stored data.
  2. The next step is to consider when replication will take place. It can be done synchronously when new data arrives in storage or within a specific timeframe.
  3. The last fundamental pillar is whether the data is incremented or in a bulk form.

In the end, we will have a diagram that looks like the following:

Figure 1.21 – A data replication model decision diagram

Figure 1.21 – A data replication model decision diagram

How it works…

Analyzing the preceding figure, we have three main questions to answer, regarding the extension, the frequency, and whether our replication will be incremental or bulk.

For the first question, we decide whether the replication will be complete or partial. In other words, either the data will consistently be replicated no matter what type of transaction or change was made, or just a portion of the data will be replicated. A real example of this would be keeping track of all store sales or just the most expensive ones.

The second question, related to the frequency, is to decide when a replication needs to be done. This question also needs to take into consideration related costs. Real-time replication is often more expensive, but the synchronicity guarantees almost no data inconsistency.

Lastly, it is relevant to consider how data will be transported to the replication site. In most cases, a scheduler with a script can replicate small data batches and reduce transportation costs. However, a bulk replication can be used in the data ingestion process, such as copying all the current batch’s raw data from a source to cold storage.

There’s more…

One method of data replication that has seen an increase in use in the past few years is cold storage, which is used to retain data used infrequently or is even inactive. The costs related to this type of replication are meager and guarantee data longevity. You can find cold storage solutions in all cloud providers, such as Amazon Glacier, Azure Cool Blob, and Google Cloud Storage Nearline.

Besides replication, regulatory compliance such as General Data Protection Regulation (GDPR) laws benefit from this type of storage, since, for some case scenarios, users’ data need to be kept for some years.

In this chapter, we explored the basic concepts and laid the foundation for the following chapters and recipes in this book. We started with a Python installation, prepared our Docker containers, and saw data governance and replication concepts. You will observe over the upcoming chapters that almost all topics interconnect, and you will understand the relevance of understanding them at the beginning of the ETL process.

You have been reading a chapter from
Data Ingestion with Python Cookbook
Published in: May 2023
Publisher: Packt
ISBN-13: 9781837632602
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image