Evolution of data systems

We have been collecting data for decades. The flat file storage of the 60s gave way to the data warehouses of the 80s, then to Massively Parallel Processing (MPP) and NoSQL databases, and eventually to data lakes. New paradigms continue to be coined, but it is fair to say that most enterprise organizations have settled on some variation of a data lake:

Figure 1.8 – Evolution of big data systems

Cloud adoption continues to grow, with even highly regulated industries such as healthcare and fintech embracing the cloud as a cost-effective way to keep pace with innovation; otherwise, they risk being left behind. Those who cite security as the reason for not going to the cloud should be reminded that many of the massive data breaches splashed across the media in recent years originated from on-premises setups. Cloud architectures face more scrutiny and are, in some ways, more governed and secure.

Rise of cloud data platforms

The underlying data challenges remain the same. Over time, however, there have been three major shifts in architecture offerings, driven by the introduction of the following:

  • Data warehouses
  • Hadoop heralding the start of data lakes
  • Cloud data platforms refining the data lake offerings

The use cases we have been trying to solve across all three generations fall into three categories:

  • SQL-based BI reporting
  • Exploratory data analytics (EDA)
  • Machine learning (ML)

Data warehouses were good at handling modest volumes of structured data and excelled at BI reporting use cases, but they had limited support for semi-structured data and practically no support for unstructured data. Their workloads supported only batch processing. Once ingested, the data sat in a proprietary format, and the systems were expensive, so older data was dropped to make room for new data. And because these systems ran at capacity, interactive queries had to wait for ingestion workloads to finish to avoid straining the system. There were no ML capabilities built into these systems.

Hadoop came with the promise of handling large volumes of data of all types, along with streaming capabilities. In theory, all the use cases were feasible; in practice, they weren't. Schema on read meant that the ingestion path was greatly simplified, and people dumped their data, but the consumption paths became more difficult. Managing a Hadoop cluster was complex, so even upgrading software versions was a challenge. Hive, being SQL-like, was the most popular of the Hadoop stack offerings, but its access performance was slow, so part of the curated data was pushed into data warehouses for their structure. This left data personas to stitch together two systems, with their fair share of fragility and increased end-to-end latency.
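Schema on read is easy to see in code. The following is a minimal PySpark sketch (the file path and any column names are hypothetical): ingestion requires no declared schema, but the consumer inherits the work of validating structure at read time.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Ingestion was trivial -- raw files were simply landed, no schema declared.
# The cost surfaces at consumption: the schema is inferred (or declared)
# only when the data is read, and malformed records show up only now.
events = spark.read.json("/data/raw/events/")  # schema inferred at read time
events.printSchema()

# In PERMISSIVE mode, Spark routes unparseable rows into a _corrupt_record
# column when such rows exist -- and the consumer must handle them.
if "_corrupt_record" in events.columns:
    events.filter(events["_corrupt_record"].isNotNull()).show()

A schema-on-write system would have rejected those malformed rows at load time; here, every downstream reader must decide what to do with them.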

Cloud data platforms were the next entrants; they simplified the infrastructure manageability and governance aspects and delivered on the original promise of Hadoop. Extra attention went into preventing data lakes from turning into data swamps. The elasticity and scalability of the cloud helped contain costs and made the investment worthwhile. These simplification efforts led to broader adoption by data personas.

The following diagram summarizes the end-to-end flow of big data, along with its requirements in terms of volume, variety, and velocity. The process varies on each platform as the underlying technologies are different. The solutions have evolved from warehouses to Hadoop to cloud data platforms to help serve the three main types of use cases across different industry verticals:

Figure 1.9 – The rise of modern cloud data platforms

SQL and NoSQL systems

SQL databases were the forerunners; NoSQL databases arose later with different semantics. NoSQL stores can be roughly classified into the following categories:

  • Key-Value Stores: For example, AWS S3 and Azure Blob Storage (see the sketch after this list)
  • Big Table Stores: For example, DynamoDB, HBase, and Cassandra
  • Document Stores: For example, CouchDB and MongoDB
  • Full Text Stores: For example, Solr and Elastic (both based on Lucene)
  • Graph Data Stores: For example, Neo4j
  • In-memory Data Stores: For example, Redis and MemSQL
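To ground the first category: an object store such as AWS S3 behaves like a giant key-value store, where an object key maps to an opaque value and lookups happen only by key. The following is a minimal sketch using boto3, with a hypothetical bucket and key:

import boto3

s3 = boto3.client("s3")

# PUT: store a value (the object body) under a key.
# The bucket and key names here are hypothetical.
s3.put_object(
    Bucket="my-data-lake",
    Key="users/42.json",
    Body=b'{"id": 42, "name": "Ada"}',
)

# GET: retrieve the value by exact key. There is no query language and
# no secondary index -- access is by key (or key-prefix listing) only.
obj = s3.get_object(Bucket="my-data-lake", Key="users/42.json")
print(obj["Body"].read())

This key-only access pattern is what lets these stores scale so easily, and it is also why richer query needs push users toward the other categories.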

While relational systems honor ACID properties, NoSQL systems were designed primarily for scale and flexibility and honor BASE properties, under which data consistency and integrity are not the highest concerns.

ACID properties are honored in a transaction, as follows:

  • Atomicity: The transaction either completes in full or has no effect at all; there are no partial results.
  • Consistency: Every transaction takes the database from one valid state to another, preserving its integrity constraints.
  • Isolation: Concurrent transactions do not interfere with one another; each behaves as if it were running alone.
  • Durability: Once a transaction commits, its changes persist, even across crashes or power failures.

ACID systems suit use cases that involve highly structured data with predictable inputs and outputs, such as a money transfer process in a financial system, where consistency is the main requirement.
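A money transfer makes the ACID guarantees concrete. The following is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the account IDs are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        cur = conn.cursor()
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                    (amount, src))
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (amount, dst))
        # Consistency: enforce the invariant that balances never go negative.
        (balance,) = cur.execute("SELECT balance FROM accounts WHERE id = ?",
                                 (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.commit()    # Durability: the committed transfer survives restarts
    except Exception:
        conn.rollback()  # Atomicity: on any error, both updates are undone
        raise

transfer(conn, 1, 2, 30.0)  # both balances change together, or neither does

Isolation is handled by the database itself: concurrent transfers are serialized so that neither observes the other's partial updates.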

The BASE properties are as follows:

  • Basically Available: The system remains available even when parts of it fail.
  • Soft State: The system's state may change over time, even without new input, as replicas converge.
  • Eventual Consistency: All nodes eventually reconcile to the same last state, but there may be a window of inconsistency.

This applies to less structured scenarios involving changing schemas, such as a Twitter application scanning words to determine user sentiment. High availability despite failures is the main requirement.
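The BASE behavior can be sketched with a toy model: two in-memory "replicas" accept writes independently and later reconcile via last-writer-wins on a logical timestamp. This is illustrative only and does not model any particular store's replication protocol.

# Two replicas of a key-value store, each mapping key -> (timestamp, value).
replica_a = {}
replica_b = {}

def write(replica, key, value, ts):
    replica[key] = (ts, value)

# Concurrent writes land on different replicas (soft state): during this
# window, a read may return either value depending on the replica it hits.
write(replica_a, "user:42", "Ada", ts=1)
write(replica_b, "user:42", "Grace", ts=2)
assert replica_a["user:42"] != replica_b["user:42"]

def reconcile(a, b):
    # Anti-entropy pass: merge both replicas; the newest timestamp wins.
    for key in set(a) | set(b):
        winner = max(r[key] for r in (a, b) if key in r)
        a[key] = b[key] = winner

# Eventual consistency: after reconciliation, both replicas agree.
reconcile(replica_a, replica_b)
assert replica_a["user:42"] == replica_b["user:42"] == (2, "Grace")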

OLTP and OLAP systems

It is useful to distinguish operational data from analytical data. Data producers typically push data into Operational Data Stores (ODS). Previously, this data was sent on to data warehouses for analytics; more recently, the trend has shifted toward pushing it into a data lake. Different consumers tap into the data at various stages of processing, and some may require a portion of the data from the data lake to be pushed into a data warehouse or a separate serving layer (which can be NoSQL or in-memory).

Online Transaction Processing (OLTP) systems are transaction-oriented, with continuous updates supporting business operations. Online Analytical Processing (OLAP) systems, on the other hand, are designed for decision support, processing many ad hoc and complex queries to analyze those transactions and produce insights:

Figure 1.10 – OLTP and OLAP systems
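The difference shows up directly in query shape. In the following sketch, sqlite3 stands in for both kinds of system, and the orders table is hypothetical: the OLTP statement touches a single row by key, while the OLAP query scans and aggregates many rows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders
                (id INTEGER PRIMARY KEY, region TEXT, amount REAL)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 5.0)])

# OLTP: a short, indexed, single-row write that keeps operations running.
conn.execute("UPDATE orders SET amount = 12.0 WHERE id = 1")

# OLAP: an ad hoc scan-and-aggregate over many rows to produce an insight.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)

At scale, these two shapes pull in opposite directions, which is why dedicated OLTP and OLAP systems evolved separately.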

Data platform service models

Depending on the data team's skill set, the timelines, and the capabilities and flexibility being sought, a decision must be made about the right service model. The following table summarizes the model offerings and the questions you should ask to decide on the best fit:

Figure 1.11 – Service model offerings

The following table further expands on the main value proposition of each service offering, highlighting the key benefits and guidelines on when to adopt them:

Figure 1.12 – How to find the right service model fit
