Future-proofing is …
To future-proof a solution means to build one that is relevant today, scalable, and cost-effective, and that will remain relevant in the future. This goal is attainable through a constant focus on building out a reference architecture grounded in best practices and design patterns.
The goal is as follows:
Develop a scalable, affordable IT strategy, architecture, and design that leads to the creation of a future-proof data processing system.
When faced with the preceding goal, you have to consider that change is evolutionary rather than revolutionary, which means a data architecture must be solid enough to absorb that change and remain future-proof. Making a system 100% future-proof is an illusion; however, the goal of attaining a near future-proof system must always remain a prime driver of your core principles.
The attraction of shiny lights must never become bait that catches an IT system manager in a web of errors, even though cool technology may attract a lot of venture and seed capital or even create a star on one’s curriculum vitae (CV). It may just as well fade away after a breakthrough in a niche area is achieved by a disruptor. Just look at what happened when OpenAI, ChatGPT, and related large language model (LLM) technology started to roll out. Conversational artificial intelligence (AI) has already changed many systems.
After an innovation rolls out, what was once hard becomes easy, often available in open source, and eventually commoditized. Even if a business software method or process-oriented intellectual property (IP) is locked away with patent protection… after some time – 10, 15, or 20 years – it too becomes free for reuse. The filing disclosure of the IP also makes valuable insights available to the competition. There can only be so many cutting-edge tech winners, and brilliant minds tend to work on the same problem at the same time until a breakthrough is attained, often arriving at similar approaches. Data engineering is now nearing such an inflection point.
There will always be many more losers than winners. Depending on the size of an organization’s budget and its culture of risk/reward, a shiny light idea can arise and become a blazing star. Yet 90% of those who pursue the shooting star wind up developing a dud that fades away along with an entire IT budget. Our suggestion is to follow the business’s money and develop with agility to minimize the risk of IT-driven failure.
International Data Corporation (IDC) and the business intelligence organization Qlik came up with the following comparison:
“Data is the new water.”
You can say that data is oil or that it is water – a great idea gets twisted and repurposed, even in these statements. What is essential is that data becomes information and that information is rendered in such a way as to create direct, inferred, and derived knowledge. Truth needs to be defined as knowledge in context, including time. We need systems that are not mere data processing systems but knowledge-aware systems that support intelligence, insight, and the development of truths that withstand the test of time. In that way, a system may be future-proof. Data on its own is too murky, like dirty water. It’s clouded by the following:
- Nonsensical structures developed to work around current machine limitations
- Errors due to misunderstanding of the data’s meaning and lineage
- Deliberate opacity due to privacy and security
- Missing context or state due to missing metadata
- Missing semantics, because complex relationships go unrecorded due to missing data and a lack of funding to properly model the data for the domain in which it was collected
Data life cycle processes and costs are often not considered fully. Business use cases drive what is important (note: we will elaborate a lot more on how use cases are represented by conceptual, logical, and physical architectures in Chapters 5-7 of this book), yet use cases are often not identified early enough. The data services implemented as part of the solution are often left undocumented; they are neither communicated well nor maintained well over the data’s timeframe of relevancy. The result is that the data’s quality melts down like a sugar cube left in the rain, its efficacy and value degrading over time. This decay may be accelerated when the business and technical contracts are not maintained, and with that neglect comes a loss of trust in a dataset’s governance. The resulting friction between business silos becomes palpable. A potential solution has been to create business data services with data contracts. These contracts are defined by well-maintained metadata and describe the dataset at rest (its semantics) as well as its origin (its lineage) and security methods. They also include software service contracts for the timely maintenance of the subscribed quality metrics.
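To make that idea more concrete, here is a minimal sketch of what such a data contract might look like when captured in code. The field names, dataset, owner address, and thresholds are hypothetical illustrations, not a prescribed standard:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class QualityMetric:
    name: str          # e.g., "completeness" or "freshness_minutes"
    threshold: float   # the subscribed service level for this metric

@dataclass
class DataContract:
    dataset: str
    owner: str
    semantics: Dict[str, str]        # column -> business meaning (the dataset at rest)
    lineage: List[str]               # upstream sources the data is derived from
    security_classification: str     # e.g., "public", "internal", "PII-restricted"
    quality_metrics: List[QualityMetric] = field(default_factory=list)

# A hypothetical contract for a retail point-of-sale dataset
pos_contract = DataContract(
    dataset="retail.point_of_sale_daily",
    owner="sales-data-products@example.com",
    semantics={"txn_amount": "Gross sale value in the store's local currency"},
    lineage=["raw.pos_terminal_feed"],
    security_classification="internal",
    quality_metrics=[QualityMetric("completeness", 0.99)],
)
```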
Businesses need to enable datasets to be priced, enhanced as value-added sets, and even sold to the highest bidder. This is driven over time by the cost of maintaining data systems, which can only increase. A key objective is to keep the data relevant (correct) as it is submitted for value-added enrichment and re-integration into commoditized data exchanges:
Don’t move data; enrich it in place along with its metadata to preserve semantics and lineage!
The highest bidder builds on the data according to the framework architecture and preserves the semantic domain for which the data system was modeled. Like a ratchet that never loses its grip, datasets need to remain correct and keep their hold on reality over time. That reality, for which the dataset was created, can then be extended by value-added resellers without sacrificing quality or the data service level.
Observe that, over time, the cost of maintaining data correctness, context, and relevance will exceed any single organization’s ability to sustain it for a domain. Naturally, it remains instinctual for the IT leader to hold on to the data and produce a silo. This instinct to hide the imperfections of an established system that is steadily melting down must be addressed in the future data architecture’s approach. Allowing the data to evolve and drift, be value-added, and yet remain correct and maintainable is essential; with this approach, the imperfect alignment of facts, assertions, and other modeled relationships within a domain would be diminished.
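The following sketch illustrates the don’t-move-data, enrich-it-in-place principle from the earlier callout. It assumes a parquet file with a JSON metadata sidecar; the file paths, columns, and conversion rate are hypothetical:

```python
import json
import pandas as pd

# Enrich the dataset where it lives rather than exporting a copy,
# and record the enrichment so semantics and lineage are preserved.
df = pd.read_parquet("retail/point_of_sale_daily.parquet")

# Value-added enrichment: add a normalized USD amount next to the raw figure.
FX_RATE_TO_USD = 1.08  # hypothetical; would normally come from a reference feed
df["txn_amount_usd"] = df["txn_amount"] * FX_RATE_TO_USD

# Update the metadata sidecar so downstream consumers can trace the new column.
with open("retail/point_of_sale_daily.metadata.json") as f:
    metadata = json.load(f)
metadata["semantics"]["txn_amount_usd"] = "Derived: txn_amount converted to USD"
metadata["lineage"].append("enrichment:txn_amount_usd:v1")
with open("retail/point_of_sale_daily.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Write back in place; the dataset grows in value without moving.
df.to_parquet("retail/point_of_sale_daily.parquet", index=False)
```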
Too often in today’s processing systems, the data is curated to the point where it is considered good enough for now. Yet it is not good enough for future repurposing; it carries all the assumptions, gaps, fragments, and partial data implementations that made it just good enough. If the data is right and self-explanatory, its data service code is simpler and the system solution can be engineered to be elegant. It is built to withstand the pressure of change because the data organization was designed to evolve and remain 100% correct for the business domain.
“There is never enough time or money to get it right… the first time! There is always time to get it right later… again and again!”
This pragmatic approach can stop the IT leader’s search for a better data engineering framework. Best practices can come to feel like a bother since the solution just works, and we don’t want to fix what works. However, you must get real about the current tooling choices available. The cost to implement any solution must be the right fit, yet as part of the architecture due diligence process, you still need to push against the edge of technology to seize innovation opportunities when they are ripe for the taking.
Consider semantic graph technology based on OWL and RDF, with its modeling complexity and SPARQL-based querying and validation, compared to using labeled property graphs with custom code for the semantic representation of data in a subject domain’s knowledge base. Both have advantages and disadvantages; however, neither scales without implementing a change-data-capture mechanism that syncs an in-memory analytics storage area to support real-time analytics use cases. Cloud technology has not kept up by producing a one-size-fits-all data store, data lake, or data warehouse. Better said, one technology solution that fits all use cases and operational service requirements does not exist.
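For a feel of the RDF/SPARQL side of that comparison, here is a minimal sketch using the open source rdflib Python library; the example.org namespace and the chemical-property triples are hypothetical:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/chem/")
g = Graph()
g.bind("ex", EX)

# A few triples from a hypothetical chemical property reference dataset
g.add((EX.aspirin, RDF.type, EX.Compound))
g.add((EX.aspirin, EX.hasProperty, EX.antiInflammatory))
g.add((EX.ibuprofen, RDF.type, EX.Compound))
g.add((EX.ibuprofen, EX.hasProperty, EX.antiInflammatory))

# SPARQL query: which compounds carry the anti-inflammatory property?
results = g.query("""
    PREFIX ex: <http://example.org/chem/>
    SELECT ?compound WHERE {
        ?compound a ex:Compound ;
                  ex:hasProperty ex:antiInflammatory .
    }
""")
for row in results:
    print(row.compound)

# A labeled property graph (for example, Neo4j queried with Cypher) would model
# the same facts as nodes and typed relationships, trading standards-based
# semantics for custom code and, often, simpler developer ergonomics.
```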
Since one size does not fit all, one data representation does not fit all use cases.
A monolithic data lake, Delta Lake, raw data store, or data warehouse does not fit the business needs. Logical, and often physical, segmentation of data is needed to create the right-sized solution that supports the required use cases. The data engineer has to balance cost, security, performance, scale, and reliability requirements, as well as provider limitations. Just as one shoe size does not fit all… the solution has to be implementable and remain functional over time.
Organization into zone considerations
One facet of the data engineering best practices presented in this book is the need for a primary form of data representation for important data patterns. A raw ingest zone is envisioned to hold input Internet of Things (IoT) data, raw retailer point-of-sale data, chemical property reference data, or web analytics usage data. We are proposing that the concept of the zone be a formalization of the layers set forth in the Databricks Medallion Architecture (https://www.databricks.com/glossary/medallion-architecture). It may be worth reading through the structure of that architecture pattern or waiting until you get a chance to read Chapter 6, where a more detailed explanation is provided.
Raw data may need data profiling applied as part of ingest processing to make sure that input data is not rejected due to syntactic or semantic incorrectness. This profiled data may even be normalized in a basic manner prior to the next stage of processing in the data pipeline journey. Its transformation involves processing into the bronze zone and later into the silver zone, then the gold zone, until it is finally made ready for the consumption zone (for real-time, self-serve analytics use cases).
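As a concrete illustration, the following sketch profiles a raw IoT feed at ingest, flagging suspect rows rather than rejecting them so nothing is lost before the bronze zone; the column names and value ranges are hypothetical:

```python
import pandas as pd

def profile_raw_iot(df: pd.DataFrame) -> pd.DataFrame:
    """Attach profiling flags to raw IoT readings instead of dropping them."""
    checks = pd.DataFrame(index=df.index)
    checks["missing_device_id"] = df["device_id"].isna()
    checks["bad_timestamp"] = pd.to_datetime(df["event_time"], errors="coerce").isna()
    checks["temp_out_of_range"] = ~df["temperature_c"].between(-60, 120)

    profiled = df.copy()
    profiled["profile_flag_count"] = checks.sum(axis=1)  # 0 means every check passed
    return profiled

# Flagged rows still land in the raw and bronze zones; downstream zones decide
# whether to repair, impute, or exclude them.
```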
The bronze, silver, and gold zones host information of varying classes. The gold zone’s data organization looks a lot like a classic data warehouse, and the bronze zone looks like a data lake, with the silver zone being a cache-enabled data lake holding a lot of derived, imputed, and inferred data drawn from processing the bronze zone. The silver zone’s data supports online transaction processing (OLTP) use cases but stores processed outputs in the gold zone. The gold zone may also support OLTP use cases directly against its information.
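Here is a minimal sketch of that zone progression, assuming a running Spark session and parquet-backed zone paths; a real deployment would more likely use Delta Lake tables, and the paths, columns, and aggregations shown are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zone-pipeline-sketch").getOrCreate()

# Bronze: land the profiled raw feed as-is, adding only ingestion metadata.
bronze = (spark.read.json("raw/pos_feed/")
               .withColumn("ingested_at", F.current_timestamp()))
bronze.write.mode("append").parquet("bronze/pos_feed/")

# Silver: cleanse, deduplicate, and derive values from the bronze data.
silver = (spark.read.parquet("bronze/pos_feed/")
               .dropDuplicates(["txn_id"])
               .dropna(subset=["txn_amount"])
               .withColumn("gross_margin", F.col("txn_amount") - F.col("txn_cost")))
silver.write.mode("overwrite").parquet("silver/pos_feed/")

# Gold: aggregate into warehouse-style facts ready for the consumption zone.
gold = (silver.groupBy("store_id")
              .agg(F.sum("gross_margin").alias("total_margin")))
gold.write.mode("overwrite").parquet("gold/store_margin/")
```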
The consumption zone is enabled to provide for the measures, calculated metrics, and online analytical processing (OLAP) needs of the user. Keeping it all in sync can become a nightmare of complexity without a clear framework and best practices to keep the system correct. Just think about the loss of linear dataflow control in the AWS or Azure cloud PaaS solution required to implement this zone blueprint. Without a clear architecture, data framework, best practices, and governance… be prepared for much trial and error.
Cloud limitations
When architecting, data engineering best practices must take into consideration current cloud provider limitations and constraints that drive the cost of data movement and of deploying third-party tools for analytics. Consider the ultimate: a zettabyte cube of memory with sub-millisecond access to terabytes of data, where compute code resides with the data to support relationships in a massive fabric or mesh. Impossible, today! But wait… maybe tomorrow this will be reality. Meanwhile, how do you build today in order to move effortlessly to that vision in the future? That is the focus of this book’s best practices. All trends point to the eventual creation of big-data, AI-enabled data systems.
There are some key trends and concepts forming as part of that vision. Data sharing, confidential computing, and concepts such as bring-your-algorithm-to-the-data must be considered core approaches to repurposing datasets, their semantics, and their business value as data leaves the enterprise and enters the publicly available domain. With the data, information, and derived knowledge comes the data consumption handle, which is more than loosely defined metadata. It consists of the lineage, semantics, context, and timely value needed to sustain trust so that monetary compensation for the stream of value-added information becomes possible. These royalty contracts make data resellable and demystified. Just as a book is published, so can data be published. The best practices of this book will position data to support the development of value-added services over curated information in the course of time. Smart data becomes a value-added ecosystem in and of itself, one that is as important as the software data processing systems of past generations, where data was but a snapshot of the processing state for which that system was created.
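To make the idea of a data consumption handle more tangible, here is a hedged sketch of what such a structure could contain; the shape and field names are our hypothetical illustration rather than an established standard:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class ConsumptionHandle:
    dataset: str
    lineage: List[str]          # upstream sources and the transformations applied
    semantics: Dict[str, str]   # business meaning of each data element
    context: str                # the domain and conditions under which the data was collected
    valid_as_of: datetime       # the timeliness needed to sustain trust
    royalty_terms: str          # the contract governing value-added resale
```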
The Intelligence Age
Future-state data-as-a-service (DaaS) offerings will depend on the new data engineering framework that this book will highlight. We will also show the best practice considerations for the development framework required by that structure. Curating data and its metadata into information, knowledge, and then insights involves transforming data into truths that withstand the test of time. This novel data framework, required for the commoditization of information, is essential as we exit the Information Age and enter the Intelligence Age. In the Intelligence Age, insights are gathered from knowledge formed from information operated on by human and AI systems.
Achieving AI goals requires the application of various machine learning and deep learning algorithms that define the Intelligence Age. Along with these algorithms, you can envision the development of extremely advanced quantum computers and zettabyte in-memory storage arrays, which will be needed as part of the new data engineering organization. What is often not discussed is the data engineering framework required to facilitate those algorithms; without it, the data lake will become a data swamp in short order. The power of computing will advance in leaps and bounds. What we build today in software systems paves the way for those hardware advances – not the reverse. Software algorithms drive the hardware computing power needed; otherwise, the hardware remains underutilized. Data engineering has been a laggard in this evolutionary process.
For this purpose, this book has been created to future-proof data engineering architectures and designs by providing best practices, along with considerations for senior IT leadership. It also offers practical data architecture approaches with use case justifications for the data architect and data engineer; these will add color to the best practice descriptions.