Data Engineering Best Practices

You're reading from Data Engineering Best Practices: Architect robust and cost-effective data solutions in the cloud era

Product type Paperback
Published in Oct 2024
Publisher Packt
ISBN-13 9781803244983
Length 550 pages
Edition 1st Edition
Authors (2): David Larochelle, Richard J. Schiller
Table of Contents (21 chapters)

Preface
Chapter 1: Overview of the Business Problem Statement (free chapter)
Chapter 2: A Data Engineer’s Journey – Background Challenges
Chapter 3: A Data Engineer’s Journey – IT’s Vision and Mission
Chapter 4: Architecture Principles
Chapter 5: Architecture Framework – Conceptual Architecture Best Practices
Chapter 6: Architecture Framework – Logical Architecture Best Practices
Chapter 7: Architecture Framework – Physical Architecture Best Practices
Chapter 8: Software Engineering Best Practice Considerations
Chapter 9: Key Considerations for Agile SDLC Best Practices
Chapter 10: Key Considerations for Quality Testing Best Practices
Chapter 11: Key Considerations for IT Operational Service Best Practices
Chapter 12: Key Considerations for Data Service Best Practices
Chapter 13: Key Considerations for Management Best Practices
Chapter 14: Key Considerations for Data Delivery Best Practices
Chapter 15: Other Considerations – Measures, Calculations, Restatements, and Data Science Best Practices
Chapter 16: Machine Learning Pipeline Best Practices and Processes
Chapter 17: Takeaway Summary – Putting It All Together
Chapter 18: Appendix and Use Cases
Index
Other Books You May Enjoy

What is the business problem statement?

Data engineering approaches are rapidly morphing today, and they will coalesce into a systemic, consistent whole. At the core of this transformation is the realization that data is information that needs to represent facts and truths, along with the rationalization that created those facts and truths over time. There must not be any false facts in future information systems. That phrase may strike you as odd. Can a fact be false? The question may be a bit provocative, but haven’t we often built IT systems to determine just that?

We process data in software systems that preserve business context and meaning but force the data to be served only through those systems. The data does not stand alone, and if consumed out of context, it leads to false facts propagating into the business environment. Data can’t stand alone today; it must be transformed by information processing systems, which have technical limitations. Pragmatic programmers’ {https://packt-debp.link/zS3jWY} imperfect tools and technology will produce imperfect solutions. Nevertheless, the engineer is still tasked with removing as many false facts as possible, if not all of them, when producing a solution. That has proven elusive in the past.

We often take shortcuts. We also justify these shortcuts with statements like: “there simply is not enough time!” or “there’s no way we can get all that data!” The business “can’t afford to curate it correctly,” or lastly “there’s no funding for boiling the ocean.” We do not need to boil the ocean.

Instead, we are going to think about how to turn that ocean directly into steam! This should be our response, not a rollover. This rethinking mindset is exactly what is needed as we engineer solutions that will be future-proof. What is hard is still possible if we rethink the problem fully. To turn the metaphor around, we will use data as the new fuel for the engine of innovation.

Fun fact

In 2006, mathematician Clive Humby coined the phrase “data is the new oil” {https://packt-debp.link/SiG2rL}.

Data systems must become self-healing of false facts if they are to be knowledge-complete. After all, what is a true fact? Is it not just a hypothesis backed by evidence, held until future observations disprove it? Likewise, organizing information into knowledge requires capturing not just semantics, context, and time series relevance but also the asserted reason for a fact being represented as information truth within a dataset. This is what knowledge defines: truth! However, truth needs correct representation.

Note

The truth of a knowledge base is composed of facts backed by assertions that withstand the test of time and that preserve, rather than hide, the information context from which that truth is formed.

But sometimes, when we do not have enough information, we guess. This guessing is based on intuition and prior experience with similar patterns of interconnected information from related domains. We humans can be very wrong with our guesses. But strongly intuited guesses can lead to great leaps in innovation, which can later be backfilled with empirically collected data.

Until then, we often stretch the truth to span gaps in knowledge. Information relationship patterns need to be retained, along with the hypotheses that record our educated guesses. In this manner, data truths can be guessed, and guessed well! These guesses can even be unwound when proven wrong. It is essential that data is organized in a new way to support intelligence: reasoning is needed to support or refute hypotheses, and the retention of information as knowledge to form truth is essential. If we don’t address organizing big data to form knowledge and truth within a framework consumable by the business, we are just wasting cycles and funding on cloud providers.

This book will focus on best practices, but there are a couple of poor practices that need to be highlighted first. These form anti-patterns that have crept into the data engineer’s tool bag over time and that hinder the mission we seek to accomplish. Let’s look into these anti-patterns next.

Anti-patterns to avoid

What are anti-patterns? Consider patterns first: architectural patterns form blueprints that ease implementation. Just as when constructing a physical building, a civil architect uses blueprints to definitively communicate expectations to the engineers. If a common solution recurs and succeeds, it is reused often as a pattern, like the framing of a wall or a truss for a type of roofline. An anti-pattern, by contrast, is a pattern to be avoided: like not putting plumbing in an outside wall in a cold climate, because freezing temperatures could burst those pipes.

The first anti-pattern we describe deals with retaining data that we think is valuable but that can no longer be understood or processed given how it was stored; its contextual meaning was lost because it was never captured when the data was first retained in storage (such as cloud storage).

The second anti-pattern involves not knowing the business owner’s meaning for column-formatted data, nor how those columns relate to each other to form business meaning, because this meaning was preserved only in the software solution, not in the data itself. We rely on entity relationship diagrams (ERDs), which are often not worth the paper they were printed on, to gain some degree of clarity that is lost the next time an agile developer fails to update them. Knowing what we must avoid as we develop a future-proof, data-engineered solution will help set the foundation for this book.

To better understand the two anti-patterns just introduced, the following specific examples should help illustrate what to avoid.

Anti-pattern #1 – Must we retain garbage?

As an example of what not to do: in the past, I examined a system that retained years of data, only to be reminded that the data was useless after three months. This is because the processing code that created that data had changed hundreds of times in prior years and continued to evolve without those changes being noted in the dataset it produced. The assumptions put into those non-mastered datasets were not preserved in the data framework. Keeping that data around was a trap, just waiting for some future big data analyst to try to reuse it. When I asked, “Why was it even retained?” I was told it had to be, according to company policy. We are often faced with someone who thinks piles of stuff are valuable, even if they’re not processable. Some data can be the opposite of valuable: it can be a business liability if reused incorrectly. Waterfall-gathered business requirements, or even loads of agile development stories, will not solve this problem without a solid data framework for data semantics as well as data lineage for the data’s journey from information to knowledge. Without this smart data framework, the insights gathered would be wrong!
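One way to avoid this anti-pattern is to never write a dataset without a sidecar manifest that records how it was produced. The sketch below is illustrative only; the function name, file layout, and manifest fields are hypothetical, not a prescribed standard, but the idea is exactly what the example above lacked: the code version and run assumptions travel with the data.

```python
# A minimal sketch (hypothetical names) of writing a dataset alongside the
# lineage metadata that anti-pattern #1 discards: the version of the
# processing code and the assumptions baked into the run.
import hashlib
import json
from datetime import datetime, timezone

def write_with_lineage(records, path_prefix, code_version, assumptions):
    """Persist records plus a sidecar manifest describing how they were made."""
    payload = json.dumps(records, sort_keys=True)
    manifest = {
        "produced_at": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,   # e.g. a git commit SHA
        "assumptions": assumptions,     # human-readable run context
        "content_sha256": hashlib.sha256(payload.encode()).hexdigest(),
        "record_count": len(records),
    }
    with open(f"{path_prefix}.json", "w") as f:
        f.write(payload)
    with open(f"{path_prefix}.manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

manifest = write_with_lineage(
    [{"user": "a", "score": 3}],
    "/tmp/usage_2024_10",
    code_version="9f2c1d4",
    assumptions=["score excludes bot traffic", "UTC day boundaries"],
)
print(manifest["record_count"])  # 1
```

A future analyst who finds the data file also finds the manifest, and can decide whether the assumptions of that run still hold before reusing the data.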

Anti-pattern #2 – What does that column mean?

Likewise, as another not-to-do example, I once built an elaborate, colorful graphical rendering of web consumer usage across several published articles. It was truly a work of art, if I say so myself. The insight clearly illustrated that some users were just not engaging with a few key classes of information that were expensive to curate. However, it was a work of pure fiction and had to be scrapped! This was because I had misused one key dataset column that was, in fact, loaded with the inverted rank of user access rather than an actual usage value.

During the development of the data processing system, the prior developers produced no metadata catalog, no data architecture documentation, and no self-serve textual definitions of the columns. All that information was retained in the mind of one self-serving data analyst, who was holding the business data hostage and pocketing huge compensation for generating insights that only that individual could produce. Any attempt to dethrone this individual was met with one key and powerful consumer of the insights overruling IT management. As a result, the implementation of desperately needed governance-mandated enterprise standards for analytics was stopped. Using the data in such an environment was a walk through a technical minefield.

Organizations must avoid this scenario at all costs. It is a data-siloed, poor-practice anti-pattern that arises when individuals seek to preserve a niche position or a siloed business agenda. In the case just illustrated, the anti-pattern was used to kill the governance-mandated enterprise standard for analytics. The data can be protected from such abuse by properly implementing governance in a data framework where data becomes self-explanatory.
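The self-serve column definitions that were missing in this story need not be elaborate. The sketch below uses hypothetical class and column names to show the minimum useful shape: every column carries a plain-language business meaning and an accountable owner, and an undocumented column is treated as an error rather than something to guess at. A real system would back this with a governed metadata store, not in-process objects.

```python
# A minimal sketch of the self-serve column catalog that was missing in
# anti-pattern #2. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ColumnDef:
    name: str
    dtype: str
    description: str   # business meaning, in plain language
    owner: str         # accountable business owner, not one analyst's memory

@dataclass
class DatasetCatalog:
    columns: dict = field(default_factory=dict)

    def register(self, col: ColumnDef) -> None:
        self.columns[col.name] = col

    def describe(self, name: str) -> str:
        col = self.columns.get(name)
        if col is None:
            # Refuse to guess: an undocumented column is an error, not a default
            raise KeyError(f"Undocumented column: {name!r}")
        return f"{col.name} ({col.dtype}): {col.description} [owner: {col.owner}]"

catalog = DatasetCatalog()
catalog.register(ColumnDef(
    name="article_rank",
    dtype="int",
    description="Inverted rank of user access (1 = most accessed); NOT a usage count",
    owner="Content Analytics",
))
print(catalog.describe("article_rank"))
```

Had such a catalog existed, the misread column in the rendering example above would have announced itself as a rank, not a usage value, before any chart was built.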

Let’s consider a real-world scenario that illustrates both of these anti-patterns. A large e-commerce company has many years of customer purchase data that includes a field called customer_value. Originally, this field was calculated as the total amount the customer had spent, but its meaning changed repeatedly over the years without updates to the supporting documentation. After a few years, it was calculated as total_spending – total_returns. Later, it became predicted_lifetime_value, based on a machine learning (ML) model. When a new data scientist joins the company and uses the field to segment customers for a marketing campaign, the results are disastrous! High-value customers from the early years are undervalued, while new customers are overvalued! This example illustrates how retaining data without proper context (anti-pattern #1) and the lack of clear documentation for data fields (anti-pattern #2) can lead to significant mistakes.
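The semantic history of a field like customer_value can itself be recorded as data. The sketch below is a hedged illustration with made-up dates and definitions: each change to the field's meaning gets an effective date, so anyone reading a historical row can look up which definition was in force when that row was produced.

```python
# A sketch of recording the semantic history of the customer_value field from
# the scenario above. Dates and definitions are illustrative, not real.
from bisect import bisect_right
from datetime import date

# Each entry: (effective_from, definition). Kept sorted by date.
CUSTOMER_VALUE_SEMANTICS = [
    (date(2015, 1, 1), "total_spending"),
    (date(2018, 6, 1), "total_spending - total_returns"),
    (date(2022, 3, 1), "predicted_lifetime_value (ML model)"),
]

def semantics_as_of(when: date) -> str:
    """Return the definition in force when a given row was produced."""
    dates = [d for d, _ in CUSTOMER_VALUE_SEMANTICS]
    i = bisect_right(dates, when) - 1
    if i < 0:
        raise ValueError(f"No recorded semantics before {when}")
    return CUSTOMER_VALUE_SEMANTICS[i][1]

print(semantics_as_of(date(2016, 5, 1)))   # total_spending
print(semantics_as_of(date(2023, 1, 1)))   # predicted_lifetime_value (ML model)
```

With this lookup in place, the new data scientist's segmentation job could refuse to mix rows whose customer_value definitions differ, instead of silently comparing spending totals against model predictions.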

Patterns in the future-proof architecture

Our effort in writing this book is to highlight for the data engineer the reality that in our current information technology solutions, we process data as information when, in fact, we want to use it to inform the business knowledgeably.

Today, we glue solutions together with code that manipulates data to mimic information for business consumption. What we really want is to retain the business information with the data and make the data smart, so that information in context forms knowledge that will yield insights for the data consumer. The progression begins with raw data, which is transformed into information and then into knowledge through the preservation of semantics and context; finally, analytically derived insights are developed. This progression will be elaborated on in future chapters. In Chapter 18, we have included a number of use cases that you will find interesting. From my experience over the years, I’ve learned that making data smarter has always been rewarded.

The resulting insights may be presented to the business in new, innovative ways as the business requires them. The gap we see in the technology landscape is that for data to be leveraged as an insight generator, its journey must be an informed one. Innovation can’t be pre-canned by the software engineer; it is teased out of the minds of business and IT leaders by the knowledge the IT data system presents at different stages of the data journey. This requires data, its semantics, its lineage, its direct or inferred relationships to concepts, its time series, and its context to be retained.

Technology tools and data processing techniques are not yet available to address this need in a single solution, but the need is clearly envisioned. One monolithic data warehouse, data lake, knowledge graph, or in-memory repository can’t solve the total user-originated demand today. Tools need time to evolve. We will need to implement tactically and think strategically regarding what data (also known as truths) we present to the analyst.

Key thought

Implement: Just enough, just in time.

Think strategically: Data should be smart.

Applying innovative modeling approaches can bring systemic and intrinsic risk, even as leveraging new technologies produces key advantages for the business; minimizing the risk of technical or delivery failure is essential. In the academic discussions debating data mesh versus data fabric, we see various cloud vendors and tool providers embracing the need for innovation… but also creating a new technical gravity that can suck in the misinformed business IT leader.

Remember, this is an evolutionary event, and for some it can become an extinction-level event. Microsoft and Amazon can embrace well-architected best practices that foster greater cloud spend and greater cloud vendor lock-in. Cloud platform-as-a-service (PaaS) offerings, cloud architecture patterns, and biased vendor training can be terminal events for a system and its builders. The same goes for tool providers such as the creators of relational database management systems (RDBMSs), data lakes, operational knowledge graphs, and real-time in-memory storage systems. None of these providers or their niche consulting engagements come with warning signs. As a leader trying to minimize risk and maximize gain, you need to keep an eye on the end goal:

“I want to build a data solution that no one can live without – that lasts forever!”

To accomplish this goal, you will need to be very clear on the mission and retain a clear vision going forward. With a well-developed set of principles, best practices, clear positions on key considerations, and an unchallenged governance model, the objective is attainable. Be prepared for battle! The field is always evolving, and there will be challenges to the architecture over time, maybe before it is even operational. Our suggestion is to always be ready for these challenges and not to count on political power alone to enforce compliance or governance of the architecture.

You will want to consider these steps when building a modern system:

  • Collect the objectives and key results (OKRs) from the business and show successes early and often.
  • Always have a demo ready for key stakeholders at a moment’s notice.
  • Keep those key stakeholders engaged and satisfied as the return on investment (ROI) is demonstrated. Also, consider that they are funding your effort.
  • Keep careful track of the feature to cost ratio and know who is getting value and at what cost as part of the system’s total cost of ownership (TCO).
  • Never break a data service level agreement (SLA) or data contract without giving the stakeholders and users enough time to accommodate impacts. It’s best to not break the agreement at all, since it clearly defines the data consumer’s expectations!
  • Architect data systems that are backward-compatible and never produce a broken contract once the business has engaged the system to glean insight. Pulling the rug out from under the business will have a greater impact than not delivering a solution in the first place, since they will have set up their downstream expectations based on your delivery.
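The backward-compatibility point in the list above can be enforced mechanically. The sketch below is a simplified, hypothetical check (real contract tooling also covers nullability, semantics, and SLAs): a new schema version may add columns, but dropping or retyping a column that consumers depend on is flagged as a contract break before it ships.

```python
# A minimal sketch of a backward-compatibility check for a data contract.
# Schemas are {column: dtype} mappings; field names are illustrative.
def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """List contract-breaking changes between two schema versions."""
    problems = []
    for col, dtype in old_schema.items():
        if col not in new_schema:
            problems.append(f"dropped column: {col}")
        elif new_schema[col] != dtype:
            problems.append(f"retyped column: {col} ({dtype} -> {new_schema[col]})")
    # Columns present only in new_schema are additive and allowed.
    return problems

v1 = {"order_id": "string", "amount": "decimal", "placed_at": "timestamp"}
v2 = {"order_id": "string", "amount": "float", "placed_at": "timestamp",
      "channel": "string"}  # added column is fine; the retype is not

print(breaking_changes(v1, v2))  # ['retyped column: amount (decimal -> float)']
```

Wired into a deployment pipeline, such a check turns "never break a data contract" from a policy statement into a gate that fails the release.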

You can see that there are many patterns to consider and some to avoid when building a modern data solution. Software engineers, data admins, data scientists, and data analysts will come with their own perspectives and technical requirements, in addition to the OKRs that the business will demand. Not all technical players will honor the nuances that their peers’ disciplines require. Yet the data engineer has to deliver the future-proof solution while balancing on top of a pyramid of change.

In the next section, we will show you how to keep the technological edge and retain the balance necessary to create a solution that withstands the test of time.
