Architecture principles in depth

The following principles will help guide your decisions when architecting and engineering solutions built on a modern data platform.

Principle #1 – Data lake as a centerpiece? No, implement the data journey!

This may sound shocking, even an anti-follow-the-herd mentality, but it is true! Thinking that the data lake can serve as a source for all data that will be miraculously understood and repurposed over time, leading to great insights, is naïve. Without semantics, context, time series structures, and a clear metadata pattern with governance principles aligned with data mesh and operational data fabric capabilities, it can become a data swamp and a costly liability.

Data needs to be curated in the factory from raw form to consumable form, and it needs structure and a life cycle along its assembled journey through various zones (such as a number of logical data lakes) until it is ready for consumption. Data needs to be released like a product. You cannot just wave a magic wand over the data to get it to this state of readiness for the consumer. That data journey is worth noting, logging, preserving, contracting, securing, and controlling with entitlements and value-added rights for long-term sustainability.

Principle #2 – A data lake’s immutable data is to remain explorable

A logical dataset should first undergo its contracted life cycle’s curation process steps (defined as assembly instructions), along with quality metrics, within a zone. Logical data is then ready for propagation to the next stage in the data journey or made ready for placement into the downstream zone. Datasets will remain explorable and cataloged at the zone boundaries, since that is where contracts are enforced. Datasets will need to have lineage recorded along the journey and then have that lineage condensed (or rolled up) once the dataset is readied for release. It is important that data catalogs have the capability to separate temporary lineage trails from the final trails that must be retained with the released dataset. Many tools today have neither these parallel metadata trails nor a way to roll up the trails to the release set. This is even more complex in a cloud environment where many platform as a service (PaaS) offerings are utilized along the data journey. Cloud metadata services provide tracking capabilities (such as Azure Purview), but these need to be augmented to correctly implement this principle.

Once metadata for the data journey is available, the data should be made explorable. Without the data and its metadata being linked together and made available to search systems, you cannot effectively identify a dataset’s worth, its context, or its fitness for use when it is leveraged by a data engineer or consumer. Datasets, their metadata, and their search indices will need to be locked down, with readiness state as a primary search facet when looking for data across the entire platform-enabled data factory system. It is far too easy to lose track of a widget on a physical factory floor, and the same is true of a dataset element, in whatever state it may be, across the entirety of the data factory.
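
To make the roll-up idea concrete, here is a minimal sketch in plain Python (the names LineageEvent, DatasetLineage, and roll_up_for_release are hypothetical, not a specific tool’s API) of a lineage trail that distinguishes temporary working events from the condensed trail retained with the released dataset:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LineageEvent:
    step: str          # e.g., "ingest", "dedupe", "join:lookup_customers"
    detail: str        # free-form description of what the step did
    temporary: bool    # True for working-trail events that need not survive release

@dataclass
class DatasetLineage:
    dataset_id: str
    events: List[LineageEvent] = field(default_factory=list)

    def record(self, step: str, detail: str, temporary: bool = True) -> None:
        self.events.append(LineageEvent(step, detail, temporary))

    def roll_up_for_release(self) -> "DatasetLineage":
        """Condense the working trail: keep only the events that must be
        retained with the released dataset, dropping temporary ones."""
        kept = [e for e in self.events if not e.temporary]
        return DatasetLineage(self.dataset_id, kept)

# Example: a zone-boundary release keeps only the contracted, non-temporary trail.
lineage = DatasetLineage("sales_2024_q3")
lineage.record("ingest", "raw landing from SFTP", temporary=False)
lineage.record("profiling", "ad hoc null-rate scan", temporary=True)
lineage.record("curate", "conformed to contract v2 schema", temporary=False)
released = lineage.roll_up_for_release()
print([e.step for e in released.events])   # ['ingest', 'curate']
```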

Principle #3 – A data lake’s immutable data remains available for analytics

After all, code is code, and it has bugs that need to be addressed in time.

Errors will exist in data, metadata, second-order and third-order data/metadata, and the various quality processing steps. Data states will be affected every time a code snippet is adjusted, and with that, the history of the change and its effect on previously processed data is subject to auditing and downstream explanation. Analytics snippets are applied throughout the data factory’s data pipelines. These are part of the shift-left testing and test-first design (TFD) methods that must be core to the software design approaches applied throughout the software development life cycle (SDLC). The analytics steps will need to be self-contained and cataloged. Also, the analytics output needs to be clearly cataloged as metadata for the primary, second-order, and third-order data of the pipeline.

Data quality analytics, trend assessments, and consumer analytics will indicate that there may be errors or deviations from the contracted dataset’s state. What do you do with corrective algorithms? In a prior system, corrective algorithms piled change upon change on a curated dataset until the truth of the original dataset was lost. Data corrective steps must not change the immutable source; they should be applied as data lenses on the source data so that, if they were removed, the original could still be leveraged. If necessary, all value-added lenses should be rebuilt when gross errors are encountered.

Some data lenses are temporary in that they will apply only for a limited release or duration of time to correct a gross gap, understatement, or overstatement of a dataset’s raw availability. These also need to be cataloged and auto-removed when no longer pertinent (such as for the next data release).
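
As an illustration, here is a minimal sketch in plain Python (DataLens, view_through_lenses, and the sample correction are hypothetical) of corrective lenses applied over an immutable source, with an expiry so temporary lenses drop away on the next release:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List

Row = Dict[str, float]

@dataclass
class DataLens:
    name: str
    apply: Callable[[List[Row]], List[Row]]   # pure function: never mutates the source
    expires: date                              # auto-removed after this release date

def view_through_lenses(immutable_rows, lenses, as_of):
    """Build a corrected view without touching the immutable source.
    Expired lenses are skipped, so the original is always recoverable."""
    view = [dict(r) for r in immutable_rows]           # defensive copy
    for lens in lenses:
        if lens.expires >= as_of:
            view = lens.apply(view)
    return view

# Hypothetical example: scale a known-understated column for one release only.
source = [{"units": 90.0}, {"units": 110.0}]
lenses = [DataLens("understatement_fix_r42",
                   lambda rows: [{**r, "units": r["units"] * 1.05} for r in rows],
                   expires=date(2024, 12, 31))]
print(view_through_lenses(source, lenses, as_of=date(2024, 10, 1)))
print(source)   # the immutable source is unchanged
```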

Data analytics is not an art but a science, and when it is engineered into a data factory, it will result in the need to explain the state or quality of curated datasets. This will require change when the explanation is, “That’s an error!” When errors arise, new data corrective steps are added that need cataloging and life cycle management as data lenses rather than as changes to the immutable raw or curated data.

Principle #4 – A data lake’s sources are discoverable

A dataset’s source will need to be clear when leveraging the data catalog’s search capabilities with the data lineage trail, with all steps, stages, and corrections evident. This gets very complex and can look like a tree’s roots seen from a bird’s-eye view. There are lots of branches, merges, and algorithm processing nodes, and ultimately, way down deep, you will find the row and column of the source. When tracking data lineage metadata, it is important that the journey be backward discoverable. This is not easy, since a processing step may have implemented a nested update such as “change a column on a table using a transform after a join that first merged a lookup table with a conditional.” What caused a particular row/column change has just been lost. Many metadata tracking systems just report the query and leave it to the forensic analysts to figure out whether the transform was effective. Worse yet, when the update join operation was performed, the lookup table itself may have been updated in another parallel processing step at just the time the join was being made.

This really happened in a prior experience, and the audit trace looked like a perfectly executed operation when it was really producing a gross error, since the joined lookup table was being updated at that exact moment. You could have fixed this with a global system table lock, but that would have slowed the entire factory down. The update was being performed in an optimized service that was out of the development team’s control. A snapshot of the table followed by a quality check was implemented to fix the problem, and the snapshot was then persisted for backward traceability once the pipeline completed.
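
A minimal sketch of that pattern, in plain Python with hypothetical table shapes (snapshot_and_check and enrich are illustrative, not the production implementation):

```python
import copy

def snapshot_and_check(lookup_table, expected_min_rows):
    """Take an immutable snapshot of a lookup table and run a quality check
    before it is used in a join, so a concurrent update elsewhere cannot
    change the join result mid-pipeline."""
    snapshot = copy.deepcopy(lookup_table)          # freeze the state we will join against
    if len(snapshot) < expected_min_rows:
        raise ValueError("lookup snapshot failed quality check: too few rows")
    return snapshot

def enrich(facts, lookup_snapshot, key="region_id"):
    # Join against the snapshot only; the live table may keep changing.
    by_key = {row[key]: row for row in lookup_snapshot}
    return [{**fact, **by_key.get(fact[key], {})} for fact in facts]

# Hypothetical usage: the snapshot would also be persisted for backward traceability.
live_lookup = [{"region_id": 1, "region": "EMEA"}, {"region_id": 2, "region": "APAC"}]
frozen = snapshot_and_check(live_lookup, expected_min_rows=2)
facts = [{"order": "A-17", "region_id": 2}]
print(enrich(facts, frozen))
```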

It is important not to forget the need to assess a data source’s original source when looking backward from the end state. Being able to get past a black-box data pipeline processing stage, by making sure it offers at least a gray-box level of inspection trace, will enable the state of the original data source to be discoverable.

Principle #5 – A data lake’s tooling should be consistent with the architecture

Many third-party tools and cloud services are over-marketed to data engineering management. These tools and services are often only 80% production ready, even though they have been launched as generally available. They come with the need for excessive training, since they exhibit peculiar error-handling characteristics that appear only at the edge of performance, scale, or functional use! A proof of concept (POC) may never expose these characteristics. Only a data engineer with a bulletproof quality standard and a clear methodology will be able to smoke out these bugs (and they are not features), yet a developer must code around the handling of these anomalies. Sometimes the workarounds do not warrant the use of the tool, and this can come to light too late in the project and force re-solutioning of the pipeline. This is not refactoring. It’s a disaster of architecture and design caused by believing in a vendor’s marketware rather than proven software.

Tools and services must be assessed with an objective due diligence process that produces a clear, demonstrable winner with the optimum fit for the architecture; this assessment must precede the POC used to vet vendor fit. Once a vendor is deemed fit, a POC should go forward to vet all integration needs, followed finally by the iterative agile development of the solution.

All too often, the industry herd moves to the next shiny thing that is offered by a start-up vendor. The result is a fragile mess of a solution that exposes the business to risk. The architecture is implemented based on principles and required capabilities. It stands above the niche vendor’s capabilities. It should be used to guide and align any selected vendor capabilities and reject any misaligned capabilities and marketware.

Principle #6 – A data mesh defines data to be governed by domain-driven ownership

We are suggesting that datasets in the data mesh be curated by data pipeline processes with contracts set by domain owners. This will mean that datasets stand alone and do not cross domain boundaries. This is not always the case since data can be adjusted with downstream value-added additions. Ownership becomes shared between the primary domain owner and the value-added owner. The contracts for subsequent downstream consumption are then a mixture between the primary owner and a chain of value-added owners.

With ownership comes rights for entitlements, fair use, and commoditization with or without redistribution rights for the consumable dataset. You can even go so far as to put all contracts and lineage into a blockchain, but that capability has yet to mature into easily integrable third-party offerings. Either way, it is still required to maintain contracts and make them discoverable in a data catalog, along with the metadata for any data at rest, its lineage, and all value-added adjustments.

The domain owner’s data change life cycle and potential for contract default exist, and as such, trust can be eroded over time. Who has not used data thinking it would be refreshed yearly, only to find out that it was only partially refreshed due to the business failure of the domain owner? Departments in a large organization come and go, ownership changes, funding stops, or management is shaken up. The data consumer may never be notified of contract breaches before they impact production. Contracts must therefore be subject to compliance tests, with predictive alerting enabled before failure arises. This way, the consumer has time to react. If data is domain owned and a product, it must be treated as such.
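
A minimal sketch of such a compliance test with early warning, in plain Python with an illustrative refresh-cadence heuristic (check_refresh_contract and its thresholds are hypothetical):

```python
from datetime import datetime

def check_refresh_contract(refresh_history, max_gap_days, alert_margin_days, now):
    """Compliance test with predictive alerting: report a breach if the
    contracted refresh gap is already exceeded, and raise an early alert if the
    observed cadence suggests the next refresh will miss the contract."""
    last = max(refresh_history)
    gap = (now - last).days
    if gap > max_gap_days:
        return "BREACH"                      # contract already defaulted
    # naive cadence estimate from the historical refresh gaps
    gaps = [(b - a).days for a, b in zip(refresh_history, refresh_history[1:])]
    typical = sum(gaps) / len(gaps) if gaps else max_gap_days
    if max(gap, typical) > max_gap_days - alert_margin_days:
        return "ALERT"                       # likely to breach; notify consumers now
    return "OK"

# Hypothetical yearly-refresh contract with a 30-day early-warning margin.
history = [datetime(2022, 1, 15), datetime(2023, 1, 20), datetime(2024, 2, 1)]
print(check_refresh_contract(history, max_gap_days=365, alert_margin_days=30,
                             now=datetime(2025, 1, 10)))   # ALERT
```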

The total cost for the domain owner must consider the intangible revenue and costs to the contracted consumers. The proper data-engineered solution must provide that dashboard to the financial accountants, who often do not know that the domain owner’s business data obligations impact consumers. The changes in a domain owner’s funding or viability have ripple effects on the business. Centralizing domain ownership into a core IT group is often the solution for orphaned domain owners’ data products. Over time, the data contract changes since a steward has replaced the domain ownership. Consumers need to see this in the changed contract when the architecture refreshes yearly and tech debt is re-assessed. Additional data debt remediation costs are incurred implicitly, which need to be added to the strategic plan’s rolling 5- (or 10-) year total cost estimate as the future state horizon moves out in time.

There are many practical data engineering effects on domain ownership of data in the data mesh. These need to be discussed and planned for, since what is a great idea on day one must meet the practicalities of the long-term run/managed solution. That solution is developed in the context of a fast-paced, dynamic business environment.

Principle #7 – A data mesh defines the data and derives insights as a product

A dataset (with cataloged metadata), its derived data, its quality analytics process steps, its transforms, its corrections/adjustments, and its various versions as it appears in the data pipeline’s zones are considered products in the data mesh. As a product, a dataset comes with all the implications of being a product as defined by Zhamak Dehghani in her work on the data mesh concept while at Starburst {https://packt-debp.link/77sMny}. What you also have to know is that any directly observed insights, as well as implicit or inferred insights, are also products. Imagine acting on a dataset’s insight today only to have it change tomorrow or next month because of the dataset’s restatement, a software bug correction fixed in reprocessing, or a break in a trend that had held for years. It could spell financial disaster for a consumer. Heads will roll if the impact is not assessed and the change provisions of the data contract are not communicated.

We can’t begin to tell you how much time was spent in the past explaining data changes and why a data change was made going forward or, worse yet, in the past. Products can be recalled and so can data. You must be ready for reality. Data and insights do not just get profitably produced but can be recalled and restated. You must handle the ugly parts of treating data as a product and not just build anticipation for the revenue that the thought evokes. There are real costs to maintaining data contracts, and some of these are practical corrective costs for restatements, parallel version maintenance, differential sets (for change data capture (CDC) log replays), communications of data inventories in stock, quality metrics, and change logs (dataset inventories that could even be snapshots of the data catalog at the time of distribution/consumption).

We caution the reader to look at all aspects of data being a product and not skimp on any capability that puts the data contract at risk or the dataset’s quality into an unknown state.

Principle #8 – A data mesh defines data, information, and insights to be self-service

We have observed that self-service data is a great concept, but rarely do analysts want to spend an enormous amount of time curating it from its raw state. Consider the bricklayer working with clay, sand, and cement rather than bricks and mortar when building a wall. It would be insane not to build with some degree of prebuilt material. Items must fit together seamlessly rather than having to be manufactured on demand, at the time of assembly, to become an insight. Data, therefore, must have edges or facets that align with other data. You can’t have a year defined as a calendar year when the data is aligned with hundreds of different corporate fiscal years. Retail calendars have to align for weekly, quarterly, and yearly data to be analyzed comparably. Datasets intended for self-service use need to be faceted, subject to contracts, and fully discoverable in the data catalog, with lineage explaining those contracts. Only trusted, clear data can be put out as self-service. Any difficult analysis that requires internal data facets to be exposed must be wrapped as second- and third-order data, which then becomes subject to self-service.

Once any complex data is too hard to explain, it can’t be made available as self-service! So, to solve this, you can provide open services and analytics, as code and scripts, that interpret the complex data. These analytic wrappers and pre-canned services help solve the complexity problem but can lead to data maintenance issues. Analytic code changes in Microsoft Power BI or a big data notebook can be made incorrectly, leading to unchecked data abuse. Financial information providers are acutely aware of data abuse issues in analytics. The financial numbers should tell a story, but even a small gap in a time series can lead to analytics errors.

For self-service data mesh goals to be truly effective, self-service data must be correct, complete, timely, high quality, and at all times aligned with the data contract. But what if the data has gaps or is missing key fields? Can it be augmented with synthetic data, and if so, how can that be good regarding maintaining the data contract?

It all depends on how the data contract is written. If data must be 100% complete and the raw data is not, does this practical and real lack of some data points hold up the assembly of the final product? The answer is, “No!” Can the contract enable the factory pipeline to produce data points that are fictional, just to maintain a trend or fill a gap, or to include points that are redundant? The answer is, “Yes!” The data’s use case is the important goal. The purity of the dataset as a whole has to meet the terms of the data contract as applied to the entire dataset, not just to individual data points in the set.
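
As a concrete sketch of this dataset-level view of a contract (plain Python; apply_completeness_contract, the synthetic flag, and the carry-forward fill rule are all hypothetical choices):

```python
def apply_completeness_contract(points, expected_keys, min_completeness):
    """Dataset-level contract check: the set as a whole must meet the
    completeness threshold; individual gaps may be filled with clearly
    flagged synthetic points rather than holding up the release."""
    observed = {p["key"]: p for p in points}
    completeness = len(observed) / len(expected_keys)
    if completeness < min_completeness:
        raise ValueError(f"contract breach: only {completeness:.0%} complete")
    released = []
    for key in expected_keys:
        if key in observed:
            released.append({**observed[key], "synthetic": False})
        else:
            # gap fill: carry the last observed value forward, flagged as synthetic
            prior = released[-1]["value"] if released else None
            released.append({"key": key, "value": prior, "synthetic": True})
    return released

# Hypothetical monthly series with one missing month and a 75% completeness contract.
expected = ["2024-01", "2024-02", "2024-03", "2024-04"]
points = [{"key": "2024-01", "value": 10}, {"key": "2024-02", "value": 12},
          {"key": "2024-04", "value": 15}]
print(apply_completeness_contract(points, expected, min_completeness=0.75))
```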

Principle #9 – A data mesh implements a federated governance processing system

Federated governance has an implication. There is a central governance function, and some governance can be delegated to others. It is a fine balance that must be maintained between the two major divisions of data governance: central versus distributed. The effect is an ability to change and adapt as data and contracts are brought online in the data mesh. The enterprise’s organization must be made ready to handle this model of governance. Also, the organization must promote the federated model with standards from the highest level (also known as the core architecture) and incentivize compliance.

Martin Fowler {https://packt-debp.link/o4dhRZ} points out that centralized golden datasets are no longer pertinent. You have to comply with the data mesh federated principles and established architecture while maintaining data contracts within the data mesh.

Principles drive how the following subject areas impact your designs:

  • Data quality standards
  • Data contracts
  • Security and entitlement
  • Audit and regulations
  • Modeling and cataloging features
  • Self-service and metadata governed
  • Metrics and measures captured
  • Code steps and transforms
  • Error detection and correction

These subject areas then drive the need for a federated governance model in order to keep your solution relevant in the future.

Principle #10 – Metadata is associated with datasets and is relevant to the business

Metadata defines the data and how it became the end-state dataset used by downstream consumers, whether that be internal users, automated processing steps, big data repositories, or business intelligence (BI)-tooled end-user analysts. Metadata facets are exposed in data catalogs and subject to time series organization to provide filtering capabilities of the data in the data mesh, or data fabric if hosted in the cloud. Metadata is important since it represents the data and can be used to certify data in the mesh as meeting the data-contracted requirements of the business. It establishes the trust required to enable the data itself to be self-serviced.

Capturing metadata and retaining auditable versions of it in the past enables past data to be verifiable and understood in a prior context. It also provides change traceability that adds to current data credibility.

Metadata provides for the creation of facets that may be discovered through dictionary search capabilities. Metadata lineage capabilities aid in the forensic analysis of data quality issues. Domain ownership of key datasets and any value-added business enrichment may be discovered from metadata lineage. Shortening the time to assess errors, find owners, and determine entitlement issues is an accelerator for the business. Exposing data facets created from metadata also enables you to implement facet intersection and, as such, obtain new insights. If you were given the goal to do more with less, you would want to leverage metadata.
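
A minimal sketch of facet intersection over a catalog, in plain Python (the catalog entries and facet names such as readiness and pii are purely illustrative):

```python
def facet_intersection(catalog, **required_facets):
    """Intersect metadata facets to narrow a catalog search, e.g. find released
    datasets owned by a given domain."""
    return [entry for entry in catalog
            if all(entry.get("facets", {}).get(k) == v
                   for k, v in required_facets.items())]

# Hypothetical catalog entries exposing facets derived from at-rest metadata.
catalog = [
    {"dataset": "orders_curated",
     "facets": {"domain": "sales", "readiness": "released", "pii": False}},
    {"dataset": "orders_raw",
     "facets": {"domain": "sales", "readiness": "raw", "pii": True}},
    {"dataset": "hr_headcount",
     "facets": {"domain": "hr", "readiness": "released", "pii": True}},
]
print(facet_intersection(catalog, domain="sales", readiness="released"))
```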

Principle #11 – Dataset lineage and at-rest metadata is subject to life cycle governance

Metadata, whether it is semantically definitive of the domain owner’s truth or reflective of the data journey as lineage, will need to be tracked as if it were a core dataset. This form of second-order data being put into change control and treated as a product makes sense since it feeds the searchable data catalog of the architecture. Life cycle governance is needed for all datasets, including the metadata defining pipeline curated datasets. A key addition is the linkage between the curated dataset and its metadata instance as part of the named branch noted as third-order metadata.

You can envision multiple instances of datasets with linked metadata within a pipeline, with the final instance being the change set that is allowed to be released and, as such, propagated to the downstream zones. What determines the acceptability of the releasable instance of data with its metadata is the quality metric. Governance rules check for those contract conditions to be met, and if not, direct action is taken to make them acceptable. These include reprocessing, gap filling, synthetic data creation, data removal, trend smoothing or fitting, factorization, and other statistical enrichment processes.
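
A minimal sketch of such a governance gate at a zone boundary, in plain Python (the metric names, thresholds, and mapping to corrective actions are hypothetical):

```python
def release_gate(quality_metrics, contract_thresholds):
    """Governance check at a zone boundary: compare measured quality metrics
    against the contract and either release the dataset instance or return the
    corrective actions to run before it may propagate downstream."""
    corrective_actions = {
        "missing_rate": "gap filling / synthetic data creation",
        "staleness_days": "reprocessing",
        "duplicate_rate": "data removal",
        "trend_deviation": "trend smoothing or fitting",
    }
    failures = []
    for metric, threshold in contract_thresholds.items():
        value = quality_metrics.get(metric)
        # lower is better for every metric in this toy example
        if value is None or value > threshold:
            failures.append((metric, corrective_actions.get(metric, "manual review")))
    return ("RELEASE", []) if not failures else ("HOLD", failures)

# Hypothetical metrics and contract thresholds.
metrics = {"missing_rate": 0.02, "staleness_days": 9, "duplicate_rate": 0.001}
thresholds = {"missing_rate": 0.05, "staleness_days": 7, "duplicate_rate": 0.01}
print(release_gate(metrics, thresholds))  # ('HOLD', [('staleness_days', 'reprocessing')])
```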

Principle #12 – Datasets and metadata require cataloging and discovery services

The data catalog is not a static service. It is a searchable, up-to-date reflection of the state of data within the data mesh. It maintains the data facets that make the data self-serviceable and the contracts that make it trustworthy.

Discovery and visualization will be key to the data mesh since over time, data loses pertinence. A consumer will want to dial in data at various levels. Often, data must be timely, or it needs to be up to date and correct for some other fixed point in time desired by the analyst. The ability to dial in the zoom when looking for insights must be a key self-service capability of the discovery tool of the data engineering solution.
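
One way to picture the “dial in” capability is an as-of lookup against a versioned catalog, sketched here in plain Python (as_of and the version entries are hypothetical):

```python
from datetime import date

def as_of(catalog_versions, dataset, point_in_time):
    """'Dial in' a dataset version: return the latest catalog entry for the
    dataset whose effective date is on or before the analyst's chosen point in
    time, rather than only the most recent version."""
    candidates = [v for v in catalog_versions
                  if v["dataset"] == dataset and v["effective"] <= point_in_time]
    return max(candidates, key=lambda v: v["effective"], default=None)

# Hypothetical versioned catalog entries.
versions = [
    {"dataset": "fx_rates", "effective": date(2024, 1, 1), "version": "v7"},
    {"dataset": "fx_rates", "effective": date(2024, 6, 1), "version": "v8"},
    {"dataset": "fx_rates", "effective": date(2025, 1, 1), "version": "v9"},
]
print(as_of(versions, "fx_rates", date(2024, 9, 30)))   # picks v8
```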

Principle #13 – Semantic metadata guarantees correct business understanding at all stages in the data journey

It’s all about the semantics when bringing data together for analysis. Then, it’s all about the quality of the combined dataset before an insight’s hypothesis can be assessed for truth. You know that you can’t combine apples with oranges and get a valid resulting fruit. They don’t cross-pollinate, although you can do this with a plumcot, which is a plum-apricot hybrid. The details of what can and can’t be combined are in the dataset’s semantic at-rest metadata. Preserving the domain owner’s understanding of the curated dataset enables the fair use, entitlements, and rights of the consumer.

Clear semantics may be preserved by creating a model in an OWL 2/RDF knowledge graph that supports forward and backward inferencing. As an alternative, you may create a labeled property graph (LPG) knowledge base, but this is inherently unsuitable for reasoning purposes. For advanced knowledge graph capabilities in an LPG, you need a lot of code to glue the semantics together, and direct inferencing capabilities may have to be given up. A knowledge graph with instance data forms a knowledge base. With a formal knowledge base, you have a very strong and enforceable representation of the domain owner’s data in the mesh.
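
To show the kind of inference an RDF/OWL-backed knowledge base gives you, here is a minimal, toy forward-chaining sketch in plain Python (the class names and the two rules are illustrative; a real store would use a reasoner rather than hand-rolled rules):

```python
# Triples are (subject, predicate, object); a tiny forward-chaining pass over
# subClassOf and type illustrates inference that an OWL 2/RDF store provides
# natively and an LPG typically requires glue code to approximate.
triples = {
    ("RetailCalendarWeek", "subClassOf", "CalendarPeriod"),
    ("CalendarPeriod", "subClassOf", "TemporalEntity"),
    ("week_2024_W40", "type", "RetailCalendarWeek"),
}

def infer(triples):
    inferred = set(triples)
    changed = True
    while changed:                      # keep applying rules until a fixed point
        changed = False
        new = set()
        for (a, p1, b) in inferred:
            for (c, p2, d) in inferred:
                # rule 1: subClassOf is transitive
                if p1 == p2 == "subClassOf" and b == c:
                    new.add((a, "subClassOf", d))
                # rule 2: instances inherit the types of their superclasses
                if p1 == "type" and p2 == "subClassOf" and b == c:
                    new.add((a, "type", d))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

print(sorted(infer(triples) - triples))
# the instance is now also a CalendarPeriod and a TemporalEntity
```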

Principle #14 – Data big rock architecture choices (time series, correction processing, security, privacy, and so on) are to be handled in the design early

With this principle, you are being asked to handle architecture choices early and set up a framework for data engineering design success:

  • Time series data
  • Data corrections
  • Data entitlement, data rights, and data privacy

As stated earlier, data is time series sensitive. This was noted as a key issue requiring a solution in a data mesh. Often, data loses value based on age, and this affects how it is to be represented over time. In the simple case, it can be partitioned in a relational system and rolled off after the audit period expires, but that removes it from use unless it is rolled up as summary information when aged out. How this affects its metadata lineage is also important. Data and its associated metadata have to be rolled off (or partitioned off) together when the primary data ages out. Modern knowledge graphs often lack the partitioning capability that makes this possible and, as such, must be cloned or custom-copied into special archival semantic models created to support a compressed form of historical data.

Likewise, current data corrections, historical restatements, and their effect on downstream data pipeline consumers must be clear in the design of the data mesh. This can’t be left as an afterthought without the data mesh losing the trust it was built to preserve. We’ve built data systems that preserve the existing data if the assessed change is less than 10% but create a new set if the change exceeds 10% deviation, at which point a correction becomes a restatement. That percentage should be a dialed-in number set per the domain owner’s contract, not a fixed value, but once set, it should not change for a given contract. A correction is a small change; a restatement requires full downstream reprocessing.
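
A minimal sketch of that correction-versus-restatement decision, in plain Python (classify_change, the mean-deviation measure, and the 10% default are illustrative of the dialed-in contract value):

```python
def classify_change(previous, corrected, restatement_threshold=0.10):
    """Classify a data change per the domain owner's contract: small deviations
    are corrections layered on the existing set; deviations above the dialed-in
    threshold create a new, restated set and trigger downstream reprocessing."""
    if not previous:
        return "restatement"
    deviations = [abs(c - p) / abs(p) for p, c in zip(previous, corrected) if p]
    mean_deviation = sum(deviations) / len(deviations)
    return "correction" if mean_deviation < restatement_threshold else "restatement"

# Hypothetical example: a ~3% average change stays a correction; ~26% restates.
print(classify_change([100, 200, 300], [103, 206, 306]))   # correction
print(classify_change([100, 200, 300], [125, 260, 370]))   # restatement
```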

Entitlement to view the existence of a dataset, to view the contents of a dataset, and to reflect on (inspect) the dataset’s metadata must be separate types of entitlement. Additionally, you will want to add enhanced entitlements for fair use (how often data can be accessed), redistribution (the ability to send data to another party), value add (to enrich data), and resale (to redistribute at a profit). Basically, any constraint you can use to protect a dataset can be an entitlement. Once you protect data with entitlements, you want to define the type of entitlement structure in your design. This is either via roles (role-based access control (RBAC)) or attributes (attribute-based access control (ABAC)), but you must deal with the legacy group/ID-based IAM security of third-party products and convoluted cloud security services. These tools and services implement the latest security system du jour and do not integrate well with RBAC/ABAC, and you still have an OKR to implement a zero-trust secure system. Our suggestion is to generate the security architecture and keep to it, even when faced with many integration obstacles.
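
A minimal sketch of these separate entitlement types behind an ABAC-style check, in plain Python (the entitlement names, Grant structure, and attributes are hypothetical):

```python
from dataclasses import dataclass

# Separate entitlement types: knowing a dataset exists, reading its contents,
# and inspecting its metadata are distinct grants, with enhanced entitlements
# (fair use, redistribution, value add, resale) layered on top.
ENTITLEMENT_TYPES = {"discover", "read", "inspect_metadata",
                     "fair_use", "redistribute", "value_add", "resell"}

@dataclass
class Grant:
    subject_attrs: dict      # ABAC: attributes the caller must present
    dataset_id: str
    entitlements: frozenset

def is_entitled(caller_attrs, dataset_id, action, grants):
    """ABAC-style check: the caller's attributes must match every attribute
    required by a grant covering this dataset and action."""
    if action not in ENTITLEMENT_TYPES:
        raise ValueError(f"unknown entitlement type: {action}")
    return any(
        g.dataset_id == dataset_id
        and action in g.entitlements
        and all(caller_attrs.get(k) == v for k, v in g.subject_attrs.items())
        for g in grants
    )

# Hypothetical grant: EU analysts may discover and read, but not redistribute.
grants = [Grant({"department": "analytics", "region": "EU"},
                "orders_curated", frozenset({"discover", "read"}))]
caller = {"department": "analytics", "region": "EU"}
print(is_entitled(caller, "orders_curated", "read", grants))          # True
print(is_entitled(caller, "orders_curated", "redistribute", grants))  # False
```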

Principle #15 – Implement foundational capabilities in the architecture framework first

For this principle, the foundational components include zero-trust data security, test-first design, data profiling, metadata design, auditing, and machine learning-based anomaly detection.

The architecture of a future-proof, modern data-engineered solution will need to address a number of key areas identified in the principles discussed throughout this chapter. These are the following:

  • Security
  • Test first design (TFD)
  • Data profiling
  • Metadata design
  • Audit
  • Machine learning (ML) based anomaly detection

Do not dive into the details of the data mesh implementation without scoping out necessary logical components first because you will not be able to fit them into the solution later if you do not handle them here.
