Logical architecture overview

Developing the logical architecture raises important integration considerations and risks concerning the technologies selected, the implementation processes, and the integration patterns needed. The best practices for addressing them will be identified as key process methods for the engineering designs that follow. A logical reference architecture will be presented, focused on the current state of the art for implementing the previously introduced concepts in the creation of data pipelines.

As with the question posed in the prior chapter, you may ask, What’s in the logical architecture?

The answer is the level of communication with the engineers who implement features aligned with the capabilities presented in the conceptual architecture.

In feature-based development, if features are to be turned on and off, released in a structured way along with their data, and monitored against service levels, they should be implemented end to end across the various data pipelines of the architecture. These observable features are experienced by the data consumer and are supported by end-to-end threads of activity: dataflows that run from raw data through to final data publication. All of this is orchestrated as the data pipeline (or as runnable instances of generic data pipelines). The logical components depicted in the logical architecture make this flow clear.

Note

The flow of data is critical to organizing the activity of the various software components and the lineage-tracked data states as data is curated through dataflows!

The data engineer is an important participant in the development of the logical architecture, representing the principles agreed upon when the high-level capabilities were created. These principles are made effective through features defined in the logical component definitions of the data factory. The platform concepts depicted in the conceptual architecture are mapped to various software and data components/services defined with component features. As the glossary was to capabilities, so the logical components are to their features. You will see the word feature used many times in this chapter and in its notes.
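
To make this concrete, here is a minimal sketch in Python of how a logical component and its toggleable, monitored features might be declared. The component, capability, and feature names are hypothetical illustrations, not part of the reference solution:

# Minimal sketch only: logical components carrying feature definitions.
# Component, capability, and feature names are hypothetical examples.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentFeature:
    name: str             # the feature as experienced by the data consumer
    enabled: bool         # supports turning the feature on/off per structured release
    sla_monitored: bool   # subject to service-level monitoring end to end

@dataclass
class LogicalComponent:
    name: str
    capability: str       # the conceptual-architecture capability it realizes
    features: List[ComponentFeature] = field(default_factory=list)

ingest = LogicalComponent(
    name="ingest-service",
    capability="data-acquisition",
    features=[
        ComponentFeature("late-arrival-handling", enabled=True, sla_monitored=True),
        ComponentFeature("schema-drift-detection", enabled=False, sla_monitored=True),
    ],
)

# A feature counts as released only when it is enabled across every pipeline stage it touches.
released = [f.name for f in ingest.features if f.enabled]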

The processing of logical components defines the dataflows of the factory, but the details will have to wait until the physical architecture is strawmanned and then completed when the solution is made generally available. Some of the user-facing features will depend on repeatable patterns that flesh out blueprints presented at the conceptual level. A picture of the logical architecture should be forming in your mind now: a diagram that reads from left to right, as if data were flowing through a set of processes. This is what will be elaborated upon in this chapter.

In the flow, you can think of data going through zones, similar to how a boat traverses locks on a canal. These zone transitions are prominent in the logical layer that we are presenting as a best practice. The data zones needed to clearly segment raw data from consumable data are identified through the stages of the data factory.

A single-page representative diagram best illustrates this logical architecture level of the data engineering solution, similar to the one-page definition of the conceptual architecture. From this, you have to dive into the details and peel the onion to get to the next layer, the physical architecture, for a complete understanding of the current-state solution. Therefore, all layers should be kept up to date and in sync with each other at all times for ease of communication as changes (inevitably) are made over time.

The goals of this chapter are the following:

  • Define the contents of the logical architecture and illustrate them.
  • Provide best practices for that logical illustration and how it can be rendered to enable effective stakeholder communications.
  • Provide best practices for future-proofing the logical architecture with an emphasis on the features of a data factory, component tool selections, and cautions regarding anti-patterns detracting from best practices.

These three goals align with the elaboration, in subsequent chapters, of the physical architecture of a reference solution.

Organizing best practices

You will observe that we have adopted the following structure for the logical layer:

  • Best practice composition of the logical architecture.
  • What is the structure of a logical architecture (through an example)?
  • What capabilities and logical features are needed in the logical architecture?
  • Learning about features needed in the data processing zones and how datasets should be propagated between zones, with a special focus on the types of use cases each data type will be subjected to:
    • Bronze data is used for data profiling, data normalization, and data cleansing use cases for big-data-lake processing.
    • Silver data is used for data pipeline processing or for big-data-lake online transaction processing (OLTP) and some online analytical processing (OLAP) operations.
    • Gold data is used for enterprise-grade mastered data and for OLTP and OLAP operations.
  • How to implement data processing zones (see the code sketch after this list):
    • The raw zone is used for data ingestion to the data factory. In this zone, raw data arrives and is normalized. Note that this often leads to a data lake implementation.
    • Transformation (bronze, silver, and gold) zones are used for data warehouse and data lake cloud storage. Note that this often leads to a delta lake or cloud data warehouse implementation.
    • The consumption zone is used for high-speed access and data analysis where consumer use cases and dataset deliveries take place. Note that this often leads to a business intelligence (BI) in-memory or data mart OLAP implementation.
  • Learning best practices regarding how to tie the conceptual architecture to the logical architecture.
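
As a rough sketch of how datasets might move through these zones on a Spark/Delta stack, consider the following Python example. The paths, table names, columns, and cleansing rules are placeholders chosen for illustration, and the code assumes a cluster with Delta Lake available:

# Sketch only: propagating data from the raw zone through bronze, silver, and gold.
# Paths, table names, and columns are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zone-propagation").getOrCreate()

# Raw zone -> bronze: land raw files largely as-is, normalized and minimally typed.
bronze = spark.read.json("/landing/bookings/")
bronze.write.format("delta").mode("append").saveAsTable("bronze.bookings")

# Bronze -> silver: data profiling, normalization, and cleansing use cases.
silver = (
    spark.table("bronze.bookings")
    .filter(F.col("booking_id").isNotNull())
    .dropDuplicates(["booking_id"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.bookings")

# Silver -> gold: enterprise-grade, consumption-ready datasets for OLTP/OLAP use.
gold = (
    spark.table("silver.bookings")
    .groupBy("route", "flight_date")
    .agg(F.sum("fare").alias("total_revenue"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_route_revenue")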

With this structure, a data engineer will be able to communicate effectively with those at the highest levels of an organization without getting too deep into the details, yet still be able to drill into them if challenged. Being able to relate back to the high-level concepts for justification, and down to the low-level details, is an essential part of the logical layer of the architecture: it ties the physical implementation together with the concepts driving it.

How does the logical architecture align with the conceptual and physical architecture?

Development teams may only build and then deploy a solution once the architecture is defined to an acceptable level, that is, when the logical and physical architectures have been proven effective. Why? Because sometimes they are not! Even the best of plans sometimes go awry, and changes are required. Technology can fail to work as expected or can have integration anomalies, among other issues. Keeping the focus on what's important will guide you in deciding when a proof of technology is needed before planning to implement that technology.

Special attention will be given in this chapter to dataflows that implement the various data pipelines of the data factory. Data contracts and service-level agreements (SLAs) are important features when formulating this architecture layer. These non-functional but very important features of a solution can't be left until the end: they form a framework for your logic that will enable your team to develop robust solutions. How these features will support the capabilities of the architecture has to be made crystal clear. They will be subject to enterprise development processes such as release management, change management, entitlement, service management, and data contract management, because they support the data products being curated.
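
As a hedged illustration of what such a data contract might capture, here is a small Python sketch; the dataset name, fields, and thresholds are invented for the example, not prescribed by the reference architecture:

# Sketch only: a data contract carried alongside a published dataset.
# Field names and thresholds are illustrative, not prescriptive.
data_contract = {
    "dataset": "gold.daily_route_revenue",
    "owner": "data-platform-team",
    "schema": {"route": "string", "flight_date": "date", "total_revenue": "decimal(18,2)"},
    "sla": {
        "freshness_minutes": 60,    # data must land within an hour of source events
        "availability_pct": 99.5,   # monitored as a service-level objective
    },
    "change_management": "breaking schema changes require a new major version",
}

def meets_freshness_sla(minutes_since_last_load: int, contract: dict) -> bool:
    """A check a pipeline monitor might run after each load against the contract."""
    return minutes_since_last_load <= contract["sla"]["freshness_minutes"]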

The details are going to be important for your developers, but note that the fully configured dataflows will appear only in the physical architecture. Detailed choices for third-party products and their integrations must be solid in the design and specified at this logical level. Selections have to be free of scale, performance, reliability, and integration errors, so expect many proofs of concept (POCs); these are required to vet unacceptable risks (and there will be a lot of risk). The logical architecture will remain refinable as a work in progress until the physical architecture is complete; however, it is not as flexible, since much subsequent design work depends on it being correct.

You are sometimes aware that you do not know how something works but can get answers; other times, you are not aware that you do not know how something really works and are clueless. There are also times when the integration just fails to live up to the hype, and workarounds are desperately needed to save the architecture from failure. Lastly, capabilities and their associated features may even be removed from the minimum viable product (MVP) if POCs fail to resolve an identified risk. That being said, you have to begin at the beginning.

Let’s look at an example of a logical architecture in real life that illustrates many of the concepts we wish to highlight as essential in a future-proof data engineering solution. After all, it was Albert Einstein who said: "The only source of knowledge is experience." And we do want to learn from those who experienced the pains of putting together a great logical architecture.

Case study – accelerating innovation at JetBlue using Databricks

Data and artificial intelligence (AI) technology are critical to real-time, proactive decision-making, and relying on legacy data architecture platforms will constrain future business outcomes. In the past, JetBlue's data was served through a multi-cloud data warehouse, which resulted in limited flexibility for advanced designs, latency challenges, and poor cost scalability {https://packt-debp.link/VTQXrf}. Some of the issues the new logical architecture at JetBlue addressed are the following:

  • High latency of over 10 minutes cost the organization millions of US dollars.
  • A complex architecture with multiple stages across multiple platforms, involving a lot of bulk data movement, was inefficient for real-time streaming.
  • The platform’s high total cost of ownership (TCO) was evident: many vendor data platforms, each with high operating costs, were needed to manage the dataflows.
  • Scaling up the legacy data architecture when trying to process exabytes of data generated by many flights was just not possible.

JetBlue adapted Databricks’s medallion architecture {https://packt-debp.link/S21L4y} to their needs. But what does this medallion architecture pattern/blueprint define? In essence, it is a blueprint used to logically organize data in a lakehouse (as pointed out by Databricks), with the goal of incrementally (or progressively) improving the structure and quality of data as it flows through each layer of the architecture (from bronze to silver to gold data). Medallion architectures are sometimes also referred to as multi-hop architectures, among other names. This approach is clearly in line with the best practices set forth in this book.

Multi-hop designs are pragmatic! They cover the need for a broad, formalized process similar to what is diagrammed in Figure 6.1, but with more rigid zones. It is worth studying many architectures to learn from their successes as well as from minor failures that later turned into successes through agile development. Refer to the following diagram and follow the components, observations, and especially the features noted afterward:

Figure 6.1 – JetBlue’s data analytics and machine learning (ML) medallion logical architecture {https://packt-debp.link/9eKqW2}

Looking at the inventory of components in the logical dataflow, we see the following list (Note: We have omitted the monitor (at the right side of the diagram) and support features (at the bottom) since they are mainly related to the software processing rather than the data engineering features of the system):

  • Raw (raw zone data):
    • Numerous input sources of raw data are consumed, such as Weather, NASA (Airspace), Sabre (Bookings), Air Traffic Control (ATC), AirBus (Aircraft), Crew, Adobe (Customer Web Analytics), Network, SAP (Transactions), and Oracle (Parts & Inventory).
  • Ingest (bronze zone data):
    • Ingest of normalized, but still relatively raw data is provided via Azure Service Bus, REST API (connectors), Azure Data Factory (ADF), Fivetran (Connectors), DBT Cloud, and MCDW (data warehouse).
    • Store data will retain all transformed bronze data via Azure Data Lake Storage (ADLS) Gen 2, Databricks Autoloader, and Azure Cosmos DB.
  • Process (silver zone data):
    • Notebooks, pipelines, and jobs are used to internally process data via Databricks Spark with notebooks, Delta Live Table (DLT) pipelines, and Databricks jobs (see the DLT-style sketch after this list).
    • Forecasting, digital twins, and predictive maintenance are services provided by MLflow, custom software, and so on.
    • Interactive queries and real-time dashboards are big data services from Azure Databricks SQL Analytics.
  • Process (gold zone data):
    • Metadata governance and security features are provided via Databricks Unity Catalog.
    • ACID transactions, relational schemas, auto-indexing, and partitioning are usually data warehouse features or delta lake features provided by Azure Delta Lake.
    • The feature store for real-time search, analytics, and semantics layer is a very fast scalable data cache, in this case provided via Rockset’s in-memory SaaS DB.
  • Service (consumption zone data):
    • Transformations are provided by AutoML, AutoDeploy, and feature store integration from BlueML custom applications.
    • Transformations are additionally effected through models that are operationally contracted via MLflow’s Model Registry, serving, and versioning.
    • Analytics applications are created for many business domains through customized tooling: custom enterprise applications (such as demand forecasting, purchase funnel recapture, recommendation engine, digital twins, anomaly detection, and so on).
    • Dashboards for the consumer, DataOps, DevOps, and Ops are provided via custom enterprise reporting (such as scheduled extract dashboards, real-time dashboards, and analyst visualizations).
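
To give a feel for the Process stage noted above, here is a short Delta Live Tables (DLT)-style sketch in Python of a bronze-to-silver hop with a data quality expectation. The table names, landing path, and quality rule are assumptions made for illustration, and the code runs only inside a Databricks DLT pipeline, where the spark session is provided:

# Sketch only: a DLT-style bronze-to-silver hop; runs inside a Databricks DLT pipeline.
# Table names, the landing path, and the quality rule are illustrative assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: bookings landed from the ingest stage, minimally processed")
def bronze_bookings():
    # Auto Loader (cloudFiles) incrementally picks up new files from the landing area.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/bookings/")
    )

@dlt.table(comment="Silver: cleansed bookings ready for downstream pipeline processing")
@dlt.expect_or_drop("valid_booking_id", "booking_id IS NOT NULL")
def silver_bookings():
    return dlt.read_stream("bronze_bookings").withColumn("ingested_at", F.current_timestamp())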

A primary takeaway from the preceding real-world example is that the Process stage appears to cause silver and gold datasets to undergo data munging {https://packt-debp.link/X6bpE6} in order to curate consumable datasets from the relatively raw bronze datasets, so that applications and consumers can analyze the data. Data factory best practices should be applied so that datasets of different classes never mix with each other. Without clear separation in the Process stage, complications can arise and cause two classes of data to be commingled during analytics, leading to incorrect results.

This does not take away from the excellent solution designed by the architects at JetBlue IT and depicted here. It was built to serve their business needs, and they did not have to treat data, or its downstream enriched data, as a data product, as per the data mesh principle we have adopted for our architecture, in which data is considered a product. In their logical architecture, we see that the choices made are not theoretical but grounded in business reality; practical choices had to be made. We also see that, in the JetBlue logical architecture, features are often identified under the component that delivers them. This happens when the tool itself is the center of attention rather than the feature set of the tool.

How those features are going to be brought into the dataflow and acted upon by third-party tools, technologies, or Azure cloud services is less clear. You have to specify just what is needed to get the message across to the engineers and the owners/stakeholders of the solution, so you have to be willing to compromise on the level of detail evident in the logical architecture. Some architects use enterprise architecture modeling tools at this point in the architecture design process.

Using one of these tools helps provide the drill-down mappings that make the physical architecture definition accurate and easy to maintain as changes are made. Such tools also take the focus away from vendor products and help manage the level of detail needed for bigger solutions without cluttering up the high-level, single-page logical diagram. If you have a budget for it, you will find that funding spent on a tool is money well spent. Be aware, though, that somebody has to be designated as the trainer and enforcer of the architecture modeling processes, which can be onerous with some of these tools.

As you probably already know, architects can get enamored with their illustrations and especially with their tools, so get everybody on the same page before the selection of the tool becomes a battle not worth fighting! Focus on capabilities; we’ll elaborate on this in the next section.
