Data Engineering Best Practices: Architect robust and cost-effective data solutions in the cloud era
By Richard J. Schiller and David Larochelle

Overview of the Business Problem Statement

We begin with the task of defining the business problem statement.

“Businesses are faced with an ever-changing technological landscape. Competition requires one to innovate at scale to remain relevant; this causes a constant implementation stream of total cost of ownership (TCO) budget allocations for refactoring and re-envisioning during what would normally be a run/manage phase of a system’s lifespan.”

This rapid rate of change means the goalposts are constantly moving. “Are we there yet?” is a question I heard constantly from my kids when traveling. It came from not knowing where we were, having no idea of the effort required to reach our destination, and riding with a driver (me) who had never driven to that destination before. Thank goodness for Garmin (automobile navigation systems) and Google Maps, rather than the outdated paper maps of the past. See how technology has even impacted that metaphor? Garmin is being displaced by Google for mapping use cases – not always because it is better, but because it is free (if you are willing to be subjected to data collection and advertising interruptions) and it is hosted on everyone’s smart device.

Now, I can tell my grandkids that in exactly 1 hour and 29 minutes, they will walk into their home after spending the weekend with their grandparents. The blank stare I get in response tells it all. Mapped data, rendered with real-time technology, has changed us completely.

Technological change can appear revolutionary while it is occurring, but looking back over time, the progression appears to be an obvious, even evolutionary, series of events that we take for granted. That is what is happening today with data, information, knowledge, and analytical data stores in the cloud. The term DataOps was popularized by Andy Palmer, co-founder and CEO of Tamr {https://packt-debp.link/MGj4EU}, and the data management and analytics world has referenced it often. In 2015, Palmer stated that DataOps is not just a buzzword, but a critical approach to managing data in today’s complex, data-driven world.

I believe that it’s time for data engineers and data scientists to embrace a similar (to DevOps) new discipline – let’s call it DataOps – that at its core addresses the needs of data professionals on the modern internet and inside the modern enterprise. (Andy Palmer {https://packt-debp.link/ihlztK})

In Figure 1.1, observe how data quality, integration, engineering, and security are tied together with a solid DataOps practice:

Figure 1.1 – DataOps in the enterprise

The goal of this chapter is to set up the foundation for understanding why the best practices presented in this book are structured as they are. This foundation will provide a firm footing, making the framework you adopt in your everyday engineering tasks more secure and well-grounded. There are many ways to look at solutions to data engineering challenges, and each vendor, engineering school, and cloud provider will have its own spin on the formula for success. That success will ultimately depend on what you can get working today and keep working in the future. A unique balance of various forces will need to be obtained, and this balance is easily upset if the foundation is not correct.

As a reader, you will have naturally formed biases toward certain engineering challenges. These can force you into niche (or single-minded) focus directions – for example, a fixation on robust, highly available multi-region operations with a de-emphasized pipeline software development effort. As a result, you may overbuild robustness and underdevelop key features. Likewise, you can focus on hyper-agile streaming of development changes into production at the cost of consumer data quality. More generally, there is a significant risk of just doing IT and losing focus on why we need to carefully structure the processing of data in a modern information processing system. You must not neglect the need to capture data with its semantic context, keeping it true and relevant, rather than leaving the software system as the sole interpreter of the data. This freedom makes data plus context equal to information that is fit for purpose, now and in the future.

We can begin with the business problem statement.

What is the business problem statement?

Data engineering approaches are rapidly morphing today. They will coalesce into a systemic, consistent whole. At the core of this transformation is the realization that data is information that needs to represent facts and truths along with the rationalization that created those facts and truths over time. There must not be any false facts in future information systems. That term may strike you as odd. Can a fact be false? This question may be a bit provocative. But haven’t we often built IT systems to determine just that?

We process data in software systems that preserve business context and meaning but force the data to be served only through those systems. The data does not stand alone, and if consumed out of context, it leads to these false facts propagating into the business environment. Data can’t stand alone today; it must be transformed by information processing systems, which have technical limitations. Pragmatic programmers’ {https://packt-debp.link/zS3jWY} imperfect tools and technology will produce imperfect solutions. Nevertheless, the engineer is still tasked with removing as many false facts as possible, if not all of them, when producing a solution. That has been elusive in the past.

We often take shortcuts. We also justify these shortcuts with statements like: “there simply is not enough time!” or “there’s no way we can get all that data!” The business “can’t afford to curate it correctly,” or lastly “there’s no funding for boiling the ocean.” We do not need to boil the ocean.

What we are going to think about is how we are going to turn that ocean directly into steam! This should be our response, not a rollover! This rethinking mindset is exactly what is needed as we engineer solutions that will be future-proof. What is hard is still possible if we rethink the problem fully. To turn that metaphor around – we will use data as the new fuel for the engine of innovation.

Fun fact

In 2006, mathematician Clive Humby coined the phrase “data is the new oil” {https://packt-debp.link/SiG2rL}.

Data systems must become self-healing of false facts to enable them to be knowledge-complete. After all, what is a true fact? Is it not just a hypothesis backed up by evidence until such time that future observations disprove a prior truth? Likewise, organizing information into knowledge requires not just capturing semantics, context, and time series relevance but also the asserted reason for a fact being represented as information truth within a dataset. This is what knowledge defines: truth! However, it needs correct representation.

Note

The truth of a knowledge base is composed of facts supported by assertions that withstand the test of time and that do not hide the information context from which that truth is derived.

But sometimes, when we do not have enough information, we guess. This guessing is based on intuition and prior experience with similar patterns of interconnected information from related domains. We humans can be very wrong with our guesses. But strongly intuited guesses can lead to great leaps in innovation which can later be backfilled with empirically collected data.

Until then, we often stretch the truth to span gaps in knowledge. Information relationship patterns need to be retained, as do the hypotheses that record our educated guesses. In this manner, data truths can be guessed – and guessed well! These guesses can even be unwound when proven wrong. It is essential that data is organized in a new way to support intelligence. Reasoning is needed to support or refute hypotheses, and the retention of information as knowledge to form truth is essential. If we don’t address organizing big data to form knowledge and truth within a framework consumable by the business, we are just wasting cycles and funding on cloud providers.

This book focuses on best practices, but there are a couple of poor practices that need to be highlighted first. These form anti-patterns that have crept into the data engineer’s tool bag over time and hinder the mission we seek to accomplish. Let’s look into these anti-patterns next.

Anti-patterns to avoid

What are anti-patterns? Architectural patterns form blueprints for ease of implementation. Just as a civil architect uses blueprints to definitively communicate expectations to the engineers constructing a building, a recurring, successful solution is reused as a pattern – like the framing of a wall or a truss for a type of roofline. An anti-pattern, by contrast, is a pattern to be avoided – like putting plumbing in an outside wall in a cold climate, where freezing temperatures could burst the pipes.

The first anti-pattern deals with retaining stuff as data that we think is valuable but that can no longer be understood or processed, given how it was stored; its contextual meaning was lost because it was never captured when the data was first retained in storage (such as cloud storage).

The second anti-pattern involves not knowing the business owner’s meaning for column-formatted data, nor how those columns relate to each other to form business meaning, because this meaning was only preserved in the software solution, not in the data itself. We rely on entity relationship diagrams (ERDs) that are not worth the paper they were printed on to gain some degree of clarity – clarity that is lost the next time an agile developer fails to update them. Knowing what we must avoid as we develop a future-proof, data-engineered solution will help set the foundation for this book.

To better understand the two anti-patterns just introduced, the following specific examples illustrate what to avoid.

Anti-pattern #1 – Must we retain garbage?

As an example of what not to do: in the past, I examined a system that retained years of data, only to be reminded that the data was useless after three months. This was because the processing code that created the data had changed hundreds of times in prior years and continued to evolve without those changes being noted in the dataset it produced. The assumptions baked into those non-mastered datasets were not preserved in the data framework. Keeping that data around was a red herring, just waiting for some future big data analyst to try to reuse it. When I asked, “Why was it even retained?” I was told it had to be, according to company policy. We are often faced with someone who thinks piles of stuff are valuable, even if they’re not processable. Some data can be the opposite of valuable – it can be a business liability if reused incorrectly. Waterfall-gathered business requirements or even loads of agile development stories will not solve this problem without a solid data framework for data semantics, as well as data lineage for the data’s journey from information to knowledge. Without this smart data framework, the insights gathered will be wrong!

Anti-pattern #2 – What does that column mean?

Likewise, as another not-to-do example, I once built an elaborate, colorful graphical rendering of web consumer usage across several published articles. It was truly a work of art, though I say so myself. The insight clearly illustrated that some users were just not engaging with a few key classes of information that were expensive to curate. However, it was a work of pure fiction and had to be scrapped! This was because I had misused one key dataset column that was, in fact, loaded with the inverted rank of users’ access rather than an actual usage value.

During the development of the data processing system, the prior developers produced no metadata catalog, no data architecture documentation, and no self-serve textual definition of the columns. All that information was retained in the mind of one self-serving data analyst. The analyst was holding the business data hostage and pocketing huge compensation for generating insights that only that individual could produce. Any attempt to dethrone this individual was met with one key and powerful consumer of the insights overruling IT management. As a result, the implementation of desperately needed, governance-mandated enterprise standards for analytics was stopped. Using the data in such an environment was a walk through a technical minefield.

Organizations must avoid this scenario at all costs. It is a data-siloed, poor-practice anti-pattern. It arises when individuals seek to preserve a niche position or a siloed business agenda. In the case just illustrated, the anti-pattern killed the use of the governance-mandated enterprise standard for analytics. The problem can be prevented by properly implementing governance in a data framework where data becomes self-explanatory.

Let’s consider a real-world scenario that illustrates both of these anti-patterns. A large e-commerce company has many years of customer purchase data that includes a field called customer_value. Originally, this field was calculated as the total amount the customer spent, but its meaning has changed repeatedly over the years without updates to the supporting documentation. After a few years, it was calculated as total_spending – total_returns. Later, it became a predicted_lifetime_value based on a machine learning (ML) model. When a new data scientist joins the company and uses the field to segment customers for a marketing campaign, the results are disastrous! High-value customers from early years are undervalued, while new customers are overvalued! This example illustrates how retaining data without proper context (Anti-pattern #1) and a lack of clear documentation for data fields (Anti-pattern #2) can lead to significant mistakes.
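As a minimal sketch of how the customer_value drift could have been caught (our own illustration; the field versions, formulas, and dates below are hypothetical), versioned semantic definitions can be attached to the column so that rows produced under different definitions are never silently mixed:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class FieldDefinition:
    """One semantic definition of a column, valid for a date range."""
    version: int
    formula: str                 # human-readable meaning, kept with the data
    valid_from: date
    valid_to: Optional[date]     # None = still current

# Versioned history of what customer_value actually meant over time (hypothetical dates)
CUSTOMER_VALUE_DEFS = [
    FieldDefinition(1, "total_spending", date(2015, 1, 1), date(2018, 6, 30)),
    FieldDefinition(2, "total_spending - total_returns", date(2018, 7, 1), date(2021, 12, 31)),
    FieldDefinition(3, "predicted_lifetime_value (ML model)", date(2022, 1, 1), None),
]

def definition_for(as_of: date) -> FieldDefinition:
    """Return the semantic definition in force when a row was produced."""
    for d in CUSTOMER_VALUE_DEFS:
        if d.valid_from <= as_of and (d.valid_to is None or as_of <= d.valid_to):
            return d
    raise ValueError(f"No definition of customer_value covers {as_of}")

def check_comparable(row_dates: list) -> None:
    """Refuse to mix rows whose customer_value has different meanings."""
    versions = {definition_for(d).version for d in row_dates}
    if len(versions) > 1:
        raise ValueError(
            f"customer_value spans definitions {sorted(versions)}; "
            "segment or restate the data before aggregating"
        )

# Mixing a 2016 row with a 2023 row raises an error instead of silently
# blending a total spend with a predicted lifetime value.
check_comparable([date(2016, 5, 1), date(2016, 9, 9)])    # OK, one definition
# check_comparable([date(2016, 5, 1), date(2023, 2, 2)])  # raises ValueError
```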

Patterns in the future-proof architecture

In writing this book, our effort is to highlight for the data engineer the reality that, in our current information technology solutions, we process data as information when, in fact, we want to use it to inform the business knowledgeably.

Today, we glue solutions together with code that manipulates data to mimic information for business consumption. What we really want is to retain the business information with the data and make the data smart, so that information in context forms knowledge, which in turn forms insights for the data consumer. The progression – raw data transformed into information, then into knowledge through the preservation of semantics and context, and finally into analytically derived insights – will be elaborated on in future chapters. In Chapter 18, we have included a number of use cases that you will find interesting. From my experience over the years, I’ve learned that making data smarter has always been rewarded.

The resulting insights may be presented to the business in new, innovative ways whenever the business requires insights from its data. The gap we see in the technology landscape is that, for data to be leveraged as an insight generator, its data journey must be an informed one. Innovation can’t be pre-canned by the software engineer. It is teased out of the minds of business and IT leaders by the knowledge the IT data system presents from different stages of the data journey. This requires data, its semantics, its lineage, its direct or inferred relationships to concepts, its time series, and its context to be retained.

Technology tools and data processing techniques are not yet available to address this need in a single solution, but the need is clearly envisioned. One monolithic data warehouse, data lake, knowledge graph, or in-memory repository can’t solve the total user-originated demand today. Tools need time to evolve. We will need to implement tactically and think strategically regarding what data (also known as truths) we present to the analyst.

Key thought

Implement: Just enough, just in time.

Think strategically: Data should be smart.

Applying innovative modeling approaches can bring systemic and intrinsic risk. Leveraging new technologies will produce key advantages for the business. Minimizing the risk of technical or delivery failure is essential. When thinking of the academic discussions debating data mesh versus data fabric, we see various cloud vendors and tool providers embracing the need for innovation… but also creating a new technical gravity that can suck in the misinformed business IT leader.

Remember, this is an evolutionary event, and for some it can become an extinction-level event. Microsoft and Amazon can embrace well-architected best practices that foster greater cloud spend and greater cloud vendor lock-in. Cloud platform-as-a-service (PaaS) offerings, cloud architecture patterns, and biased vendor training can be terminal events for a system and its builders. The same goes for tool providers such as the creators of relational database management systems (RDBMSs), data lakes, operational knowledge graphs, or real-time in-memory storage systems. None of these providers or their niche consulting engagements come with warning signs. As a leader trying to minimize risk and maximize gain, you need to keep an eye on the end goal:

“I want to build a data solution that no one can live without – that lasts forever!”

To accomplish this goal, you will need to be very clear on the mission and retain a clear vision going forward. With a well-developed set of principles and best practices, a clear position regarding key considerations, and an unchallenged governance model, the objective is attainable. Be prepared for battle! The field is always evolving, and there will be challenges to the architecture over time, maybe before it is even operational. Our suggestion is to always be ready for these challenges and not to count on political power alone to enforce compliance with or governance of the architecture.

You will want to consider these steps when building a modern system:

  • Collect the objectives and key results (OKRs) from the business and show successes early and often.
  • Always have a demo ready for key stakeholders at a moment’s notice.
  • Keep those key stakeholders engaged and satisfied as the return on investment (ROI) is demonstrated. Also, consider that they are funding your effort.
  • Keep careful track of the feature to cost ratio and know who is getting value and at what cost as part of the system’s total cost of ownership (TCO).
  • Never break a data service level agreement (SLA) or data contract without giving the stakeholders and users enough time to accommodate impacts. It’s best to not break the agreement at all, since it clearly defines the data consumer’s expectations!
  • Architect data systems that are backward-compatible and never produce a broken contract once the business has engaged the system to glean insight. Pulling the rug out from under the business will have a greater impact than not delivering a solution in the first place, since they will have set up their downstream expectations based on your delivery.

You can see that there are many patterns to consider and some to avoid when building a modern data solution. Software engineers, data admins, data scientists, and data analysts will come with their own perspectives and technical requirements, in addition to the OKRs that the business will demand. Not all technical players will honor the nuances that their peers’ disciplines require. Yet the data engineer has to deliver the future-proof solution while balancing on top of a pyramid of change.

In the next section, we will show you how to keep the technological edge and retain the balance necessary to create a solution that withstands the test of time.

Future-proofing is …

To future-proof a solution means to create a solution that is relevant to the present, scalable, and cost-effective, and will still be relevant in the future. This goal is attainable with a constant focus on building out a reference architecture with best practices and design patterns.

The goal is as follows:

Develop a scalable, affordable IT strategy, architecture, and design that leads to the creation of a future-proof data processing system.

When faced with the preceding goal, you have to consider that change is evolutionary rather than revolutionary, and that a data architecture must therefore be solid and future-proof. Making a system 100% future-proof is an illusion; however, the goal of attaining a near future-proof system must always remain a prime driver of your core principles.

The attraction of shiny lights must never become bait to catch an IT system manager in a web of errors, even though cool technology may attract a lot of venture and seed capital or even create a star on one’s curriculum vitae (CV). It may just as well all fade away after a breakthrough in a niche area is achieved by a disrupter. Just look at what happened when OpenAI, ChatGPT, and related large language model (LLM) technology started to roll out. Conversational artificial intelligence (AI) has changed many systems already.

After an innovation rolls out, what was once hard becomes easy and is often available in open source, where it becomes commoditized. Even if a business software method or process-oriented intellectual property (IP) is locked away with patent protection, after some time – 10, 15, or 20 years – it, too, becomes free for reuse. In the IP’s filing disclosure, valuable insights are also made available to the competition. There can only be so many cutting-edge tech winners, and brilliant minds tend to work on the same problem at the same time until a breakthrough is attained, often creating similar approaches. Data engineering is nearing just such an inflection point.

There will always be many more losers than winners. Depending on the size of an organization’s budget and its culture for risk/reward, a shiny light idea can arise and become a blazing star. However, 90% of those who pursue the shooting star wind up developing a dud that fades away along with an entire IT budget. Our suggestion is to follow the business’s money and develop agilely to minimize the risk of IT-driven failure.

International Data Corporation (IDC) and the business intelligence organization Qlik came up with the following comparison:

“Data is the new water.”

You can say that data is oil or that it is water – a great idea is being twisted and repurposed, even in these statements. It’s essential that data becomes information and that information is rendered in such a way as to create direct, inferred, and derived knowledge. Truth needs to be defined as knowledge in context, including time. We need systems that are not mere data processing systems but knowledge-aware systems that support intelligence, insight, and the development of truths that withstand the test of time. In that way, a system may be future-proof. Data alone is too murky, like dirty water. It’s clouded by the following:

  • Nonsense structures developed to support current machine insufficiency
  • Errors due to misunderstanding of the data meaning and lineage
  • Deliberate opacity due to privacy and security
  • Missing context or state due to missing metadata
  • Missing semantics due to complex relationships not being recorded because of missing data and a lack of funding to properly model the data for the domain in which it was collected

Data life cycle processes and costs are often not considered fully. Business use cases drive what is important (note that we will elaborate a lot more on how use cases are represented by conceptual, logical, and physical architectures in Chapters 5 to 7). Use cases are often not identified early enough. The data services implemented as part of the solution are often left undocumented; they are neither communicated well nor maintained well over the data’s timeframe of relevancy. The result is that the data’s quality melts down like a sugar cube left in the rain: its worth degrades organically over time. This decay may be accelerated when the business and technical contracts are not maintained, and with that neglect comes a loss of trust in a dataset’s governance. The resulting friction between business silos becomes palpable. A potential solution is to create business data services with data contracts. These contracts are defined by well-maintained metadata and describe the dataset at rest (its semantics), its origin (its lineage), and its security methods. They also include software service contracts for the timely maintenance of the subscribed quality metrics.
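As a minimal sketch of what such a data contract might look like in practice (the dataset names, owner, and thresholds below are hypothetical and not drawn from any specific standard), the contract can be captured as a small, version-controlled structure that travels with the dataset:

```python
from dataclasses import dataclass

@dataclass
class ColumnSpec:
    name: str
    dtype: str
    meaning: str                 # business semantics, kept with the data

@dataclass
class DataContract:
    dataset: str
    owner: str                   # accountable business/data owner
    lineage: list                # upstream sources this dataset is derived from
    columns: list
    freshness_hours: int         # SLA: maximum age before the data is stale
    min_completeness: float      # SLA: fraction of non-null key values required
    version: str = "1.0.0"

# Hypothetical contract for a curated orders dataset
orders_contract = DataContract(
    dataset="gold.orders_daily",
    owner="sales-analytics@example.com",
    lineage=["bronze.pos_raw", "silver.orders_cleaned"],
    columns=[
        ColumnSpec("order_id", "string", "Unique retailer order identifier"),
        ColumnSpec("order_total_eur", "decimal(12,2)", "Order total incl. VAT, in EUR"),
    ],
    freshness_hours=24,
    min_completeness=0.99,
)

def validate(row_count: int, null_count: int, age_hours: float,
             contract: DataContract) -> list:
    """Return a list of contract violations for one delivered batch."""
    problems = []
    if age_hours > contract.freshness_hours:
        problems.append(f"stale: {age_hours:.1f}h exceeds {contract.freshness_hours}h SLA")
    completeness = 1 - (null_count / max(row_count, 1))
    if completeness < contract.min_completeness:
        problems.append(f"completeness {completeness:.3f} below {contract.min_completeness}")
    return problems

print(validate(row_count=10_000, null_count=250, age_hours=30.0, contract=orders_contract))
```

Because the semantics, lineage, and quality thresholds live with the dataset rather than in one analyst’s head, a violated expectation becomes a visible contract breach instead of a silent erosion of trust.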

Businesses need to enable datasets to be priced, enhanced as value-added sets, and even sold to the highest bidder. This is driven over time by the cost of maintaining data systems, which can only increase. The data’s relevance (correctness), submitted for value-added enrichment and re-integration into commoditized data exchanges, is a key objective:

Don’t move data; enrich it in place along with its metadata to preserve semantics and lineage!

The highest bidder builds on the data according to the framework architecture and preserves the semantic domain for which the data system was modeled. Like a ratchet that never loses its grip, datasets need to stay correct and hold on to reality over time. The reality for which the dataset was created can then be built upon by value-added resellers without sacrificing quality or the data service level.

Observe that, over time, the cost of maintaining data correctness, context, and relevance will exceed any single organization’s ability to sustain it for a domain. Naturally, it remains instinctual for the IT leader to hold on to the data and produce a silo. This natural tendency to hide the imperfections of an established system that is slowly melting down must be fixed in the future data architecture’s approach. Allowing the data to evolve and drift, be value-added, and yet remain correct and maintainable is essential. Imperfect alignment of facts, assertions, and other modeled relationships within a domain would be diminished with this approach.

Too often in today’s processing systems, the data is curated to the point where it is considered good enough for now. Yet, it is not good enough for future repurposing. It carries all the assumptions, gaps, fragments, and partial data implementations that made it just good enough. If the data is right and self-explanatory, its data service code is simpler. The system solution is engineered to be elegant. It is built to withstand the pressure of change since the data organization was designed to evolve and remain 100% correct for the business domain.

“There is never enough time or money to get it right… the first time! There is always time to get it right later… again and again!”

This pragmatic approach can stop the IT leader’s search for a better data engineering framework. Best practices could become a bother, since the solution just works and we don’t want to fix what works. However, you must be realistic about the current tooling choices available. The cost to implement any solution must be the right fit, yet as part of the architecture due diligence process, you still need to push against the edge of technology to seize innovation opportunities when they are ripe for the taking.

Consider semantic graph technology in OWL/RDF, with its modeling and validation complexities and SPARQL querying, compared to using labeled property graphs with custom code for the semantic representation of data in a subject domain’s knowledge base. Both have advantages and disadvantages; however, neither scales without a change-data-capture mechanism syncing an in-memory analytics storage area to support real-time analytics use cases. Cloud technology has not kept up with making a one-size-fits-all data store, data lake, or data warehouse. Better said, one technology solution that fits all use cases and operational service requirements does not exist.
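To make the comparison concrete, here is a small sketch (plain Python data structures, not any particular triple store or graph database API) of the same fact modeled both ways:

```python
# The same fact – "Order 42 contains Product P7, quantity 3" – modeled two ways.

# 1) RDF-style triples: everything, including the quantity, becomes more triples,
#    typically via an intermediate "order line" resource.
triples = [
    ("order:42", "rdf:type", "schema:Order"),
    ("order:42", "ex:hasLine", "line:42-1"),
    ("line:42-1", "ex:product", "product:P7"),
    ("line:42-1", "ex:quantity", 3),
]

# 2) Labeled property graph: the relationship itself carries properties,
#    so the quantity lives directly on the CONTAINS edge.
nodes = {
    "order:42": {"label": "Order"},
    "product:P7": {"label": "Product"},
}
edges = [
    {"from": "order:42", "to": "product:P7", "type": "CONTAINS", "props": {"quantity": 3}},
]

def products_in_order(order_id: str) -> list:
    """Traverse the property graph: products and quantities for one order."""
    return [(e["to"], e["props"]["quantity"])
            for e in edges
            if e["from"] == order_id and e["type"] == "CONTAINS"]

print(products_in_order("order:42"))  # [('product:P7', 3)]
```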

Since one size does not fit all, one data representation does not fit all use cases.

A monolithic data lake, Delta Lake, raw data store, or data warehouse does not fit all business needs. Logical, and often physical, segmentation of data is needed to create the right-sized solution that supports the required use cases. The data engineer has to balance cost, security, performance, scale, and reliability requirements, as well as provider limitations. Just as one shoe size does not fit all, the solution has to be implementable and remain functional over time.

Organization into zone considerations

One facet of the data engineering best practices presented in this book is the need for a primary form of data representation for important data patterns. A raw ingest zone is envisioned to hold input Internet of Things (IoT) data, raw retailer point-of-sale data, chemical property reference data, or web analytics usage data. We are proposing that the concept of the zone be a formalization of the layers set forth in the Databricks Medallion Architecture (https://www.databricks.com/glossary/medallion-architecture). It may be worth reading through the structure of that architecture pattern or waiting until you get a chance to read Chapter 6, where a more detailed explanation is provided.

Raw data may need profiling applied as part of ingest processing to ensure that input data is not rejected due to syntactic or semantic incorrectness. This profiled data may even be normalized in a basic manner prior to the next stage of the data pipeline journey. From there, data is transformed into the bronze zone, later into the silver zone, then the gold zone, and finally made ready for the consumption zone (for real-time, self-serve analytics use cases).

The bronze, silver, and gold zones host information of varying classes. The gold zone’s data organization looks a lot like a classic data warehouse, and the bronze zone looks like a data lake, with the silver zone being a cache-enabled data lake holding a lot of derived, imputed, and inferred data drawn from processing the bronze zone. The silver zone supports online transaction processing (OLTP) use cases but stores its processed outputs in the gold zone. The gold zone may also support OLTP use cases directly against its information.

The consumption zone provides for the measures, calculated metrics, and online analytical processing (OLAP) needs of the user. Keeping it all in sync can become a nightmare of complexity without a clear framework and best practices to keep the system correct. Just think about the loss of linear dataflow control in an AWS or Azure cloud PaaS solution implementing this zone blueprint. Without a clear architecture, data framework, best practices, and governance, be prepared for many trials and errors.
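As a minimal sketch of this zone progression (our own illustration using PySpark with Delta tables; the table names, columns, and cleansing rules are hypothetical and will differ in your environment), each hop is a small, testable transformation:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw point-of-sale data as-is, adding only ingest metadata.
raw = (spark.read.format("json").load("s3://landing/pos/2024-10-11/")
       .withColumn("_ingested_at", F.current_timestamp())
       .withColumn("_source_file", F.input_file_name()))
raw.write.format("delta").mode("append").saveAsTable("bronze.pos_raw")

# Silver: cleanse, deduplicate, and conform types; semantics start to matter here.
silver = (spark.table("bronze.pos_raw")
          .dropDuplicates(["order_id"])
          .filter(F.col("order_total_eur") >= 0)
          .withColumn("order_date", F.to_date("order_ts")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders_cleaned")

# Gold: business-level aggregates ready for warehouse-style consumption.
gold = (spark.table("silver.orders_cleaned")
        .groupBy("order_date", "store_id")
        .agg(F.sum("order_total_eur").alias("daily_revenue_eur"),
             F.countDistinct("order_id").alias("order_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_store_revenue")
```

Each zone write is an explicit, auditable step, which is what makes the lineage, observability, and testing practices discussed later in this chapter tractable.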

Cloud limitations

Data engineering best practices must take into consideration current cloud provider limitations and the constraints that drive cost for data movement and third-party tool deployment for analytics. Consider the ultimate: a zettabyte cube of memory with sub-millisecond access for terabytes of data, where compute code resides with the data to support relationships in a massive fabric or mesh. Impossible today! But wait – maybe tomorrow this will be reality. Meanwhile, how do you build today so that you can effortlessly move to that vision in the future? This is the focus of this book’s best practices. All trends point to the eventual creation of big-data, AI-enabled data systems.

There are some key trends and concepts forming as part of that vision. Data sharing, confidential computing, and concepts such as bringing your algorithm to the data must be considered core approaches to repurposing datasets, their semantics, and their business value as data leaves the enterprise and enters the publicly available domain. With the data, information, and derived knowledge comes the data consumption handle, which is more than loosely defined metadata. It consists of the lineage, semantics, context, and timely value needed to sustain trust so that monetary compensation for the stream of value-added information becomes possible. These royalty contracts make data resellable and demystified. Just as a book is published, so can data be published. The best practices in this book will position data to support the development of value-added services over curated information in the course of time. Smart data becomes a value-added ecosystem in and of itself, which is as important as the software data processing systems of past generations, where data was but a snapshot of the processing state for which the system was created.

The Intelligence Age

Future-state data-as-a-service (DaaS) offerings will depend on the new data engineering framework that this book highlights, and we will show the best practice considerations for the development framework that structure requires. Curating data and its metadata into information, knowledge, and then insights involves transforming data into truths that withstand the test of time. This novel data framework, required for the commoditization of information, is essential as we exit the Information Age and enter the Intelligence Age. In the Intelligence Age, insights are gathered from knowledge formed from information operated upon by human and AI systems.

Achieving AI goals requires the application of various machine learning and deep learning algorithms that define the Intelligence Age. Along with these algorithms, you can envision the development of extremely advanced quantum computers and zettabyte in-memory storage arrays, which will be needed as part of the new data engineering organization. What is often not discussed is the data engineering framework required to facilitate the algorithms; without it, the data lake will become a data swamp in short order. The power of computing will advance in leaps and bounds. What we build today in software systems paves the way for those hardware advances – not the reverse. Software algorithms drive the hardware computing power needed; otherwise, the hardware remains underutilized. Data engineering has been a laggard in this evolutionary process.

For this purpose, this book has been created to future-proof data engineering architectures and designs by providing best practices with considerations for senior IT leadership. Along with that come practical data architecture approaches with use case justifications for the data architect and data engineer; these will add color to the best practice descriptions.

Use case definitions

Question: Why should the development of any data engineering system be use-case-driven?

Answer: If one cannot develop a solution that integrates with the business’s needs, it is irrelevant; it can’t be communicated, nor can its efficacy be quantified.

Without use cases, a data processing system does not provide the tangible value required to keep funding flowing. Even if a solution is the best thing since peanut butter, it will quickly devolve into an ugly set of post-mortem discussions when failures to meet expectations start to arise.

The solution needs to be built as a self-documenting, self-demonstrable collection of use cases that support a test-driven approach to delivery.

It’s part art and part science, but fully realizable with a properly focused system data architecture. Defining reference use cases and how they will support the architecture is a high bar to achieve. As the use cases are created and layered into the development plans as features of the solution, you must not get lost in the effort. To keep the focus, you need a vision, a strategy, and a clear mission.

The mission, the vision, and the strategy

You should begin with the mission and vision for which this overview section has laid a foundation. These should be aligned with the organization’s strategy, and if they are not… then alignment must be achieved. We will elaborate more on this in subsequent sections.

Principles and the development life cycle

Principles govern the choices made in developing the business strategy that defines the architecture, where technologists apply art and science to fulfill the business needs. Again, alignment is required and necessary; otherwise, there will be difficulties when the first problems arise and they are not easily surmountable. The cost of making mistakes early is far greater than the cost of making errors later in the engineering development life cycle. The data engineering life cycle begins with architecture.

The architecture definition, best practices, and key considerations

The architecture can be developed in many ways, but what we as engineers, architects, and authors have discovered is that the core architecture deliverable needs to have three main components:

  • A conceptual architecture
  • A logical architecture
  • A physical architecture

The conceptual architecture shows the business mission, vision, strategy, and principles, clearly presented as an upward- and outward-facing communications tool. In the conceptual architecture’s definition, there will be a capabilities matrix that shows all the capabilities needed for your solution, and these will be mapped to deliverables. This will be the focus of Chapter 5; for now, it is enough to know that the foundation of the solution’s concepts will be your principles, aligned with the vision, mission, and strategy of your business.

The logical architecture shows the software services and dataflows necessary to implement the conceptual architecture and ties the concepts to the logical implementation of those concepts. The physical architecture defines the deployable software, configurations, data models, ML models, and cloud infrastructure of the solution.

Our best practices and key considerations are drawn from years of experience with big data processing systems and start-ups in the areas of finance, health, publishing, and science. Projects in those areas included analytics of social media, health, and retail data.

Use cases can be created using information contained in the logical as well as the physical architecture (a minimal sketch of how this information might be captured follows the list):

  • Logical use cases:
    • Software service flows
    • Dataflows
  • Physical use cases include:
    • Infrastructural configuration information
    • Operational process information
    • Software component inventory
    • Algorithm parameterization
    • Data quality/testing definition and configuration information
    • DevOps/MLOps/TestOps/DataOps trace information
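As a minimal sketch of how this logical and physical use case information might be captured in machine-readable form (the structure and field names below are our own illustration, not a standard prescribed by the book):

```python
from dataclasses import dataclass, field

@dataclass
class LogicalUseCase:
    name: str
    service_flow: list           # ordered software services involved
    dataflow: list               # dataset hops, e.g. zone-to-zone moves

@dataclass
class PhysicalUseCase:
    logical: LogicalUseCase
    infrastructure: dict         # infrastructural configuration (platform, region, ...)
    components: list             # deployed software component inventory
    parameters: dict             # algorithm parameterization
    quality_checks: list         # data quality/testing configuration
    ops_trace: list = field(default_factory=list)  # DevOps/MLOps/TestOps/DataOps trace info

# Hypothetical use case tying the earlier zone sketch together
daily_revenue = PhysicalUseCase(
    logical=LogicalUseCase(
        name="daily-store-revenue",
        service_flow=["ingest-service", "cleanse-service", "aggregate-service"],
        dataflow=["bronze.pos_raw -> silver.orders_cleaned",
                  "silver.orders_cleaned -> gold.daily_store_revenue"],
    ),
    infrastructure={"platform": "spark", "region": "eu-west-1"},
    components=["pos-ingest:1.4.2", "orders-cleanse:2.0.0"],
    parameters={"dedupe_window_hours": 24.0},
    quality_checks=["order_total_eur >= 0", "completeness(order_id) >= 0.99"],
)
```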

Reusable design patterns are groupings of these use cases that have clean interfaces and are generic enough to be repurposed across data domains, thereby reducing the cost to develop and operate them. With the simplification of the software design brought by the smart data framework’s organization, use cases will coalesce into patterns easily. This will be an accelerator for software development in the future. Dataflows will be represented by these design patterns, which makes them more than just static paper definitions. They will be operational design patterns that reflect the data journey through the data framework’s engineered solution, aligned with the architecture.

The DataOps convergence

The data journey is a path from initial raw data ingestion through classification that ultimately positions transformed information for end user consumption. Curated, consumable data progresses through various zones of activity. These will be defined in more detail in subsequent chapters, but the zones are bronze, silver, and gold. Datasets are curated in a data factory manner that is logically and physically grouped into these zones of activity. All custom-built and configured data-factory-hosted pipeline journeys utilize a data engineer’s standard process, which you will develop; otherwise, IT operations and the maintenance of service levels through agreed contracts will be at risk. Data transformation and cataloging activities are centered around what others have coined DataOps.

By 2025, a data engineering team guided by DataOps practices and tools will be 10 times more productive than teams that do not use DataOps. (2022 Gartner Market Guide for DataOps Tools, {https://packt-debp.link/6JRtF4})

DataOps, according to Gartner, is composed of five core capabilities (a small code sketch after the list illustrates the orchestration and observability capabilities):

  • Orchestration capabilities involve the following:
    • Connectivity
    • Scheduling
    • Logging
    • Lineage
    • Troubleshooting
    • Alerting
    • Workflow automation
  • Observability capabilities enable the following:
    • Monitoring of live or historic workflows
    • Insights into workflow performance
    • Cost metrics
    • Impact analysis
  • Environment management capabilities cover the following:
    • Infrastructure as code (IaC)
    • Resource provisioning
    • Credential management
    • IaC templates (for reuse)
  • Deployment automation capabilities include the following:
    • Version control
    • Approvals
    • Cloud CI/CD and pipelines
  • Test automation capabilities provide the following:
    • Validation
    • Script management
    • Data management
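As a minimal, tool-agnostic sketch of the orchestration and observability capabilities (plain Python rather than any specific DataOps product; the step and table names are illustrative), a pipeline step can record its own lineage, timing, and status:

```python
import functools
import logging
import time
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
run_log = []   # stand-in for an observability/lineage store

def observed(step_name: str, inputs: list, outputs: list):
    """Record lineage, timing, and failures for one pipeline step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc)
            t0 = time.perf_counter()
            status = "failed"                  # assume failure until the step succeeds
            try:
                result = func(*args, **kwargs)
                status = "success"
                return result
            finally:                           # runs on success and failure alike
                run_log.append({
                    "step": step_name,
                    "inputs": inputs,          # lineage: what the step read
                    "outputs": outputs,        # lineage: what the step wrote
                    "started_at": started.isoformat(),
                    "duration_s": round(time.perf_counter() - t0, 3),
                    "status": status,          # alerting hook would key off this
                })
                logging.info("step=%s status=%s", step_name, status)
        return wrapper
    return decorator

@observed("cleanse-orders",
          inputs=["bronze.pos_raw"], outputs=["silver.orders_cleaned"])
def cleanse_orders():
    # ... transformation logic would go here ...
    return "ok"

cleanse_orders()
print(run_log[-1]["status"], run_log[-1]["duration_s"])
```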

To illustrate how these DataOps principles can be applied, imagine a large retail company deploying an inventory management system. See Figure 1.2:

Figure 1.2 – Retail inventory management capabilities

Many third-party vendors have jumped on the DataOps hype and produced fantastic tooling to jumpstart the convergence of DevOps, MLOps, and TestOps practices for modern cloud data systems.

The data engineering best practices of this book will also support the DataOps practices noted by Gartner while remaining neutral to the specific tooling choices. The focus will be on the data engineering framework that the DataOps effort will make streamlined, efficient, and future-proof. Refer to Figure 1.3:

Figure 1.3 – DataOps tools augmenting data management tasks

It is clear that DataOps adds a lot of value to legacy data management processes, enabling a future where new capabilities are made possible. The following quote shows how modern DataOps processes will enable faster development:

A reference customer quoted that they were able to do 120 releases a month by adopting a DataOps tool that was suitable for their environment, as opposed to just one release every three months a year ago. (Gartner, 2022, {https://packt-debp.link/41DfFu})

Summary

In this overview of the business problem, you have learned a number of foundational elements that will be elaborated on in subsequent chapters. This chapter introduced the topics needed to understand the current state of data engineering and the creation of future-proof designs. You have learned that businesses are faced with an ever-changing technological landscape. Competition requires one to innovate at scale to remain relevant. This causes a constant implementation stream of total cost of ownership (TCO) budget allocations for refactoring and re-envisioning during what would normally be a run/manage phase of a system’s lifespan. In this chapter, and in subsequent chapters, we make many references to the engineering solution’s TCO. These references are reminders to all stakeholders that solutions are developed in a real-world business setting; they are not created in some abstract vacuum, devoid of the budgeting constraints that will, at times, limit possibilities. It is important to note that when the TCO is clear yet constrained by budgets, these constraints repeatedly appear on the monthly and quarterly radar reports presented to the enterprise, and they will most likely impose risk. Without a constant stream of reminders, the business will forget how these constraints have shaped the solution.

Additionally, building a system that perpetuates false facts, even if spun as true facts, is foolish. Make the future data solution smart! We are entering an exciting future where data and information solutions will become smarter and support knowledge and intelligence capabilities. Embrace the change and know its implications on your data engineering choices. DataOps needs to be adopted by data professionals as a critical approach to managing data in today’s complex, data-driven world.

One size does not fit all; as such, building with data contracts in mind will force the development of data stores that hold the same logical data as fit-for-purpose parallel instantiations within the physical data architecture. Correctly building data solutions to be future-proof requires a vision, strategy, mission, and architectural approach to prevent the implementation from dying an untimely death amid the juggling needed to keep the solution serviceable for the business.

Third-party vendors and cloud providers will produce well-architected solutions that do not integrate or, worse yet, that foster architectural anti-patterns that must be avoided. As such, the data mesh and the cloud providers’ data fabrics are only buzzwords until the concepts are fully understood and rationalized into your architecture and your organization’s objectives. Design data solutions consistently with the architecture you develop, develop use cases across the system, and test, regression-test, and monitor them for continual service in order to maintain the trust established through data contracts.

Lastly, stay agile! Read! Learn! Be innovative! Once the big picture is grasped, the forward-looking perspective will grant you the foresight to look beyond the obstacles that will be encountered. You will be able to keep the data solution and its data fresh and current with a governed, agile architectural process.

In the next chapter, you will be presented with the architectural background challenges that build on this overview.


Key benefits

  • Architect and engineer optimized data solutions in the cloud with best practices for performance and cost-effectiveness
  • Explore design patterns and use cases to balance roles, technology choices, and processes for a future-proof design
  • Learn from experts to avoid common pitfalls in data engineering projects
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Revolutionize your approach to data processing in the fast-paced business landscape with this essential guide to data engineering. Discover the power of scalable, efficient, and secure data solutions through expert guidance on data engineering principles and techniques. Written by two industry experts with over 60 years of combined experience, it offers deep insights into best practices, architecture, agile processes, and cloud-based pipelines. You’ll start by defining the challenges data engineers face and understand how this agile and future-proof comprehensive data solution architecture addresses them. As you explore the extensive toolkit, mastering the capabilities of various instruments, you’ll gain the knowledge needed for independent research. Covering everything you need, right from data engineering fundamentals, the guide uses real-world examples to illustrate potential solutions. It elevates your skills to architect scalable data systems, implement agile development processes, and design cloud-based data pipelines. The book further equips you with the knowledge to harness serverless computing and microservices to build resilient data applications. By the end, you'll be armed with the expertise to design and deliver high-performance data engineering solutions that are not only robust, efficient, and secure but also future-ready.

Who is this book for?

If you are a data engineer, ETL developer, or big data engineer who wants to master the principles and techniques of data engineering, this book is for you. A basic understanding of data engineering concepts, ETL processes, and big data technologies is expected. This book is also for professionals who want to explore advanced data engineering practices, including scalable data solutions, agile software development, and cloud-based data processing pipelines.

What you will learn

  • Architect scalable data solutions within a well-architected framework
  • Implement agile software development processes tailored to your organization's needs
  • Design cloud-based data pipelines for analytics, machine learning, and AI-ready data products
  • Optimize data engineering capabilities to ensure performance and long-term business value
  • Apply best practices for data security, privacy, and compliance
  • Harness serverless computing and microservices to build resilient, scalable, and trustworthy data pipelines

Product Details

Publication date: Oct 11, 2024
Length: 550 pages
Edition: 1st
Language: English
ISBN-13: 9781803247366


Table of Contents

Chapter 1: Overview of the Business Problem Statement
Chapter 2: A Data Engineer’s Journey – Background Challenges
Chapter 3: A Data Engineer’s Journey – IT’s Vision and Mission
Chapter 4: Architecture Principles
Chapter 5: Architecture Framework – Conceptual Architecture Best Practices
Chapter 6: Architecture Framework – Logical Architecture Best Practices
Chapter 7: Architecture Framework – Physical Architecture Best Practices
Chapter 8: Software Engineering Best Practice Considerations
Chapter 9: Key Considerations for Agile SDLC Best Practices
Chapter 10: Key Considerations for Quality Testing Best Practices
Chapter 11: Key Considerations for IT Operational Service Best Practices
Chapter 12: Key Considerations for Data Service Best Practices
Chapter 13: Key Considerations for Management Best Practices
Chapter 14: Key Considerations for Data Delivery Best Practices
Chapter 15: Other Considerations – Measures, Calculations, Restatements, and Data Science Best Practices
Chapter 16: Machine Learning Pipeline Best Practices and Processes
Chapter 17: Takeaway Summary – Putting It All Together
Chapter 18: Appendix and Use Cases
Index
Other Books You May Enjoy

Customer reviews

Rating: 5.0 (2 ratings)

Frisian – Oct 29, 2024 – 5 stars
"Data Engineering Best Practices" by Richard Schiller is a down-to-earth guide for anyone serious about building data solutions that actually stand the test of time. Schiller dives into the real-world problems data engineers face – keeping up with rapid cloud migrations, juggling Agile processes, and prioritizing data privacy – all while offering practical advice on how to do it right. With a mix of technical know-how and a thoughtful, big-picture approach, this book doesn’t just throw concepts at you. Instead, Schiller walks you through each stage of a project, making sure that the solutions are realistic, sustainable, and well-aligned with business goals. The book’s structure feels natural and easy to follow, starting from identifying core business challenges to laying out the best practices for building resilient architectures that can handle whatever the future brings. Schiller’s insights on data governance, machine learning, and adapting data strategies over time make it clear that he knows his stuff. His experience shows, and he makes even complex ideas feel within reach. For data engineers and IT professionals alike, this is the kind of book you’ll keep coming back to, packed with ideas and examples that resonate well beyond the technical details.

Amazon Customer – Oct 29, 2024 – 5 stars
I have found "Data Engineering Best Practices" to be a comprehensive and engaging walk through the best practices necessary to position an engineer for success. As stated in the beginning chapter, this book applies a hands-on, practical approach to applying software and data engineering practices to modern use cases. As a new data engineer, I must say I found this book engaging and well written. I learned much as I read through each chapter and found the witticisms and manner of presentation helpful when trying to apply the technology to my prior experience. For example, "Don't expect to create the next state-of-the-art tool" – this type of advice presents the content in a down-to-earth way that draws me into wanting to accept the practices presented. The author’s friendly, helpful tone keeps it real for me. I would recommend this book to anyone seeking to clarify their solution engineering efforts while building up an expert understanding of data engineering best practices.
