Data Engineering Best Practices

You're reading from   Data Engineering Best Practices Architect robust and cost-effective data solutions in the cloud era

Arrow left icon
Product type Paperback
Published in Oct 2024
Publisher Packt
ISBN-13 9781803244983
Length 550 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
David Larochelle David Larochelle
Author Profile Icon David Larochelle
David Larochelle
Richard J. Schiller Richard J. Schiller
Author Profile Icon Richard J. Schiller
Richard J. Schiller
Arrow right icon
View More author details
Toc

Table of Contents (21) Chapters Close

Preface 1. Chapter 1: Overview of the Business Problem Statement FREE CHAPTER 2. Chapter 2: A Data Engineer’s Journey – Background Challenges 3. Chapter 3: A Data Engineer’s Journey – IT’s Vision and Mission 4. Chapter 4: Architecture Principles 5. Chapter 5: Architecture Framework – Conceptual Architecture Best Practices 6. Chapter 6: Architecture Framework – Logical Architecture Best Practices 7. Chapter 7: Architecture Framework – Physical Architecture Best Practices 8. Chapter 8: Software Engineering Best Practice Considerations 9. Chapter 9: Key Considerations for Agile SDLC Best Practices 10. Chapter 10: Key Considerations for Quality Testing Best Practices 11. Chapter 11: Key Considerations for IT Operational Service Best Practices 12. Chapter 12: Key Considerations for Data Service Best Practices 13. Chapter 13: Key Considerations for Management Best Practices 14. Chapter 14: Key Considerations for Data Delivery Best Practices 15. Chapter 15: Other Considerations – Measures, Calculations, Restatements, and Data Science Best Practices 16. Chapter 16: Machine Learning Pipeline Best Practices and Processes 17. Chapter 17: Takeaway Summary – Putting It All Together 18. Chapter 18: Appendix and Use Cases 19. Index 20. Other Books You May Enjoy

Use case definitions

Question: Why should the development of any data engineering system be use-case-driven?

Answer: If a solution cannot be developed to integrate with the business’s needs, it is irrelevant; it can’t be communicated, nor can its efficacy be quantified.

Without use cases, a data processing system does not demonstrate the tangible value required to sustain its funding. Even if a solution is the best thing since peanut butter, it will quickly devolve into an ugly set of post-mortem discussions once it starts falling short of expectations.

The solution needs to be built as a self-documenting, self-demonstrable collection of use cases that support a test-driven approach to delivery.
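To make this concrete, here is a minimal sketch of a use case captured as an executable, self-documenting test, in the spirit of that test-driven approach. The dataset name, the load_dataset() helper, and the freshness threshold are hypothetical placeholders, not part of the book’s reference solution:

```python
# Hypothetical sketch: one business expectation expressed as a pytest-style test.
# The dataset name, load_dataset() helper, and freshness threshold are invented
# for illustration only.
from datetime import datetime, timedelta, timezone


def load_dataset(name: str) -> list:
    """Stand-in for your platform's data access layer (would query the curated zone)."""
    return [{"order_id": 1, "loaded_at": datetime.now(timezone.utc)}]


def test_daily_orders_are_fresh_and_complete():
    """Use case: analysts need yesterday's orders available and no more than a day old."""
    rows = load_dataset("gold.daily_orders")
    assert rows, "use case fails if no curated orders were delivered"
    newest = max(r["loaded_at"] for r in rows)
    assert datetime.now(timezone.utc) - newest < timedelta(hours=24), \
        "stale data; the use case's freshness expectation is not met"
```

Because the expectation lives in code, the use case documents itself and fails visibly the moment delivery no longer meets it.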

It’s part art and part science, but fully realizable with a properly focused system data architecture. Defining reference use cases and how they will support the architecture is a high bar to clear. As use cases are created and layered into the development plan as features of the solution, you must not get lost in the effort. To keep focus, you need a vision, a strategy, and a clear mission.

The mission, the vision, and the strategy

You should begin with the mission and vision, for which this overview section has laid a foundation. These should be aligned with the organization’s strategy; if they are not, alignment must be achieved first. We will elaborate on this in subsequent sections.

Principles and the development life cycle

Principles govern the choices made as the business strategy is developed into an architecture, where technologists apply art and science to fulfill the business’s needs. Again, alignment is required and necessary; otherwise, the first problems to arise will not be easily surmountable. Mistakes made early cost far more than errors made later in the engineering development life cycle, and the data engineering life cycle begins with architecture.

The architecture definition, best practices, and key considerations

The architecture can be developed in many ways, but what we as engineers, architects, and authors have discovered is that the core architecture deliverable needs to have three main components:

  • A conceptual architecture
  • A logical architecture
  • A physical architecture

The conceptual architecture presents the business mission, vision, strategy, and principles, and serves as an upward- and outward-facing communications tool. Its definition includes a capabilities matrix that lists every capability the solution needs and maps each one to deliverables. This is the focus of Chapter 5; for now, it is enough to know that the foundation of the solution’s concepts is the set of principles aligned with the vision, mission, and strategy of your business.
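As a rough, purely illustrative sketch (the capability names and deliverables below are invented, not taken from the book’s reference solution), a first-pass capabilities matrix can start as a simple mapping that is reviewed with the business:

```python
# Illustrative only: a first-pass capabilities matrix mapping each capability the
# solution needs to the deliverable(s) intended to realize it. Names are hypothetical.
capabilities_matrix = {
    "Ingest partner data daily":     ["Batch ingestion pipeline", "Source onboarding runbook"],
    "Curate a single customer view": ["Customer mastering dataflow", "Gold-zone customer model"],
    "Self-service reporting":        ["Semantic layer", "BI workspace and access policy"],
    "Auditable data lineage":        ["Catalog integration", "Lineage capture in pipelines"],
}

for capability, deliverables in capabilities_matrix.items():
    print(f"{capability}: {', '.join(deliverables)}")
```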

The logical architecture shows the software services and dataflows necessary to implement the conceptual architecture, tying each concept to its logical implementation. The physical architecture defines the deployable software, configurations, data models, ML models, and cloud infrastructure of the solution.

Our best practices and key considerations are drawn from years of experience with big data processing systems and start-ups in finance, health, publishing, and science. Projects in those areas included analytics over social media, health, and retail data.

Use cases can be created using information contained in the logical as well as the physical architecture:

  • Logical use cases:
    • Software service flows
    • Dataflows
  • Physical use cases:
    • Infrastructural configuration information
    • Operational process information
    • Software component inventory
    • Algorithm parameterization
    • Data quality/testing definition and configuration information
    • DevOps/MLOps/TestOps/DataOps trace information

Reusable design patterns are groupings of these use cases that have clean interfaces and are generic enough to be repurposed across data domains, thereby reducing the cost of developing and operating them. Because the smart data framework’s organization simplifies the software design, use cases will coalesce into patterns easily, which will accelerate future software development. Dataflows will be represented by these design patterns, making them more than static paper definitions: they will be operational design patterns that reflect the data journey through the framework’s engineered solution, aligned with the architecture.
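A minimal sketch of such a reusable pattern follows, assuming nothing beyond the standard library: the generic steps sit behind clean callable interfaces, so the same pattern can be repurposed for another data domain simply by swapping in domain-specific functions. All names and rules are illustrative:

```python
# A minimal sketch of a reusable dataflow pattern: the generic steps have clean
# interfaces (plain callables), so the same pattern can be repurposed across
# data domains by swapping in domain-specific functions. Names are illustrative.
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]


def run_pattern(extract: Callable[[], Iterable[Record]],
                transform: Step,
                validate: Step,
                publish: Callable[[list], None]) -> None:
    """One operational design pattern: extract -> transform -> validate -> publish."""
    records = extract()
    records = transform(records)
    records = validate(records)
    publish(list(records))


# Repurposing the pattern for a hypothetical "orders" domain:
run_pattern(
    extract=lambda: [{"order_id": 1, "amount": "42.50"}],
    transform=lambda rows: ({**r, "amount": float(r["amount"])} for r in rows),
    validate=lambda rows: (r for r in rows if r["amount"] >= 0),
    publish=lambda rows: print(f"published {len(rows)} curated records"),
)
```

Grouping use cases behind interfaces like these is what lets a pattern run as an operational artifact rather than remain a static paper definition.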

The DataOps convergence

The data journey is a path from initial raw data ingestion, through classification, to transformed information positioned for end-user consumption. Curated, consumable data progresses through zones of activity; these are defined more fully in subsequent chapters, but the zones are bronze, silver, and gold. Datasets are curated in a data factory manner, logically and physically grouped into these zones. Every custom-built, configured data pipeline journey hosted in the data factory should follow a standard data engineering process, which you will develop; otherwise, IT operations and the maintenance of agreed service levels would be at risk. Data transformation and cataloging activities are centered on what others have coined DataOps.
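The zones are defined properly in later chapters; the following plain-Python sketch only illustrates the idea of a dataset being promoted along that journey, with invented cleaning and aggregation rules:

```python
# Minimal sketch of the data journey through zones (bronze -> silver -> gold).
# The cleaning and aggregation rules are invented for illustration; real zone
# definitions follow in later chapters.
raw_events = [
    {"sku": "A1", "qty": "3", "store": "north"},
    {"sku": "A1", "qty": "bad", "store": "north"},   # malformed record
    {"sku": "B2", "qty": "5", "store": "south"},
]

# Bronze: land the data as received, untouched, for traceability.
bronze = list(raw_events)

# Silver: standardize types and drop records that fail basic checks.
silver = [{**r, "qty": int(r["qty"])} for r in bronze if r["qty"].isdigit()]

# Gold: curate a consumption-ready view for end users.
gold = {}
for r in silver:
    gold[r["sku"]] = gold.get(r["sku"], 0) + r["qty"]

print(gold)  # {'A1': 3, 'B2': 5}
```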

By 2025, a data engineering team guided by DataOps practices and tools will be 10 times more productive than teams that do not use DataOps. (2022 Gartner Market Guide for DataOps Tools, {https://packt-debp.link/6JRtF4})

DataOps, according to Gartner, is composed of five core capabilities:

  • Orchestration capabilities involve the following:
    • Connectivity
    • Scheduling
    • Logging
    • Lineage
    • Troubleshooting
    • Alerting
    • Workflow automation
  • Observability capabilities enable the following:
    • Monitoring of live or historic workflows
    • Insights into workflow performance
    • Cost metrics
    • Impact analysis
  • Environment management capabilities cover the following:
    • Infrastructure as code (IaC)
    • Resource provisioning
    • Credential management
    • IaC templates (for reuse)
  • Deployment automation capabilities include the following:
    • Version control
    • Approvals
    • Cloud CI/CD and pipelines
  • Test automation capabilities provide the following:
    • Validation
    • Script management
    • Data management

To illustrate how these DataOps principles can be applied, imagine a large retail company deploying an inventory management system. See Figure 1.2:

Figure 1.2 – Retail inventory management capabilities

Many third-party vendors have jumped on the DataOps hype and produced excellent tooling to jumpstart the convergence of DevOps, MLOps, and TestOps practices for modern cloud data systems.

The data engineering best practices in this book also support the DataOps practices noted by Gartner while remaining neutral on specific tooling choices. The focus is on a data engineering framework that the DataOps effort makes streamlined, efficient, and future-proof. Refer to Figure 1.3:

Figure 1.3 – DataOps tools augmenting data management tasks

DataOps clearly adds substantial value to legacy data management processes and opens the door to new capabilities. The following quote shows how modern DataOps processes enable faster development:

A reference customer quoted that they were able to do 120 releases a month by adopting a DataOps tool that was suitable for their environment, as opposed to just one release every three months a year ago. (Gartner, 2022, {https://packt-debp.link/41DfFu})
