Data Engineering Best Practices

You're reading from Data Engineering Best Practices: Architect robust and cost-effective data solutions in the cloud era

Product type Paperback
Published in Oct 2024
Publisher Packt
ISBN-13 9781803244983
Length 550 pages
Edition 1st Edition
Authors (2):
David Larochelle
Richard J. Schiller

Use case definitions

Question: Why should the development of any data engineering system be use-case-driven?

Answer: A solution that does not integrate with the business’s needs is irrelevant: it can’t be communicated, and its efficacy can’t be quantified.

Without use cases, a data processing system does not provide the tangible value required to sustain its funding. Even if a solution is the best thing since peanut butter, it will quickly devolve into an ugly set of post-mortem discussions once it starts failing to meet expectations.

The solution needs to be built as a self-documenting, self-demonstrable collection of use cases that support a test-driven approach to delivery.
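To make the idea of a self-documenting, self-demonstrable use case concrete, here is a minimal sketch that pairs each use case with an executable acceptance check. The `UseCase` class, the `run_use_cases` helper, and the two sample use cases are hypothetical names introduced only for illustration; they are not taken from the book's framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class UseCase:
    """A business use case paired with an executable acceptance check."""
    name: str
    business_need: str                    # why the business cares (the documentation)
    check: Callable[[list[dict]], bool]   # returns True when the need is met


def run_use_cases(use_cases: list[UseCase], data: list[dict]) -> dict[str, bool]:
    """Evaluate every use case against the same dataset and report pass/fail."""
    return {uc.name: uc.check(data) for uc in use_cases}


# Hypothetical use cases, for illustration only.
USE_CASES = [
    UseCase(
        name="orders_have_customer",
        business_need="Every order must be attributable to a customer.",
        check=lambda rows: all(r.get("customer_id") is not None for r in rows),
    ),
    UseCase(
        name="order_totals_non_negative",
        business_need="Finance cannot report negative order totals.",
        check=lambda rows: all(r.get("total", 0) >= 0 for r in rows),
    ),
]

sample_orders = [
    {"order_id": 1, "customer_id": "C1", "total": 25.0},
    {"order_id": 2, "customer_id": None, "total": 10.0},  # violates the first use case
]

results = run_use_cases(USE_CASES, sample_orders)
```

Because each use case carries both its business rationale and its check, the same object documents the requirement and demonstrates whether the delivered data satisfies it, which is the essence of a test-driven approach to delivery.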

It’s part art and part science, but fully realizable with a properly focused system data architecture. Defining reference use cases and how they will support the architecture is a high bar to achieve. As the use cases are created and layered into the development plans as features of the solution, you must not get lost in the effort. To keep the focus, you need a vision, a strategy, and a clear mission.

The mission, the vision, and the strategy

You should begin with the mission and vision, for which this overview section has laid a foundation. These should be aligned with the organization’s strategy; if they are not, alignment must be achieved. We will elaborate on this in subsequent sections.

Principles and the development life cycle

Principles govern the choices made when developing the business strategy that defines the architecture, where technologists apply art and science to fulfill the business needs. Again, alignment is necessary; otherwise, the first problems that arise will not be easily surmountable. Mistakes made early in the engineering development life cycle cost far more than errors made later, and the data engineering life cycle begins with architecture.

The architecture definition, best practices, and key considerations

The architecture can be developed in many ways, but what we as engineers, architects, and authors have discovered is that the core architecture deliverable needs to have three main components:

  • A conceptual architecture
  • A logical architecture
  • A physical architecture

The conceptual architecture shows the business mission, vision, strategy, and principles, clearly implemented as an upward- and outward-facing communication tool. The conceptual architecture’s definition includes a capabilities matrix that shows all the capabilities needed for your solution, each mapped to deliverables. This will be the focus of Chapter 5, but for now, it is enough to know that the foundation of the solution’s concepts will be your principles, aligned with the vision, mission, and strategy of your business.

The logical architecture shows the software services and dataflows necessary to implement the conceptual architecture and ties the concepts to the logical implementation of those concepts. The physical architecture defines the deployable software, configurations, data models, ML models, and cloud infrastructure of the solution.
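One way to picture how a capability is traced from the conceptual layer down to the logical and physical layers is a simple mapping that can be checked for gaps. The sketch below is an assumption about how such a matrix might be represented; the capability, service, and artifact names are all hypothetical.

```python
# Hypothetical capability-to-deliverable mapping; every name is illustrative.
CAPABILITY_MATRIX = {
    "ingest_supplier_feeds": {
        "logical": ["ingestion_service", "supplier_dataflow"],
        "physical": ["ingest_job.py", "landing_bucket"],
    },
    "report_stock_levels": {
        "logical": ["reporting_service"],
        "physical": [],  # not yet mapped to anything deployable
    },
}


def unmapped_capabilities(matrix: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return the capabilities that lack a logical or a physical deliverable."""
    return sorted(
        name
        for name, layers in matrix.items()
        if not layers.get("logical") or not layers.get("physical")
    )


gaps = unmapped_capabilities(CAPABILITY_MATRIX)
```

A check like this makes it visible when a capability promised in the conceptual architecture has no service or deployable artifact behind it yet.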

Our best practices and key considerations are drawn from years of experience with big data processing systems and start-ups in the areas of finance, health, publishing, and science. Projects in those areas included analytics of social media, health, and retail data.

Use cases can be created using information contained in the logical as well as the physical architecture:

  • Logical use cases:
    • Software service flows
    • Dataflows
  • Physical use cases:
    • Infrastructural configuration information
    • Operational process information
    • Software component inventory
    • Algorithm parameterization
    • Data quality/testing definition and configuration information
    • DevOps/MLOps/TestOps/DataOps trace information

Reusable design patterns are groupings of these use cases that have clean interfaces and are generic enough to be repurposed across data domains, thereby reducing the cost to develop and operate them. Because the smart data framework’s organization simplifies the software design, use cases will coalesce into patterns easily, which will accelerate future software development. Dataflows will be represented by these design patterns, making them more than just static paper definitions. They will be operational design patterns that reflect the data journey through the data framework’s engineered solution, aligned with the architecture.
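As a rough illustration of a pattern with a clean interface that can be repurposed across data domains, the following sketch composes small, generic steps into a pipeline and applies the same pipeline to two unrelated datasets. The `Step` type, the helper functions, and the sample records are hypothetical and stand in for whatever interface your own framework defines.

```python
from typing import Callable, Iterable

Record = dict
# The shared interface: every step takes records and returns records.
Step = Callable[[Iterable[Record]], Iterable[Record]]


def drop_missing(field: str) -> Step:
    """Build a step that discards records where `field` is absent or None."""
    def step(records: Iterable[Record]) -> Iterable[Record]:
        return (r for r in records if r.get(field) is not None)
    return step


def rename(old: str, new: str) -> Step:
    """Build a step that renames key `old` to `new` in every record."""
    def step(records: Iterable[Record]) -> Iterable[Record]:
        return ({(new if k == old else k): v for k, v in r.items()} for r in records)
    return step


def compose(*steps: Step) -> Step:
    """Chain steps into a single step that applies them in order."""
    def pipeline(records: Iterable[Record]) -> Iterable[Record]:
        for s in steps:
            records = s(records)
        return records
    return pipeline


# One pattern, defined once...
clean_ids = compose(drop_missing("id"), rename("id", "record_id"))

# ...reused across two hypothetical data domains.
patients = [{"id": 1, "name": "a"}, {"id": None, "name": "b"}]
trades = [{"id": "T9", "qty": 5}]

clean_patients = list(clean_ids(patients))
clean_trades = list(clean_ids(trades))
```

Because every step honors the same records-in, records-out interface, the composed pattern knows nothing about patients or trades, which is what allows it to be reused without modification.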

The DataOps convergence

The data journey is a path from initial raw data ingestion, through classification, to the point where transformed information is positioned for end-user consumption. Curated, consumable data progresses through three zones of activity: bronze, silver, and gold. These zones are defined in more detail in subsequent chapters. Datasets are curated in a data factory manner, grouped logically and physically into these zones. Every custom-built and configured data pipeline journey hosted in the data factory follows a standard data engineering process, which you will develop; otherwise, IT operations and the maintenance of service levels through agreed contracts would be at risk. Data transformation and cataloging activities are centered around what others have termed DataOps.
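The zones are defined properly in later chapters, so the sketch below relies on an assumption: it follows the common convention in which bronze holds raw data as received, silver holds parsed and validated records, and gold holds consumption-ready aggregates. The function names and the sample stock feed are hypothetical.

```python
from collections import defaultdict


def to_bronze(raw_lines):
    """Bronze (assumed meaning): keep input as received, tagged with its line number."""
    return [{"line_no": i, "raw": line} for i, line in enumerate(raw_lines, start=1)]


def to_silver(bronze):
    """Silver (assumed meaning): parse and validate; malformed rows are skipped."""
    silver = []
    for rec in bronze:
        parts = rec["raw"].split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        sku, qty = parts[0].strip(), parts[1].strip()
        if not sku or not qty.isdigit():
            continue  # missing SKU or non-numeric quantity
        silver.append({"sku": sku, "qty": int(qty), "line_no": rec["line_no"]})
    return silver


def to_gold(silver):
    """Gold (assumed meaning): aggregate into a consumption-ready total per SKU."""
    totals = defaultdict(int)
    for rec in silver:
        totals[rec["sku"]] += rec["qty"]
    return dict(totals)


# Hypothetical raw feed: two of the five lines are malformed.
raw = ["A1, 5", "B2, 3", "bad row", "A1, 2", "C3, x"]

bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping `line_no` on the silver records preserves a trace back to the raw input, so a questionable gold total can be followed back to the lines that produced it.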

By 2025, a data engineering team guided by DataOps practices and tools will be 10 times more productive than teams that do not use DataOps. (2022 Gartner Market Guide for DataOps Tools, {https://packt-debp.link/6JRtF4})

DataOps, according to Gartner, is composed of five core capabilities:

  • Orchestration capabilities involve the following:
    • Connectivity
    • Scheduling
    • Logging
    • Lineage
    • Troubleshooting
    • Alerting
    • Workflow automation
  • Observability capabilities enable the following:
    • Monitoring of live or historic workflows
    • Insights into workflow performance
    • Cost metrics
    • Impact analysis
  • Environment management capabilities cover the following:
    • Infrastructure as code (IaC)
    • Resource provisioning
    • Credential management
    • IaC templates (for reuse)
  • Deployment automation capabilities include the following:
    • Version control
    • Approvals
    • Cloud CI/CD and pipelines
  • Test automation capabilities provide the following:
    • Validation
    • Script management
    • Data management

To illustrate how these DataOps principles can be applied, imagine a large retail company deploying an inventory management system. See Figure 1.2:

Figure 1.2 – Retail inventory management capabilities


Many third-party vendors have jumped on the DataOps hype and produced fantastic tooling to jump-start the convergence of DevOps, MLOps, and TestOps practices for modern cloud data systems.

The data engineering best practices of this book will also support the DataOps practices noted by Gartner while remaining neutral to the specific tooling choices. The focus will be on the data engineering framework that the DataOps effort will make streamlined, efficient, and future-proof. Refer to Figure 1.3:

Figure 1.3 – DataOps tools augmenting data management tasks


DataOps clearly adds value to legacy data management processes and makes new capabilities possible. The following quote shows how modern DataOps processes enable faster development:

A reference customer quoted that they were able to do 120 releases a month by adopting a DataOps tool that was suitable for their environment, as opposed to just one release every three months a year ago. (Gartner, 2022, {https://packt-debp.link/41DfFu})
