Use case definitions
Question: Why should the development of any data engineering system be use-case-driven?
Answer: If one cannot develop a solution that integrates with the business’s needs, it is irrelevant; it cannot be communicated, nor can its efficacy be quantified.
Without use cases, a data processing system does not provide the tangible value required to sustain its funding. Even if a solution is the best thing since peanut butter, it will quickly devolve into an ugly set of post-mortem discussions once failures to meet expectations start to arise.
The solution needs to be built as a self-documenting, self-demonstrable collection of use cases that supports a test-driven approach to delivery.
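As a minimal illustration of what a self-demonstrable, test-driven use case can look like (the pipeline step, data, and expectation below are hypothetical, chosen only for the sketch), a use case can be expressed directly as an executable test:

```python
# A sketch of a self-documenting, self-demonstrable use case written as a
# test. The pipeline step and data are hypothetical; substitute your own
# dataflow step and the business outcome it must demonstrate.

def deduplicate_customers(records):
    """Example pipeline step under test: keep the first record per customer_id."""
    seen, unique = set(), []
    for record in records:
        if record["customer_id"] not in seen:
            seen.add(record["customer_id"])
            unique.append(record)
    return unique


def test_use_case_customer_records_are_unique():
    """Use case: ingested customer records must be unique by customer_id."""
    raw = [
        {"customer_id": 1, "name": "Ada"},
        {"customer_id": 1, "name": "Ada"},    # duplicate row from a re-sent file
        {"customer_id": 2, "name": "Grace"},
    ]
    curated = deduplicate_customers(raw)
    assert [r["customer_id"] for r in curated] == [1, 2]
```

Run under a test runner such as pytest, the use case documents itself: the test name states the business expectation, and the assertion demonstrates it.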
It’s part art and part science, but it is fully realizable with a properly focused system data architecture. Defining reference use cases and how they will support the architecture is a high bar to clear. As the use cases are created and layered into the development plans as features of the solution, you must not get lost in the effort. To keep the focus, you need a vision, a strategy, and a clear mission.
The mission, the vision, and the strategy
You should begin with the mission and vision for which this overview section has laid a foundation. These should be aligned with the organization’s strategy, and if they are not… then alignment must be achieved. We will elaborate more on this in subsequent sections.
Principles and the development life cycle
Principles govern the choices made in developing the business strategy that defines the architecture, where technologists apply art and science to fulfill the business’s needs. Again, alignment is required and necessary; otherwise, the first problems that arise will be difficult to surmount. The cost of making mistakes early is far greater than the cost of making errors later in the engineering development life cycle. The data engineering life cycle begins with architecture.
The architecture definition, best practices, and key considerations
The architecture can be developed in many ways, but what we as engineers, architects, and authors have discovered is that the core architecture deliverable needs to have three main components:
- A conceptual architecture
- A logical architecture
- A physical architecture
The conceptual architecture presents the business mission, vision, strategy, and principles, and serves as an upward- and outward-facing communications tool. The conceptual architecture’s definition includes a capabilities matrix that lists every capability your solution needs and maps each one to deliverables. This will be the focus of Chapter 5, but for now, it is enough to know that the foundation of the solution’s concepts will be your principles, aligned with the vision, mission, and strategy of your business.
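As a minimal sketch of what that traceability can look like in practice (the capability names, services, and deliverables below are assumptions for illustration, not a prescribed catalog), a capabilities matrix can map each capability to the logical and physical deliverables that realize it:

```python
# An illustrative capabilities matrix. The capability names, services, and
# deliverables are assumptions for this sketch only; a real matrix is derived
# from your own mission, vision, strategy, and principles.
capabilities_matrix = {
    "data ingestion": {
        "logical": ["ingestion service", "landing-zone dataflow"],
        "physical": ["object storage bucket", "ingestion job definition"],
    },
    "data quality": {
        "logical": ["validation service"],
        "physical": ["rule configuration files", "quality dashboard"],
    },
    "analytics serving": {
        "logical": ["semantic layer", "query service"],
        "physical": ["warehouse schema", "BI workspace"],
    },
}

# Traceability check: every capability must map to at least one logical and
# one physical deliverable, or it has no path into the development plan.
for capability, layers in capabilities_matrix.items():
    assert layers["logical"] and layers["physical"], f"unmapped capability: {capability}"
```

The point of the sketch is the mapping itself: a capability that cannot be traced to a deliverable cannot be planned, built, or demonstrated.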
The logical architecture shows the software services and dataflows necessary to implement the conceptual architecture, tying the concepts to their logical implementation. The physical architecture defines the deployable software, configurations, data models, ML models, and cloud infrastructure of the solution.
Our best practices and key considerations are drawn from years of experience with big data processing systems and start-ups in finance, health, publishing, and science, where projects included analytics over social media, health, and retail data.
Use cases can be created using information contained in both the logical and the physical architecture (see the sketch after this list):
- Logical use cases include:
- Software service flows
- Dataflows
- Physical use cases include:
- Infrastructural configuration information
- Operational process information
- Software component inventory
- Algorithm parameterization
- Data quality/testing definition and configuration information
- DevOps/MLOps/TestOps/DataOps trace information
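The sketch below, a hypothetical Python record whose field names are assumptions chosen for illustration, shows how a single use case can carry both its logical-architecture facts (service flow, dataflow) and its physical-architecture facts (infrastructure, operations, quality checks):

```python
from dataclasses import dataclass, field

# Illustrative only: the field names are assumptions showing how logical-
# architecture facts and physical-architecture facts can live on one record.

@dataclass
class UseCase:
    name: str
    # From the logical architecture
    service_flow: list = field(default_factory=list)    # ordered software services
    dataflow: list = field(default_factory=list)        # ordered dataset hops
    # From the physical architecture
    infrastructure: dict = field(default_factory=dict)  # deployment configuration
    operations: dict = field(default_factory=dict)      # schedules, SLAs, runbooks
    quality_checks: list = field(default_factory=list)  # data quality/test definitions


nightly_sales_load = UseCase(
    name="nightly sales load",
    service_flow=["ingestion service", "validation service", "curation service"],
    dataflow=["raw sales files", "validated sales", "curated sales mart"],
    infrastructure={"compute": "small cluster", "storage": "object store"},
    operations={"schedule": "02:00 UTC", "on_failure": "page the data on-call"},
    quality_checks=["row counts match source", "no null order IDs"],
)
```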
Reusable design patterns are groupings of these use cases that have clean interfaces and are generic enough to be repurposed across data domains, thereby reducing the cost to develop and operate them. Because the smart data framework’s organization simplifies the software design, use cases will coalesce into patterns easily, which will accelerate future software development. Dataflows will be represented by these design patterns, which makes them more than just static paper definitions: they will be operational design patterns that reflect the data’s journey through the framework’s engineered solution, aligned with the architecture.
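To hint at what a clean, domain-agnostic interface can look like in code, here is a minimal sketch; the read/transform/write split and the function names are assumptions for illustration, not a prescribed framework:

```python
from typing import Callable, Iterable

# A reusable read -> transform -> write dataflow pattern. Because the reader,
# transform, and writer are injected, the same pattern can be repurposed
# across data domains without changing the pattern itself.
def run_dataflow(
    read: Callable[[], Iterable[dict]],
    transform: Callable[[dict], dict],
    write: Callable[[Iterable[dict]], None],
) -> None:
    write(transform(record) for record in read())


# Hypothetical usage: an inventory feed today, a sales feed tomorrow.
run_dataflow(
    read=lambda: [{"sku": "A1", "qty": "3"}],
    transform=lambda r: {**r, "qty": int(r["qty"])},
    write=lambda rows: print(list(rows)),
)
```

Because the reader, transform, and writer are supplied by the caller, the same pattern serves many domains, which is exactly what lowers the cost to develop and operate it.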
The DataOps convergence
The data journey is a path from initial raw data ingestion, through classification, to transformed information positioned for end user consumption. Curated, consumable data progresses through zones of activity, defined in more detail in subsequent chapters: bronze, silver, and gold. Datasets are curated in a data factory manner, logically and physically grouped into these zones. Every custom-built, configured data pipeline journey hosted in the data factory follows a standard data engineering process, which you will develop; otherwise, IT operations and the maintenance of agreed service levels would be at risk. Data transformation and cataloging activities center on what others have coined DataOps.
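A minimal sketch of that journey, assuming a simple in-memory pipeline with invented cleansing and aggregation rules, shows how a dataset can be promoted through the bronze, silver, and gold zones:

```python
# Hypothetical medallion-style journey: raw records land in bronze, are
# cleansed into silver, and are aggregated into a gold, consumable view.

raw_events = [
    {"store": "north", "item": "widget", "qty": "2"},
    {"store": "north", "item": "widget", "qty": "bad"},  # fails data quality
    {"store": "south", "item": "widget", "qty": "5"},
]

# Bronze: ingest as-is, preserving the raw payload.
bronze = list(raw_events)

# Silver: validate and standardize; reject rows that fail the quality rules.
silver = []
for row in bronze:
    try:
        silver.append({**row, "qty": int(row["qty"])})
    except ValueError:
        pass  # in practice, route the row to a quarantine dataset

# Gold: curate for end user consumption (total quantity per item).
gold = {}
for row in silver:
    gold[row["item"]] = gold.get(row["item"], 0) + row["qty"]

print(gold)  # {'widget': 7}
```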
DataOps, according to Gartner, is composed of five core capabilities:
- Orchestration capabilities involve the following:
- Connectivity
- Scheduling
- Logging
- Lineage
- Troubleshooting
- Alerting
- Workflow automation
- Observability capabilities enable the following:
- Monitoring of live or historic workflows
- Insights into workflow performance
- Cost metrics
- Impact analysis
- Environment management capabilities cover the following:
- Infrastructure as code (IaC)
- Resource provisioning
- Credential management
- IaC templates (for reuse)
- Deployment automation capabilities include the following:
- Version control
- Approvals
- Cloud CI/CD and pipelines
- Test automation capabilities provide the following:
- Validation
- Script management
- Data management
To illustrate how these DataOps principles can be applied, imagine a large retail company deploying an inventory management system. See Figure 1.2:
Figure 1.2 – Retail inventory management capabilities
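As a hedged, tool-agnostic sketch (the step names, data, and checks are assumptions, not taken from Figure 1.2), three of these capabilities, orchestration, observability, and test automation, could be applied to the inventory pipeline like this:

```python
import logging
import time

# Tool-agnostic sketch applying three DataOps capabilities to a hypothetical
# inventory pipeline: orchestration (steps run in a declared order),
# observability (each step is timed and logged), and test automation
# (a validation step that fails the run on bad data).

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("inventory_pipeline")


def timed(step):
    """Observability: log each step's duration."""
    def wrapper(*args, **kwargs):
        started = time.monotonic()
        result = step(*args, **kwargs)
        log.info("%s finished in %.3fs", step.__name__, time.monotonic() - started)
        return result
    return wrapper


@timed
def ingest_inventory():
    return [{"sku": "A1", "on_hand": 12}, {"sku": "B2", "on_hand": 0}]


@timed
def validate(rows):
    # Test automation: fail fast on impossible stock levels.
    assert all(row["on_hand"] >= 0 for row in rows), "negative stock detected"
    return rows


@timed
def publish(rows):
    log.info("published %d inventory rows", len(rows))


# Orchestration: the pipeline is the declared order of its steps.
publish(validate(ingest_inventory()))
```

In a real deployment, the remaining capabilities, environment management and deployment automation, would sit around this code as IaC templates and CI/CD pipelines rather than inside it.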
Many third-party vendors have jumped on the DataOps hype and produced fantastic tooling to jump-start the convergence of DevOps, MLOps, and TestOps practices for modern cloud data systems.
The data engineering best practices in this book will also support the DataOps practices noted by Gartner while remaining neutral on specific tooling choices. The focus will be on the data engineering framework, which the DataOps effort will make streamlined, efficient, and future-proof. Refer to Figure 1.3:
Figure 1.3 – DataOps tools augmenting data management tasks
It is clear that DataOps adds significant value to legacy data management processes, enabling a future in which new capabilities become possible. The following quote shows how modern DataOps processes will enable faster development: