Governing without impeding
As architects, once we have defined the architectural boundaries of the system, we need to let go and get out of the way, unless we want to become an impediment to innovation. But letting go is difficult. It goes against our nature; we like to be hands-on. And it flies in the face of traditional governance techniques. But we must let go for the sake of the business, whether the business realizes this or not.
Governance has an understandable reputation for getting in the way of progress and innovation. Despite its good intentions, the traditional manual approach to governance actually increases risk instead of reducing it, because it lengthens lead time and diminishes an organization’s ability to react to challenges in a modern, dynamic environment. But it doesn’t have to be this way.
We have already taken major strides to mitigate the risks of continuous innovation. We define architectural boundaries that limit the scope of any given change, and we fortify these boundaries to control the blast radius when honest human errors happen. We do this because we know that to err is human. We know that mistakes are inevitable, no matter how rigorous a governance process we follow.
Instead of impeding innovation, we must empower teams with a culture and a platform that embrace continuous governance. This safety net gives teams and management the confidence to move forward, knowing that we can catch mistakes and make corrections in real time. Automation and observability are the key elements of continuous governance. Let’s see how we can put this safety net in place and foster a culture of robustness.
Providing automation and cross-cutting concerns
A major objective of governance is to ensure that a system is compliant with regulations and best practices. These include the typical -ilities, such as scalability and reliability, and of course security, along with regulations and standards such as NIST, PCI, GDPR, and HIPAA. The traditional approach relies on manual audits of the architecture. These gates are the reason governance has a reputation for impeding progress. They are labor intensive and, worse yet, error prone.
Fortunately, we now have a better option. Our deployments are fully automated by our CI/CD pipelines. This is already a significant improvement in quality because Infrastructure as Code reduces human error and enables us to quickly fail forward. We still have some manual gates for each deployment.
The first gate is code review and approval of a pull request. We perform this gate quickly because each task branch has a small batch size. The second gate is the certification of a regional canary deployment. We deploy to one region for continuous smoke testing before deploying to other regions. We will cover CI/CD pipelines in detail in Chapter 11, Choreographing Deployment and Delivery.
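As a rough sketch of how these two gates might be wired into an automated pipeline, the following example uses AWS CDK Pipelines to deploy to a canary region, smoke test it, and require certification before promoting to the remaining regions. The tooling, repository, and regions are illustrative assumptions rather than a definitive setup; Chapter 11 covers the real pipelines in detail.

```typescript
// Sketch: a multi-region pipeline with a regional canary gate (AWS CDK Pipelines).
// Repository, regions, and command names are hypothetical.
import { App, Stack, StackProps, Stage, StageProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import {
  CodePipeline,
  CodePipelineSource,
  ManualApprovalStep,
  ShellStep,
} from 'aws-cdk-lib/pipelines';

// The autonomous subsystem's resources, deployed once per region.
class SubsystemStage extends Stage {
  constructor(scope: Construct, id: string, props?: StageProps) {
    super(scope, id, props);
    new Stack(this, 'Subsystem'); // stand-in for the subsystem's stacks
  }
}

class PipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const pipeline = new CodePipeline(this, 'Pipeline', {
      // Gate 1 happens upstream: the pull request is reviewed and approved
      // before the task branch is merged and this pipeline is triggered.
      synth: new ShellStep('Synth', {
        input: CodePipelineSource.gitHub('my-org/my-subsystem', 'main'),
        commands: ['npm ci', 'npm run build', 'npx cdk synth'],
      }),
    });

    // Deploy to the canary region first and continuously smoke test it.
    pipeline.addStage(
      new SubsystemStage(this, 'Canary', { env: { region: 'us-east-1' } }),
      { post: [new ShellStep('SmokeTest', { commands: ['npm run smoke'] })] },
    );

    // Gate 2: certify the regional canary before deploying to other regions.
    pipeline.addStage(
      new SubsystemStage(this, 'West', { env: { region: 'us-west-2' } }),
      { pre: [new ManualApprovalStep('CertifyCanary')] },
    );
  }
}

new PipelineStack(new App(), 'SubsystemPipeline');
```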
We also have observability, which provides timely, actionable information so that we know when to jump into action and can recover quickly. We will cover this in Chapter 12, Optimizing Observability. We will take automation further and harden our build processes by adding continuous auditing and securing the perimeter of our subsystems and our cloud accounts. We will cover these topics in Chapter 10, Securing Autonomous Subsystems in Depth.
However, these are all cross-cutting concerns, and we don’t want teams to reinvent these capabilities for each autonomous subsystem. We need a dedicated team with the knowledge and specialized skills to manage an integrated suite of SaaS tools, stamp out accounts with a standard set of capabilities, and maintain these cross-cutting concerns for use across the accounts. Yet the owners of each autonomous subsystem must retain control over when to apply changes to their accounts and have the flexibility to override or enhance features as their circumstances dictate.
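For example, the dedicated team might publish these cross-cutting defaults as a versioned construct library that each subsystem team consumes on its own schedule and overrides where needed. The following AWS CDK sketch is illustrative only; the construct name, properties, and defaults are assumptions rather than a prescribed implementation.

```typescript
// Sketch: a shared construct that bakes in governed defaults but lets
// subsystem teams override them. Names and defaults are hypothetical.
import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

export interface GovernedFunctionProps {
  handler: string;
  code: lambda.Code;
  // Teams may override the governed defaults as their circumstances dictate.
  timeout?: Duration;
  logRetention?: logs.RetentionDays;
}

// Published by the platform team as a versioned package; each autonomous
// subsystem decides when to upgrade to a new release.
export class GovernedFunction extends Construct {
  constructor(scope: Construct, id: string, props: GovernedFunctionProps) {
    super(scope, id);
    new lambda.Function(this, 'Fn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: props.handler,
      code: props.code,
      timeout: props.timeout ?? Duration.seconds(6),
      logRetention: props.logRetention ?? logs.RetentionDays.ONE_MONTH,
    });
  }
}
```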
Even with these cross-cutting concerns in place, the reality is that many aspects of the approach and architecture are new and unfamiliar, so the next part of the governance equation is promoting a culture of robustness.
Promoting a culture of robustness
Our goal of increasing the pace of innovation leads us to a rapid feedback loop with small batch sizes and short lead times. We are deploying code much more frequently, and these deployments must result in zero downtime. To eliminate downtime, we must uphold the contracts we have defined within the system. However, traditional versioning techniques fall apart in a dynamic environment with a high rate of change. Instead, we will apply the Robustness principle.
The Robustness principle states: be conservative in what you send, be liberal in what you receive. This principle is well suited for continuous deployment, where we can perform a succession of deployments: first a conforming change on one side of a contract, then an upgrade on the other side, and finally another deployment on the first side to remove the old code. The trick is to develop a culture of robustness where this three-step dance is committed to team muscle memory and becomes second nature.
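To make the three-step dance concrete, consider a hypothetical contract change where an order event renames a total field. The following TypeScript sketch shows the consumer being liberal in what it receives during the transition; the event shape and field names are assumptions for illustration.

```typescript
// Sketch: applying the Robustness principle while renaming a field
// from `total` to `grandTotal` (hypothetical contract change).

interface OrderEvent {
  orderId: string;
  grandTotal?: number; // new field name
  total?: number;      // old field name, still sent by not-yet-upgraded producers
}

// Step 1: deploy the consumer first, liberal in what it receives,
// accepting either the old or the new field.
function toOrder(event: OrderEvent) {
  return {
    orderId: event.orderId,
    grandTotal: event.grandTotal ?? event.total ?? 0,
  };
}

// Step 2: upgrade the producers to be conservative in what they send,
// emitting only `grandTotal`.
// Step 3: once all producers are upgraded, a final deployment removes
// the fallback to `total` from the consumer.
```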
In Chapter 11, Choreographing Deployment and Delivery, we will cover a lightweight continuous delivery process that is geared for robustness. It includes three levels of planning, GitOps, CI/CD pipelines, regional canary deployment, and more. It forms a simple automated bureaucracy that governs each deployment but leaves the order of deployments completely flexible.
In my experience, autonomous teams are eager to adopt a culture of robustness, especially once they get a feel for how much more productive and effective they can become. But this is a paradigm shift, and it is unfamiliar from a traditional governance perspective. Everyone must have the confidence to move at this pace. As architects, we need to be evangelists and promote this cultural change, both upstream and downstream. We need to educate everyone on how everything we are doing comes together to provide a safety net for continuous discovery.
Finally, let’s see how metrics can guide governance.
Harnessing the four key team metrics
Observability metrics are an indispensable tool in modern software development. We cover this topic in detail in Chapter 12, Optimizing Observability. Autonomous teams are responsible for leveraging the observability metrics of their apps and services as a tool for self-governance and self-improvement. In my experience, teams truly value these insights and thrive on the continuous feedback.
From a systemwide governance perspective, we should focus our energy on helping teams that are struggling. In their article Measure Software Delivery Performance with Four Key Metrics (https://itrevolution.com/articles/measure-software-delivery-performance-four-key-metrics), Nicole Forsgren, Gene Kim, and Jez Humble put forth four metrics that we can harness to help us identify which teams may need more assistance and mentoring:
- Lead time: How long does it take a team to complete a task and push the change to production?
- Deployment rate: How many times a day is a team deploying changes to production?
- Failure rate: How often does a deployment result in a failure that impacts a generally available feature?
- Mean Time to Recovery (MTTR): When a failure does occur, how long does it take the team to fail forward with a fix?
The answers to these questions are a clear indicator of a team’s maturity. We certainly prefer lead time, failure rate, and MTTR to be low and deployment rate to be high. Teams that are having trouble with these metrics are usually going through their own digital transformation and are eager to receive mentoring and coaching. We can collect these metrics from our issue-tracking software and CI/CD tool and track them alongside all the others in our observability tool.
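As a simple illustration of how these four metrics might be derived once the raw data is collected, the following TypeScript sketch computes them from a list of deployment records. The record shape and reporting period are assumptions; in practice, the aggregation would live in the observability tool.

```typescript
// Sketch: deriving the four key metrics from deployment records.
// The record shape is hypothetical.
interface DeploymentRecord {
  taskStartedAt: Date;  // when work on the task branch began
  deployedAt: Date;     // when the change reached production
  failed: boolean;      // did it impact a generally available feature?
  recoveredAt?: Date;   // when the team failed forward with a fix
}

const HOURS = 1000 * 60 * 60;
const avg = (xs: number[]) =>
  xs.length ? xs.reduce((sum, x) => sum + x, 0) / xs.length : 0;

function fourKeyMetrics(records: DeploymentRecord[], periodInDays: number) {
  const failures = records.filter((r) => r.failed);
  return {
    // Lead time: average hours from starting a task to deploying it
    leadTimeHours: avg(records.map(
      (r) => (r.deployedAt.getTime() - r.taskStartedAt.getTime()) / HOURS)),
    // Deployment rate: deployments per day over the period
    deploymentsPerDay: records.length / periodInDays,
    // Failure rate: fraction of deployments that impacted a GA feature
    failureRate: records.length ? failures.length / records.length : 0,
    // MTTR: average hours from a failed deployment to the fix
    mttrHours: avg(failures.filter((r) => r.recoveredAt).map(
      (r) => (r.recoveredAt!.getTime() - r.deployedAt.getTime()) / HOURS)),
  };
}
```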