Dividing a system into autonomous subsystems
The goal of software architecture is to define boundaries that enable the components of the system to change independently. We could dive straight down into defining the individual services of the system, but as the number of services grows the system will become unwieldy. Doing so contributes to the creation of microliths and microservice death stars. As architects, our job is to facilitate change, which includes architecting a manageable system.
We need to step back and look at the bigger picture. We must break a complex problem down into ever-smaller problems that we can solve individually and then combine into the ultimate solution. We need to divide the system into a manageable set of high-level subsystems that each has a single reason to change. These subsystems will constitute the major bounded contexts of the system. We will apply the SRP along different dimensions to help us arrive at boundaries that enable change. This will facilitate organizational scale with separate groups managing the individual subsystems.
We also need our subsystems to be autonomous, in much the same way that we create autonomous services. This will give autonomous organizations the confidence to continuously innovate within their subsystems. We will accomplish this by creating bulkheads between the subsystems. A system resembling the event-first topology depicted in Figure 2.3 will begin to emerge. The purpose of each subsystem will be clear, and the subsystem architecture will allow the system to evolve in a dynamic business environment.
Let’s look at some ways we can divide up a system.
By actor
A logical place to start carving up a system into subsystems is along the external boundaries with the external actors. These actors are the users and the external systems that directly interact with the system. Following the SRP, each subsystem might be responsible to one and only one actor.
In our event storming example earlier, we identified a set of domain events for a food delivery system. During the event storming workshop, we would also identify the users (yellow) and external systems (pink) that produce or consume those events, such as in Figure 2.4. In this example, we might have a separate subsystem for each category of the user: Customer, Driver, and Restaurant. We may also want a subsystem for each category of the external system, such as relaying orders to the restaurant’s ordering systems, processing payments, and pushing notifications to customers:
Figure 2.4: System context diagram
Of course, this is a simple example. Enterprise systems may have many kinds of users and lots of external systems, including legacy and third-party systems. In this case, we will need to look for good ways to organize actors into cohesive groups and these groups may align with the business units.
By business unit
Another good place to look for architectural boundaries is between business units. A typical organizational chart can provide useful insights. Each unit will ultimately be the business owner of its subsystems and thus they will have a significant impact on when and how the system changes.
Keep in mind that we are interested in the organization of the company, not the IT department. Conway’s Law states organizations are constrained to produce designs which are copies of their communication structure. We have seen that the communication structure leads to dependencies, which increases lead time and reduces the pace of innovation. So, we want to align each autonomous subsystem with a single business unit. We often refer to this approach as the Inverse Conway Maneuver.
However, the organizational structure of a company can be unstable. A company may reorganize its business units for a variety of reasons. So, we should look deeper into the work the business units actually perform.
By business capability
Ultimately, we want to draw our architectural boundaries around the actual business capabilities that the company provides. Each autonomous subsystem should encapsulate a single business capability or at most a set of highly cohesive capabilities.
Going back to our event-storming approach and our event-first thinking, we are looking for logical groupings of related events (that is, verbs). There will be high temporal cohesion within these sets of domain events. They will be initiated by a group of related actors that are working together to complete an activity. For example, in our food delivery example, a driver may interact with a dispatch coordinator to ensure that a delivery is successful.
The key here is that the temporal cohesion of the activities within a capability helps to ensure that the components of a subsystem will tend to change together. This cohesion allows us to scale the SRP to the subsystem level when there are many different actors. The individual services within a subsystem will be responsible to the individual actors, whereas a subsystem is responsible to a single set of actors that work together in a business process (purple) to deliver a business capability:
Figure 2.5: Capabilities subsystems
Figure 2.5 depicts our food delivery system from the capabilities perspective. It is similar to the system context diagram in Figure 2.4, but the functionality is starting to take shape. However, we may find more subsystems when we look at the system from the perspective of the data life cycle.
By data life cycle
Another place to look for architectural boundaries is along the data life cycle. Over the course of the life of a piece of data, the actors that use and interact with the data will change and so will their requirements. Bringing the data life cycle into the equation will help uncover some overlooked subsystems. We will usually find these subsystems near the beginning and the end of the data life cycle. In essence, we are applying the SRP all the way down to the database level. We want to discover all the actors that interact with the data so that we can find all the actors and isolate these sources of change into their own bounded contexts (that is, an autonomous subsystem).
Going back to event-first thinking, we are stepping back and taking a moment to focus on the nouns (that is, domain aggregates) so that we can discover more verbs (that is, domain events) and the actors that produce those events. This will help find what I refer to as slow data (green). We typically zero in on the fast data (tan). This is the transactional data in the system that actors are continuously creating. However, the transactional data often relies on reference data and government regulation may impose records management requirements on how long we must retain the transactional data. We want to decouple these sources of change so that they do not impact the flexibility and performance of the transactional and analytics data. We will cover this topic in detail in Chapter 5, Turning the Cloud into the Database:
Figure 2.6: Data life cycle subsystems
Figure 2.6 depicts subsystems from the data life cycle perspective. We will likely need a subsystem upstream that owns the master data model that all downstream subsystems use as reference data. And all the way downstream we will usually have an analytics subsystem and a records management subsystem. In the middle lies the transactional subsystems that provide the capabilities of the system, like we saw in Figure 2.5. We will also want to carve out subsystems for any legacy systems.
By legacy system
Our legacy systems are a special case. In Chapter 7, Bridging Intersystem Gaps, and Chapter 13, Don’t Delay, Start Experimenting, we will cover an event-first migration pattern for integrating legacy systems known as the Strangler pattern. Without going into detail here, we are creating an anti-corruption layer around the legacy systems that enable them to interact with the new system by producing and consuming domain events. This creates an evolutionary migration path that minimizes risk by keeping the legacy systems active and synchronized until we are ready to decommission them. This extends the substitution principle to the subsystem level because we can simply remove the legacy subsystem once the migration is complete.
We are essentially treating the legacy systems as an autonomous subsystem with bulkheads that we design to eliminate coupling between the old and the new and to protect the legacy infrastructure by controlling the attack surface and providing backpressure. We will use the same bulkhead techniques we are using for all subsystems: separate cloud accounts and external domain events.
Let’s look at how we can create these subsystem bulkheads next.