Software Architecture Patterns for Serverless Systems

Defining Boundaries and Letting Go

In Chapter 1, Architecting for Innovation, we learned that the role of architecture is to enable change so that autonomous teams can confidently and continuously deliver business value. The key concept here is autonomy, but we need more than just autonomous teams. We need an architecture that promotes autonomy. We ultimately accomplish this by creating autonomous services with fortified boundaries.

But before we can fortify our boundaries, we need to define those boundaries. This is where our functional architecture emerges and dovetails with our technical architecture. We will define our boundaries at multiple levels, including the subsystem, service, and function levels. We define these boundaries up front so that autonomous teams can work in parallel with guard rails in place as they experiment within their boundaries.

In this chapter, you will learn how to define these boundaries and we will introduce our autonomous service patterns. Then we will dig into the details in the remaining chapters. But first, we will review the proven concepts and guiding principles that help us shape our boundaries so that they are meaningful and flexible.

At the tail end of the chapter, we will touch on the hardest thing of all for architects to do, which is to let go and trust the autonomous teams to deliver within their boundaries. Here, we will look at how we can govern without impeding innovation.

In this chapter, we’re going to cover the following main topics:

  • Learning the hard way
  • Building on proven concepts
  • Thinking about events first
  • Dividing a system into autonomous subsystems
  • Creating subsystem bulkheads
  • Dissecting an autonomous subsystem
  • Dissecting an autonomous service
  • Governing without impeding

Learning the hard way

You may be wondering why it is so important to define architectural boundaries. We all want to jump right in and start coding. But it is easy to get lost in the details of a new project, and we can find ourselves on a slippery slope if we do not set our bearings from the start.

I had a run-in with architecture early in my career that left an indelible impression on me. It was the 90s and n-tiered architecture was all the rage. I was the architect of a system with a middle tier that we wrote in C++ and ran on Tuxedo. This was well before Continuous Integration (CI) had emerged, so we lived by our nightly builds. One morning I arrived at work and found that the nightly build was still running. It was a large system with many subsystems, multiple services, and a significant quantity of code, but a nightly build that took all night was an indication that something was wrong. The system was still growing, so things would only get worse if we did not identify and remediate the root cause.

It didn’t take long to find the root cause, but the solution, although simple, would be tedious to roll out. In C++, we define classes in header files and include these files where we use the classes. However, there is no restriction on how many classes you can define in a header file. Our domain entities encapsulated our private persistence classes, but all these private classes were leaking because we defined them in the same header files. We were building this private code over and over again, everywhere that we used the domain entities. As the system grew, this mistake became more and more expensive.

The SOLID principles did not exist at that time, but in essence, the system violated the Interface Segregation Principle. The header files contained more interfaces than necessary. The simple solution was to move these internal classes into private header files. We immediately began to implement all new features this way and a few strategic updates brought the builds back under control.

But the project was behind schedule, so the rest of the changes would have to wait. I took it upon myself to make those changes as time permitted. It took six months to retrofit the remainder of the system. This experience taught me the hard way about the importance of clean architectural boundaries.

More often than not, the hard way is the best way to learn. We need to discover the right solutions for our end users. We need to get our hands dirty and experiment to find out what works and what does not. But we need to define architectural boundaries at multiple levels that will act as guard rails as we perform our experiments.

As an industry, we have already learned a lot about what makes for good clean architecture. So, let’s look at some of the proven concepts we will be building on.

Building on proven concepts

If you are new to serverless computing, then you may be wondering how serverless architecture differs from more traditional architectures. Of course, there are differences when we get down to the details. But, by and large, serverless architecture builds on the same proven concepts that we should employ no matter what software we are writing.

We will use Domain-Driven Design (DDD), apply the SOLID principles, and strive to implement a clean Hexagonal Architecture. One of the more interesting differences, though, is that we will apply these concepts at multiple levels. We will apply them differently at the subsystem level (that is, macro architecture), the service level (that is, micro architecture), and the function level (that is, nano architecture).

Let’s review these concepts and see where we will apply them in our serverless architecture and how we will do so differently than we have in the past.

Domain-driven design

When I started my career, our industry was struggling with the paradigm shift to object-oriented programming. The Unified Modeling Language (UML) eventually emerged, but we all used it in different ways. The Gang of Four’s book Design Patterns, Gregor Hohpe and Bobby Woolf’s book Enterprise Integration Patterns, and Martin Fowler’s book Patterns of Enterprise Application Architecture helped us create better designs. But it wasn’t until Eric Evans’ book Domain-Driven Design took root that we all had common semantics that we could use to produce more familiar designs. Let’s review the main concepts of DDD that we will employ throughout this book: bounded context, domain aggregate, and domain event.

Bounded context

A bounded context defines a clear boundary between different areas of a domain. Within a bounded context, the domain model should be consistent and clear. Between bounded contexts, we use context maps to define the relationships between the models. Defining these boundaries is the main thrust of this chapter. We will be using this concept to define our autonomous subsystems and therefore our cloud accounts as well. In Chapter 7, Bridging Intersystem Gaps, we will learn how to create an anti-corruption layer to implement the context mapping between subsystems (that is, bounded contexts).

Domain aggregate

A domain aggregate is a top-level domain entity that groups related entities together in a cohesive unit. We will exchange data between services and subsystems at this aggregate level. We will cover data modeling and introduce the idea of data life cycle architecture in Chapter 5, Turning the Cloud into the Database. We will introduce the Backend For Frontend (BFF) pattern later in this chapter and in Chapter 6, A Best Friend for the Frontend, we will see how each BFF service owns a domain aggregate at a specific phase in the data life cycle.
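For example, an Order aggregate might group its line items and delivery address into one cohesive unit that we exchange as a whole. The following TypeScript sketch is purely illustrative; none of these names come from the book’s code:

```typescript
// A minimal sketch of a domain aggregate: the Order is the top-level
// entity, and its related entities travel with it as one cohesive unit.
interface OrderItem {
  menuItemId: string;
  quantity: number;
  price: number;
}

interface Address {
  street: string;
  city: string;
  postalCode: string;
}

interface Order {
  id: string;                 // aggregate identifier
  customerId: string;
  status: 'placed' | 'preparing' | 'delivered';
  items: OrderItem[];         // child entities owned by the aggregate
  deliveryAddress: Address;
  placedAt: string;           // ISO-8601 timestamp
}
```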

Domain event

A domain event represents a change in state of a domain aggregate. These events are fundamental in our event-driven architecture and the heart of our autonomous services. We will cover the idea of event-first thinking in this chapter and event processing in Chapter 4, Trusting Facts and Eventual Consistency.

And of course, we will strive to use a ubiquitous language within each bounded context.

SOLID principles

SOLID is an acronym for a set of guiding design principles that help us implement clean, flexible, and maintainable software. The principles include the following:

  • Single Responsibility Principle
  • Open-Closed Principle
  • Liskov Substitution Principle
  • Interface Segregation Principle
  • Dependency Inversion Principle

Robert Martin first promoted these principles for improving object-oriented designs. Since then, these principles have proven equally valuable at the architectural level. Although the implementation of the principles varies at different levels, it is convenient to apply a consistent mental framework throughout. Let’s see how we can use each of these principles to guide the creation of evolutionary architectures that enable change.

Single Responsibility Principle

The Single Responsibility Principle (SRP) states that a module should be responsible to one, and only one, actor.

The SRP is simultaneously the most referenced and arguably the most misunderstood of all the SOLID principles. At face value, the name of this principle suggests that a module should do one thing and one thing only. This misunderstanding leads to the creation of fleets of microservices that are very small, very interdependent, and tightly coupled. Ultimately, this results in the creation of the microliths and microservice death stars that I mentioned in the first chapter. The author of this principle, Robert (Uncle Bob) Martin, has made an effort to correct this misunderstanding in his book Clean Architecture.

The original and common definition of this principle states that a module should have one, and only one, reason to change. The operative word here is change. This is crucial in the context of architecture because the purpose of architecture is to enable change. However, when we apply this principle incorrectly, it tends to have the opposite effect, because highly interconnected modules impede change. The coupling increases the need for inter-team communication and coordination and thus slows down innovation.

In his latest definition of the SRP, Uncle Bob is focusing on the source of change. Actors (that is, people) drive changing requirements. These are the people who use the software, the stakeholders, the owners of external systems that interact with the system, and even the governing bodies that impose regulations. After all, people are the reason that software systems exist in the first place.

The goal of this principle is to avoid creating a situation where we have competing demands on a software module that will eventually tie it in knots and ultimately impede change. We achieve this goal by defining architectural boundaries for the different actors. This helps ensure that the individual modules can change independently with the whims of their actors. Uncle Bob also refers to this as the Axis of Change, where the modules on either side of the boundary change at different rates.

Later in this chapter, we will use the SRP to divide our software system into autonomous subsystems and to decompose subsystems into autonomous services. The architectural boundaries will also extend vertically across the tiers, so that the presentation, business logic, and data tiers can all change together.

Open-Closed Principle

The Open-Closed Principle (OCP) states that a module should be open for extension but closed for modification. Bertrand Meyer is the original author of this principle.

This principle highlights how human nature impacts a team’s pace of innovation. We naturally slow down when we modify an existing piece of software, because we have to account for all the impacts of that change. If we are worried about unintended side effects, then we may even resist change altogether. Conversely, adding new capabilities to a system is far less daunting. We are naturally more confident and willing to move forward when we know that the existing usage scenarios have been completely untouched.

An architecture that enables change is open to adding and experimenting with new features while leaving existing functionality completely intact to serve the current user base. With the SRP, we define the architectural boundaries along the axis of change to limit the scope of a given change to a single service. We must also close off other services to any inadvertent side effects by fortifying the boundaries with bulkheads on both sides.

Events will be our main mechanism for achieving this freedom. We can extend existing services to produce new event types without side effects. We can add new services that produce existing event types without impacting existing consumers. And we can add new consumers without changes to existing producers. At the presentation tier, micro frontends will be our primary mechanism for extension, which we will cover in Chapter 3, Taming the Presentation Tier.

Sooner or later, modification and refactoring of existing code is inevitable. When this is necessary, we will employ the Robustness principle to mitigate the risks up and down the dependency chains. We will cover common scenarios for extending and modifying systems with zero downtime in Chapter 11, Choreographing Deployment and Delivery.

Liskov Substitution Principle

The Liskov Substitution Principle (LSP) states that objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program. Barbara Liskov is the original author of this principle, hence the L in SOLID.

The substitution principle is essential to creating evolutionary architecture. Most innovations will consist of incremental changes. Yet, some will require significant changes and necessitate running multiple versions of a capability simultaneously. Following the LSP, we can substitute in new versions, so long as they fulfill the contracts with upstream and downstream dependencies.

The LSP will play a major role in the hexagonal architecture we will cover shortly. We will use events to define the contracts for inter-service communication. This design-by-contract approach enables the substitution that powers the branch-by-abstraction approach. We can substitute event producers and swap event consumers. We will leverage the LSP to strangle legacy applications as we modernize our systems and continue this evolutionary process indefinitely to support continuous innovation with zero downtime. But zero downtime requires an understanding of all the interdependencies, and this leads us to the Interface Segregation Principle.

Interface Segregation Principle

The Interface Segregation Principle (ISP) states that no client should be forced to depend on interfaces they do not use.

I provided an anecdote at the beginning of this chapter that highlighted the build-time issues that can arise when we violate the ISP. These build-time issues can have a big impact on a monolith. However, they are of less concern for our autonomous services because they are independently deployable units with their own CI/CD pipelines. We are still concerned about including unnecessary libraries because they can have an impact on cold start times and increase the risk of security vulnerabilities. But our real concern is with the deployment and runtime implications of violating the ISP and how they impact downtime.

Our goal is to create an architecture that enables change so that we can reduce our lead times, deploy more often, and tighten the feedback loop. This requires confidence that our deployments will not break the system. Our confidence comes from an understanding of the scope and impact of any given change and our certainty that we have accounted for all the side effects and minimized the potential for unintended consequences. We facilitate this by creating clean and lean interfaces that we segregate from other interfaces to avoid polluting them with unnecessary concerns. This minimizes coupling and limits scope so that we can easily identify the impacts of a change and coordinate a set of zero-downtime deployments.

A common mistake that violates the ISP is the creation of general-purpose interfaces. The misperception is that reusing a single interface will accelerate development time. This may be true in the short term, but not in the long term. The increased coupling ultimately impedes innovation because of competing demands and the risk of unintended consequences.

This is a primary driver behind creating client-specific interfaces using the Backend for Frontend pattern. We will cover this pattern in detail in Chapter 6, A Best Friend for the Frontend.

For all our inter-service communication, our individual domain events are already nicely segregated, because we can change each of them independently. We do have to account for the fact that many downstream services will depend on these events. We will manage this by dividing the system into subsystems of related services and using internal domain events for intra-subsystem communication and external domain events for inter-subsystem communication. From here, we will leverage the Robustness principle to incrementally evolve our domain events. We will see this in play in Chapter 11, Choreographing Deployment and Delivery.

Even with well-segregated interfaces, we still need to avoid leaky abstractions. This occurs when details of specific upstream services are visible in the event payload, and we inadvertently use those details in services downstream. This leads us to the Dependency Inversion Principle.

Dependency Inversion Principle

The Dependency Inversion Principle (DIP) states that a module should depend on abstractions, not concretions.

At the code level, we also refer to the DIP as programming to an interface, not to an implementation. We also refer to it as Inversion of Control (IoC). It manifests itself in the concept of Dependency Injection (DI), which became an absolute necessity in monolithic systems. It eliminates cyclic dependencies, enables testing with mocks, and permits code to change and evolve by substituting implementations while holding interfaces closed to modification. We will use simple constructor-based DI in our serverless functions without the need for a heavy-weight framework.
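As a minimal sketch of what constructor-based DI can look like in a serverless function (all names here are illustrative, not taken from the book’s code):

```typescript
// Port: the abstraction the domain logic depends on.
interface EventPublisher {
  publish(event: object): Promise<void>;
}

// The model receives its dependency through the constructor and never
// imports a concrete cloud SDK.
class OrderModel {
  constructor(private readonly publisher: EventPublisher) {}

  async placeOrder(order: { id: string }): Promise<void> {
    // ...validate and persist the order here...
    await this.publisher.publish({ type: 'OrderPlaced', order });
  }
}

// A concrete adapter wired in at the function's entry point.
class BusPublisher implements EventPublisher {
  async publish(event: object): Promise<void> {
    // call the cloud provider's SDK here
    console.log('publishing', event);
  }
}

const model = new OrderModel(new BusPublisher());

export const handler = async (request: { orderId: string }) =>
  model.placeOrder({ id: request.orderId });
```

In a unit test, we can construct the model with an in-memory publisher and assert on the captured events, without touching any cloud resources.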

At the architecture level, I think we can best understand the value of the DIP by using the scientific method as an analogy. Holding variables constant is a crucial component of the scientific method. When we perform a scientific experiment, we cannot derive anything from the results if nothing was held constant because it is impossible to determine cause and effect. In other words, some level of stability is a prerequisite for the advancement of knowledge. Stability in the face of change is the whole motivation behind autonomous services. We need the ability to continuously change and evolve a running system while maintaining stability.

The DIP is a fundamental principle for creating architecture that provides for flexibility and evolution while maintaining stability. For any given change to a service, we have to hold something constant to maintain the stability of the system. That constant in an event-first system is the domain events. When we modify any service, all others will rely on the stability of the event types they all share. This will control the scope and impact of the change so that teams have the confidence to move forward.

Translating the DIP to the architectural level, we get the following:

  • Domain events are the abstractions and autonomous services are the concretions.
  • Upstream services should not depend on downstream services and vice versa. Both should depend on domain events.
  • Domain events should not depend on services. Services should only depend on domain events.

In fact, many upstream services won’t know or care that a downstream service exists at all. Downstream services will depend on the presence of an upstream service because something has to produce a needed event, but they should not know or care which specific upstream service produced an event.

This ultimately means that upstream services are responsible for the flow of control and downstream services are responsible for the flow of dependencies. In other words, upstream services control when an event is produced and downstream services control who consumes those events. Hence, the name of the principle still applies at the architectural level. We are inverting the flow of dependency (that is, consumption) from the flow of control (that is, production).

This event-based collaboration will be most evident when we cover the Control Service Pattern in Chapter 8, Reacting to Events with More Events. These high-level services embody control flow policies and other services simply react to the events they produce.

Taking the DIP to the next level is the notion that downstream services react to domain events, and upstream services do not invoke downstream services. This is an inversion of responsibility that leads to more evolutionary systems. It allows us to build systems in virtually any order and gives us the ability to create end-to-end test suites that don’t require the entire system to be running or even completely implemented. This stems from the power of Event-First Thinking, which we will cover shortly. But first, let’s review the concepts of Hexagonal Architecture.

Hexagonal Architecture

Building on the SOLID principles, we need an architecture that allows us to assemble our software in flexible ways. Our software needs the ability to execute in different runtime environments, such as a serverless function, a serverless container, or a testing tool. We may decide to change a dependency, such as switching from one type of datastore to another, or even switching cloud providers. Most importantly, we need the ability to test our domain logic in isolation from any remote resources. To support this flexibility, we need a loosely coupled software architecture that hides these technical design decisions from the domain logic.

Alistair Cockburn’s Hexagonal Architecture is a popular approach for building loosely coupled modules. Robert Martin’s Clean Architecture and Jeffrey Palermo’s Onion Architecture are variations on this theme. An alternate name for this architecture is Ports and Adapters, because we connect modules through well-defined ports (that is, interfaces) and glue them together with adapter code. When we need to make a change, we leverage the SOLID principles, such as the DIP and LSP, and substitute different adapters.

Hexagonal architecture gets its name from the hexagonal shapes we use in the diagrams. We use hexagons simply because they allow us to represent systems as a honeycomb structure of loosely coupled modules. We read these diagrams from left to right. We refer to the left side as the primary or driving side because it defines the module’s input flow, while we call the right side the secondary or driven side since it defines the output flow. The inner hexagon represents the clean domain model with its input and output ports and the outer hexagon holds the inbound and outbound adapters.

Hexagonal architecture emerged well before serverless technology and the popularity of the cloud. Its origins align with the development of more monolithic software systems. Today, our serverless solutions consist of fine-grained resources that we compose into distributed, event-driven systems. As a result, the shape of our systems has changed, but their nature remains the same, and therefore, hexagonal architecture is more important than ever.

We need to scale hexagonal architecture up to a macro perspective and down to a nano perspective. In our serverless architecture, we will apply the hexagonal concepts at three different levels: the function, service, and subsystem levels, as depicted in Figure 2.1:

Figure 2.1: Hexagonal architecture

This summary diagram features detailed diagrams for each of the three levels, Nano, Micro, and Macro, all displayed together so we can get a bird’s-eye view of how our serverless systems fit together. We need a mental picture of what is happening in a serverless function, how we combine these functions into autonomous services, and how we compose autonomous subsystems out of these services. We will see these diagrams again, individually, when we dig into the details. For now, let’s review summary descriptions of each level.

Note that you can find a legend for all the icons in the preface of the book.

Function-level (nano)

The function-level or nano hexagonal architecture describes the structure and purpose of the code within a serverless function. This level is most like traditional hexagonal architecture, but we scale it down to the scope of an individual serverless function. The handler and connector adapt the Model to the cloud services. We will dig deeper into this level in the Dissecting an autonomous service section of this chapter.
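As a taste of what is coming, here is a minimal TypeScript sketch of these three roles inside a single listener function; the names are illustrative rather than taken from the book’s code:

```typescript
// Connector: a narrow port that hides the datastore from the domain logic.
interface Datastore {
  put(item: Record<string, unknown>): Promise<void>;
}

// Model: pure domain logic with no knowledge of the cloud runtime.
const toViewModel = (event: { type: string; order: { id: string } }) => ({
  pk: event.order.id,
  ...event.order,
});

// Handler: adapts the raw cloud records to the model and the connector.
export const createHandler =
  (datastore: Datastore) =>
  async (records: { body: string }[]): Promise<void> => {
    for (const record of records) {
      const event = JSON.parse(record.body);
      await datastore.put(toViewModel(event));
    }
  };
```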

Service-level (micro)

The service-level or micro hexagonal architecture describes the structure and purpose of the resources within an autonomous service. This level is less traditional because we are spreading the code across multiple serverless functions. The entities of the internal domain model live in a dedicated datastore so that we can share them across all the functions. The listener and trigger functions adapt the internal domain model to the domain events exchanged between services. The command and query functions adapt the model for frontend communication. We will dig deeper into the details of this level in the Dissecting an autonomous service section later in this chapter.

Subsystem-level (macro)

The subsystem-level or macro hexagonal architecture describes the structure and purpose of the autonomous services within an autonomous subsystem. This level is different from traditional hexagonal architecture because the adapters are full services instead of just code artifacts. The Core of the subsystem is composed of Backend for Frontend (BFF) services and Control services that work together to implement the domain model, and the internal domain events exchanged between these services define the ports of the model. The Ingress and Egress External Service Gateway (ESG) services adapt external domain events to internal domain events. We will cover external domain events in the Creating subsystem bulkheads section of this chapter. We will dig into the details of this level when we cover the ESG pattern in Chapter 7, Bridging Intersystem Gaps.

Now that we have covered the foundational concepts, let’s learn how event-first thinking helps us create evolutionary systems.

Thinking about events first

In the first chapter, we covered a brief history of software integration styles and the forces that impact lead times. We designed autonomous services to enable teams to maximize their pace of innovation because they give teams the confidence they need to minimize lead time and batch size. However, to deliver on this promise, we need to change the way we act, which means we need a different way of thinking.

We need to do the following:

  1. Start with event storming
  2. Focus on verbs instead of nouns
  3. Treat events as facts instead of ephemeral messages
  4. Turn APIs inside out by treating events as contracts
  5. Invert responsibility for invocation
  6. Connect services through an event hub

In other words, we need to think event-first. We can start to change our perspective by using a technique called event storming.

Start with event storming

Event storming is a workshop-oriented technique that helps teams discover the behavior of their business domain. It begins with brainstorming. The team starts by coalescing an initial set of domain events on a board using orange sticky notes. Next, the team sequences the notes to depict the flow of events.

The following is a simplified example of the flow of events for a typical food delivery service. I will use and elaborate on this example throughout the book:

Figure 2.2: Event storming – the flow of events

Along the way, the team will iteratively add more details using sticky notes of different colors, such as the following:

  • The command (blue) that performed the action that generated the event
  • The users (yellow) or external systems (pink) that invoked the command
  • The aggregate business domain (tan) whose state changed
  • The read-only data (green) that we need to support decision-making
  • Any policies (gray) that control behavior
  • The overall business process (purple) that is in play

Note that event storming is not a substitute for user stories and story mapping. User stories and story mapping are project management techniques for dividing work into manageable units and creating roadmaps. Event storming facilitates the discovery of user stories and the boundaries within our software architecture.

Focus on verbs instead of nouns

The flow of events discovered in the event-storming exercise clearly captures the behavior of the system. This event-first way of thinking is different because it zeroes in on the verbs instead of the nouns of the business domain. Conversely, more traditional approaches, such as object-oriented design, tend to focus on the nouns and create a class for each noun. However, when we focus on the nouns, we tend to create services that are resistant to change because they violate the SRP and the ISP.

As an example, it is not uncommon to find a service whose single responsibility is everything to do with a single domain aggregate. These services will end up containing all the commands that operate on the data of the domain. However, as we discussed in the SOLID principles section, the SRP is intended to focus on the actors of the system. Different actors initiate different commands, which means that these noun-focused services ultimately serve many masters with competing demands. This will impede our ability to change these services when necessary.

Instead, we need to segregate the various commands across the different actors. By focusing on the verbs of the domain model, we are naturally drawn to creating services for the different actors that perform the actions. This eliminates the competing demands that add unnecessary complexity to the code and avoids coupling an actor to unneeded commands.

Of course, now that the actors are the focal point of our services, we will need a way to share the nouns (that is, domain aggregates) between services without increasing coupling. We need a record of truth. To address this, we first need to start thinking of events as facts, instead of just ephemeral messages.

Treat events as facts instead of ephemeral messages

Let’s recognize that when we think about events, we are focusing on the outputs of the system instead of the inputs. We are thinking in the past tense and thus we are focusing on the facts the system will produce over time. This is powerful in multiple ways.

It turns out we are implicitly building business analytics and observability characteristics into the system. For example, we can count the ViewedMenu events to track the popularity of the different restaurants and we can monitor the rate of PlacedOrder events to verify the health of the system.

We can also use this information to validate the hypothesis of each lean experiment we perform to help ensure we are building the right system and delivering on our business goals and objectives. In other words, event-first thinking facilitates observability mechanisms that help build team confidence and thus momentum.

However, to turn events into facts, we must treat them as first-class citizens instead of ephemeral messages. This is different from traditional messaging-based architectures, where we throw away the messages once we have processed them. We don’t want to treat events as ephemeral messages because we lose valuable information that we cannot easily recreate, if at all.

We will instead treat events as immutable facts and store them in an event lake in perpetuity. The event lake will act as the record of truth for the facts of the system. However, to make the record of truth complete we must think of events as contracts instead of mere notifications.

Turn APIs inside out by treating events as contracts

Many event-driven systems use events for notifications only. These anemic events only contain the identifier of the business domain entity that produced the event. Downstream services must retrieve the full data when they need it. This introduces coupling because it requires a synchronous call between the services. It may also create unwanted race conditions since the data can change before we retrieve it.

The usefulness of notification events as the record of truth is very limited because we will often refer to these facts far off in the future, well after the domain data has changed. To fully capture the facts, we need events to represent a snapshot in time of the state of the domain aggregate when the event occurred. This allows us to treat the facts as an audit log that is analogous to the transaction log of a database. This is a very powerful concept because a database uses the transaction log as the record of truth to manage the state and integrity of the database.

We are essentially turning the database inside out and creating a systemwide record of truth that we can leverage to manage the state and integrity of the entire system. For example, we can leverage the facts to transfer (that is, replicate or rebuild) the state of domain aggregates (that is, nouns) between services. This eliminates the need for aligning services around domain aggregates and results in an immensely scalable and resilient system.

This means that we are turning our APIs inside out by using events as the contracts between services. This also implies a guarantee of backward compatibility, and we will therefore create strong contracts between services within a subsystem and even stronger contracts between subsystems. At first glance, it may appear that this way of thinking will make the system more rigid. In reality, we are making the system more flexible and evolutionary by inverting responsibility to downstream services so they can react to events as they see fit.
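The following TypeScript sketch contrasts the two styles of event; the field names are illustrative:

```typescript
// An anemic notification event only points at the data; consumers must
// call back to retrieve it, which reintroduces coupling and race conditions.
interface OrderPlacedNotification {
  type: 'OrderPlaced';
  orderId: string;
}

// An event treated as a fact carries a snapshot of the domain aggregate
// at the moment the event occurred, so the event lake can serve as an
// audit log and record of truth.
interface OrderPlacedFact {
  type: 'OrderPlaced';
  id: string;        // unique event identifier
  timestamp: number; // when the fact occurred
  order: {           // snapshot of the aggregate's state
    id: string;
    customerId: string;
    items: { menuItemId: string; quantity: number; price: number }[];
    total: number;
  };
}
```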

Invert responsibility for invocation

The DIP, as we covered earlier in the chapter, was a major advancement in software design, because it decoupled high-level policy decisions from low-level dependency decisions. This gave teams the flexibility to substitute different implementations of the low-level components without breaking the logic in the high-level components. In other words, the DIP facilitated the use of the LSP and the OCP to make systems much more stable and flexible.

We elevated the DIP to the architectural level by using events as the abstraction (that is, contract) between autonomous services. This promotes the stability of the system when we modify a service because we are holding the contracts constant to control the scope and impact of any given deployment. But we gain more than just stability; we also gain flexibility. The use of events for inter-service communication gives rise to an inversion of responsibility that makes systems reactive. The best way to understand this improvement is to compare the old imperative approach to the new reactive approach.

The traditional, imperative approach to implementing systems is command focused. One component determines when to invoke another. For example, in our food delivery system, we would traditionally have the checkout functionality make a synchronous call to an order management service to invoke a command that submits the customer’s order. This means that we are coupling the checkout component to the presence of an order management service because it is responsible for the decision to invoke the command. This may not seem like a problem until we apply the same approach to retrieving driver status. Any number of components will need to invoke a service over and over again to retrieve the driver status, and that service will likely become a bottleneck.

Alternatively, we end up with a much more resilient and flexible system when we employ the reactive approach. The checkout component simply produces an OrderPlaced event. The order management service is now responsible for the decision to consume this event and react as it sees fit. The driver service simply produces DriverStatusChanged events when there is something useful to report. Any other service can take responsibility for reacting to driver events without impacting the driver service.
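Here is a minimal sketch of the two approaches, with hypothetical signatures:

```typescript
interface Order { id: string }

// Imperative: the checkout component decides when order management runs,
// coupling it to the presence of that specific service.
async function checkoutImperative(
  order: Order,
  orderService: { submit(o: Order): Promise<void> },
): Promise<void> {
  await orderService.submit(order);
}

// Reactive: the checkout component only announces the fact. Any number of
// downstream services can consume the event without the producer knowing
// they exist.
async function checkoutReactive(
  order: Order,
  hub: { publish(e: object): Promise<void> },
): Promise<void> {
  await hub.publish({ type: 'OrderPlaced', order });
}
```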

This inversion of responsibility is a key characteristic of autonomous services. It greatly reduces the complexity of the individual services because it reduces their responsibilities. A service is already aware of its own state, and it can simply produce events to reflect the changes without taking responsibility for what happens next. Downstream services take responsibility for how they react to upstream events. This completely decouples services from one another. They are all autonomous. This simplicity makes it much easier for teams to gauge the correctness and impact of any given change. Teams can be confident that the system will remain stable, if they uphold the contracts.

The reactive nature of event-first thinking is a paradigm shift, but the benefits are well worth the effort. A system becomes free to evolve in unforeseen ways by simply adding consumers. We can implement services in virtually any order because we can simulate upstream events and there is no coupling to downstream consumers. We gain the ability to create end-to-end test suites that don’t require other services to be running at the same time. The bottom line is that the reactive nature of autonomous services enables autonomous teams to react much more quickly to feedback as they learn from their experiments.

Connect services through an event hub

There is a myth that event-driven systems are much more complex, but this couldn’t be further from the truth. Event-first thinking allows us to create arbitrarily complex systems by connecting subsystems in a simple fractal topology as Figure 2.3 depicts:

Figure 2.3: Event-first topology

At the heart of our event-first architecture is the event hub. It connects everything together and pumps events through the system. Each subsystem has an event hub at the center, and each autonomous service connects to the event hub through well-defined ports (that is, events).

From a single service to many services in a subsystem, and from a single subsystem to many subsystems in a system, this simple pattern of connecting autonomous services and subsystems by producing and consuming events repeats ad infinitum. This flexibility frees us to build ever-evolving systems. We will dig into the details of the event hub in Chapter 4, Trusting Facts and Eventual Consistency, and we will see how to connect subsystems in Chapter 7, Bridging Intersystem Gaps.

Event-first is a very powerful approach, but adopting this way of thinking can be a journey. Let’s continue that journey by learning how to divide a system into autonomous subsystems, then we will move on to the autonomous service patterns within each subsystem, and finally, we will dig into the anatomy of these services. Then we will be ready to bring all the details together throughout the remaining chapters.

Dividing a system into autonomous subsystems

The goal of software architecture is to define boundaries that enable the components of the system to change independently. We could dive straight down into defining the individual services of the system, but as the number of services grows the system will become unwieldy. Doing so contributes to the creation of microliths and microservice death stars. As architects, our job is to facilitate change, which includes architecting a manageable system.

We need to step back and look at the bigger picture. We must break a complex problem down into ever-smaller problems that we can solve individually and then combine into the ultimate solution. We need to divide the system into a manageable set of high-level subsystems, each of which has a single reason to change. These subsystems will constitute the major bounded contexts of the system. We will apply the SRP along different dimensions to help us arrive at boundaries that enable change. This will facilitate organizational scale with separate groups managing the individual subsystems.

We also need our subsystems to be autonomous, in much the same way that we create autonomous services. This will give autonomous organizations the confidence to continuously innovate within their subsystems. We will accomplish this by creating bulkheads between the subsystems. A system resembling the event-first topology depicted in Figure 2.3 will begin to emerge. The purpose of each subsystem will be clear, and the subsystem architecture will allow the system to evolve in a dynamic business environment.

Let’s look at some ways we can divide up a system.

By actor

A logical place to start carving up a system into subsystems is along the external boundaries with the external actors. These actors are the users and the external systems that directly interact with the system. Following the SRP, each subsystem might be responsible to one and only one actor.

In our event storming example earlier, we identified a set of domain events for a food delivery system. During the event storming workshop, we would also identify the users (yellow) and external systems (pink) that produce or consume those events, such as in Figure 2.4. In this example, we might have a separate subsystem for each category of user: Customer, Driver, and Restaurant. We may also want a subsystem for each category of external system, such as relaying orders to the restaurant’s ordering systems, processing payments, and pushing notifications to customers:

Figure 2.4: System context diagram

Of course, this is a simple example. Enterprise systems may have many kinds of users and lots of external systems, including legacy and third-party systems. In this case, we will need to look for good ways to organize actors into cohesive groups and these groups may align with the business units.

By business unit

Another good place to look for architectural boundaries is between business units. A typical organizational chart can provide useful insights. Each unit will ultimately be the business owner of its subsystems and thus they will have a significant impact on when and how the system changes.

Keep in mind that we are interested in the organization of the company, not the IT department. Conway’s Law states that organizations are constrained to produce designs that are copies of their communication structures. We have seen that the communication structure leads to dependencies, which increases lead time and reduces the pace of innovation. So, we want to align each autonomous subsystem with a single business unit. We often refer to this approach as the Inverse Conway Maneuver.

However, the organizational structure of a company can be unstable. A company may reorganize its business units for a variety of reasons. So, we should look deeper into the work the business units actually perform.

By business capability

Ultimately, we want to draw our architectural boundaries around the actual business capabilities that the company provides. Each autonomous subsystem should encapsulate a single business capability or at most a set of highly cohesive capabilities.

Going back to our event-storming approach and our event-first thinking, we are looking for logical groupings of related events (that is, verbs). There will be high temporal cohesion within these sets of domain events. They will be initiated by a group of related actors that are working together to complete an activity. For example, in our food delivery example, a driver may interact with a dispatch coordinator to ensure that a delivery is successful.

The key here is that the temporal cohesion of the activities within a capability helps to ensure that the components of a subsystem will tend to change together. This cohesion allows us to scale the SRP to the subsystem level when there are many different actors. The individual services within a subsystem will be responsible to the individual actors, whereas a subsystem is responsible to a single set of actors that work together in a business process (purple) to deliver a business capability:

Figure 2.5: Capabilities subsystems

Figure 2.5 depicts our food delivery system from the capabilities perspective. It is similar to the system context diagram in Figure 2.4, but the functionality is starting to take shape. However, we may find more subsystems when we look at the system from the perspective of the data life cycle.

By data life cycle

Another place to look for architectural boundaries is along the data life cycle. Over the course of the life of a piece of data, the actors that use and interact with the data will change and so will their requirements. Bringing the data life cycle into the equation will help uncover some overlooked subsystems. We will usually find these subsystems near the beginning and the end of the data life cycle. In essence, we are applying the SRP all the way down to the database level. We want to discover all the actors that interact with the data so that we can isolate these sources of change into their own bounded contexts (that is, autonomous subsystems).

Going back to event-first thinking, we are stepping back and taking a moment to focus on the nouns (that is, domain aggregates) so that we can discover more verbs (that is, domain events) and the actors that produce those events. This will help find what I refer to as slow data (green). We typically zero in on the fast data (tan). This is the transactional data in the system that actors are continuously creating. However, the transactional data often relies on reference data and government regulation may impose records management requirements on how long we must retain the transactional data. We want to decouple these sources of change so that they do not impact the flexibility and performance of the transactional and analytics data. We will cover this topic in detail in Chapter 5, Turning the Cloud into the Database:

Figure 2.6: Data life cycle subsystems

Figure 2.6 depicts subsystems from the data life cycle perspective. We will likely need a subsystem upstream that owns the master data model that all downstream subsystems use as reference data. And all the way downstream we will usually have an analytics subsystem and a records management subsystem. In the middle lie the transactional subsystems that provide the capabilities of the system, as we saw in Figure 2.5. We will also want to carve out subsystems for any legacy systems.

By legacy system

Our legacy systems are a special case. In Chapter 7, Bridging Intersystem Gaps, and Chapter 13, Don’t Delay, Start Experimenting, we will cover an event-first migration pattern for integrating legacy systems known as the Strangler pattern. Without going into detail here, we are creating an anti-corruption layer around the legacy systems that enables them to interact with the new system by producing and consuming domain events. This creates an evolutionary migration path that minimizes risk by keeping the legacy systems active and synchronized until we are ready to decommission them. This extends the substitution principle to the subsystem level because we can simply remove the legacy subsystem once the migration is complete.

We are essentially treating the legacy systems as an autonomous subsystem with bulkheads that we design to eliminate coupling between the old and the new and to protect the legacy infrastructure by controlling the attack surface and providing backpressure. We will use the same bulkhead techniques we are using for all subsystems: separate cloud accounts and external domain events.

Let’s look at how we can create these subsystem bulkheads next.

Creating subsystem bulkheads

What is the most critical subsystem in your system? This is an interesting question. Certainly, they are all important, but some are more important than others. For many businesses, the customer-facing portion of the system is the most important. After all, we must be able to engage with the customers to provide a service for them. For example, in an e-commerce system, the customer must be able to access the catalog and place orders, whereas the authoring of new offers is less crucial. So, we want to protect the critical subsystems from the rest of the system.

Our aim is to fortify all the architectural boundaries in a system so that autonomous teams can forge ahead with experiments, confident in the knowledge that the blast radius will be contained when teams make mistakes. At the subsystem level, we are essentially enabling autonomous organizations to manage their autonomous subsystems independently. Let’s look at how we can fortify our autonomous subsystems with bulkheads.

Separate cloud accounts

Cloud accounts form natural bulkheads that we should leverage as much as possible to help protect us from ourselves. Far too often we overload our cloud accounts with too many unrelated workloads, which puts all the workloads at risk. At a bare minimum, development and production environments must be in separate accounts. But we can do better by having separate accounts, per subsystem, per environment. This will help control the blast radius when there is a failure in one account. Here are some of the natural benefits of using multiple accounts:

  • We control the technical debt that naturally accumulates as the number of resources within an account grows. It becomes difficult, if not impossible, to see the forest for the trees, so to speak, when we put too many workloads in one account. The learning curve increases because the account is not clean. Tagging resources helps, but tags are prone to omission. Engineers eventually resist making any changes because the risk of making a mistake is too high. The likelihood of a catastrophic system failure also increases.
  • We improve our security posture by limiting the attack surface of each account. Restricting access is as simple as assigning team members to the right role in the right account. If a breach does occur, then access is limited to the resources in that one account. In the case of legacy systems, we can minimize the number of accounts that have access to the corporate network, preferably to just one per environment.
  • We have less competition for limited resources. Many cloud resources have soft limits at the account level that throttle access when transaction volumes exceed a threshold. The likelihood of hitting these limits increases as the number of workloads increases. A denial-of-service attack or a runaway mistake on one workload could starve all other workloads. We can request increases to these limits, but this takes time. Instead, the default limits may provide plenty of headroom once we allocate accounts for individual subsystems.
  • Cost allocation is simple and error resistant because we allocate everything in an account to a single cost bucket without the need for tagging. This means that no unallocated costs occur when tagging is incomplete. We also minimize the hidden costs of orphaned resources because they are easier to identify.
  • Observability and governance are more accurate and informative because monitoring tools tag all metrics by account. This allows filtering and alerting by subsystem. When failures do occur, the limited blast radius also facilitates root cause analysis and a shorter mean time to recovery.

Having multiple accounts means that there are cross-cutting capabilities that we must duplicate across accounts, but we will address this shortly when we discuss automation in the Governing without impeding section.

External domain events

We have already discussed the benefits of using events as contracts, the importance of backward compatibility, and how asynchronous communication via events creates a bulkhead along our architectural boundaries. Now we need to look at the distinction between internal and external domain events.

Within a subsystem, its services will communicate via internal domain events. The definitions of these events are relatively easy to change because the autonomous teams that own the services work together in the same autonomous organization. The event definitions will start out messy but will quickly evolve and stabilize as the subsystem matures. We will leverage the Robustness principle to facilitate this change. The events will also contain raw information that we want to retain for auditing purposes but that is of no importance outside of the subsystem. All of this is OK because it is all in the family, so to speak.

Conversely, across subsystem boundaries, we need more regulated team communication and coordination to facilitate changes to these contracts. As we have seen, this communication increases lead time, which is the opposite of what we want. We want to limit the impact that this has on internal lead time, so we are free to innovate within our autonomous subsystems. We essentially want to hide internal information and not air our dirty laundry in public.

Instead, we will perform all inter-subsystem communication via external domain events (sometimes called integration events). These external events will have much more stable contracts with stronger backward compatibility requirements. We intend for these contracts to change slowly to help create a bulkhead between subsystems. DDD refers to this technique as context mapping, such as when we use domain aggregates with the same terms in multiple bounded contexts, but with different meanings.

External events represent the subsystem’s ports in hexagonal terminology. In Chapter 7, Bridging Intersystem Gaps, we will cover the External Service Gateway (ESG) pattern. Each subsystem will treat related subsystems as external systems. We will bridge the internal event hubs of related subsystems to create the event-first topology depicted in Figure 2.3. Each subsystem will define egress gateways that define what events it is willing to share and hide everything else. Subsystems will define ingress gateways that act as an anti-corruption layer to consume upstream external domain events and transform (that is, adapt) them to its internal formats.
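As a sketch of the adaptation step an ingress gateway performs (the shapes and names below are illustrative):

```typescript
// The upstream subsystem's external event, as shared by its egress gateway.
interface ExternalOrderPlaced {
  type: 'OrderPlaced';
  order: { id: string; customerId: string; total: number };
}

// A hypothetical internal event in this subsystem's own format.
interface DeliveryRequested {
  type: 'DeliveryRequested';
  orderId: string;
  amount: number;
}

// The ingress gateway's transform derives the internal event from the
// external one, so no internal service depends on the external shape.
const toInternal = (external: ExternalOrderPlaced): DeliveryRequested => ({
  type: 'DeliveryRequested',
  orderId: external.order.id,
  amount: external.order.total,
});
```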

Now that we have an all-important subsystem architecture with proper bulkheads, let’s look at the architecture within a subsystem. Let’s see how we can decompose an autonomous subsystem into autonomous services.

Dissecting an autonomous subsystem

At this point, we have divided our system into autonomous subsystems. Each subsystem is responsible to a single dominant actor who drives change. All subsystems are autonomous because they communicate via external domain events, and they are each housed in a separate cloud account that forms a natural bulkhead. This autonomy allows us to change the subsystems independently.

Now we are ready to start decomposing our subsystems into autonomous services. Again, the SRP plays a major role in defining the boundaries within a subsystem. First, we need to place a subsystem in context, then we will set up common components, and finally, we apply the major autonomous service patterns.

Context diagram

We will apply a set of autonomous service patterns to decompose a subsystem into services. These patterns cater to the needs of different categories of actors. So, we need to understand the context of an autonomous subsystem before we can decompose it into autonomous services. In other words, we need to know all the external actors that the subsystem will interact with. During event storming, we identified the behavior of the system and the users and external systems that are involved. Then we divided the system into autonomous subsystems so that each has a single axis of change.

A simple context diagram can go a long way to putting everyone on the same page regarding the scope of a subsystem. The context diagram enumerates all the subsystem’s external actors using yellow cards for users and pink cards for external systems. The diagram will contain a subset of the actors identified during event storming. We have encapsulated many of the original actors within other subsystems, so we will treat those subsystems as external systems. Figure 2.7 depicts the context of the Customer subsystem:


Figure 2.7: Subsystem context diagram

The Customer subsystem of our example Food Delivery System might have the following actors:

  • The Customer will be the user of this subsystem and the dominant actor.
  • The Restaurant Subsystem will publish external domain events regarding the restaurants and their menus.
  • A Payment Processor must authorize the customer’s payment method.
  • The subsystem will exchange OrderPlaced and OrderReceived external domain events with the Order Subsystem.
  • The Delivery Subsystem will publish external domain events about the status of the order.

Now that the context is clear, we can start decomposing the system into its frontend, services, and common components.

Micro frontend

Each autonomous subsystem is responsible to a single primary user or a single cohesive group of users. These users will need a main entry point to access the functionality of the subsystem. Each subsystem will provide its own independent entry point so that it is not subject to the changing requirements of another subsystem.

The user interface will not be monolithic. We will implement the frontend using autonomous micro-apps that are independently deployed. The main entry point will act as a metadata-driven assembly and menu system. This will allow each micro-app to have a different reason to change and help ensure that the frontend is not responsible for increasing lead times and impeding innovation.

We will cover the frontend architecture in detail in Chapter 3, Taming the Presentation Tier.

Event hub

Each autonomous subsystem will contain its own independent event hub, as depicted in Figure 2.8, to support asynchronous inter-service communication between the autonomous services of the subsystem. Services will publish domain events to the event hub as their state changes. The event hub will receive incoming events on a bus. It will route all events to the event lake for storage in perpetuity, and it will route events to one or more channels for consumption by downstream services:


Figure 2.8: Event hub

We will cover the event hub in detail in Chapter 4, Trusting Facts and Eventual Consistency. In Chapter 7, Bridging Intersystem Gaps, we will cover how to bridge the event hubs of different subsystems together to create the event-first topology depicted in Figure 2.3.
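
In the meantime, here is a rough sketch of what publishing a domain event to the hub might look like with AWS EventBridge as the bus; the bus name and event shape are assumptions, not prescriptions:

import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const client = new EventBridgeClient({});

// Publish a domain event to the subsystem's internal event hub bus
export const publish = async (event: { type: string; timestamp: number }) =>
  client.send(new PutEventsCommand({
    Entries: [{
      EventBusName: process.env.BUS_NAME, // assumed environment variable
      Source: 'custom',
      DetailType: event.type,
      Detail: JSON.stringify(event),
    }],
  }));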

Autonomous service patterns

There are three high-level autonomous service patterns that all our services will fall under as depicted in Figure 2.9. At the boundaries of our autonomous subsystems are the Backend For Frontend (BFF) and External Service Gateway (ESG) patterns. Between the boundary patterns lies the Control service pattern. Each of these patterns is responsible to a different kind of actor, and hence supports different types of changes:


Figure 2.9: Service patterns

Backend For Frontend

The Backend For Frontend (BFF) pattern works at the boundary of the system to support end users. Each BFF service supports a specific frontend micro-app, which supports a specific actor.

In the Customer subsystem of our example Food Delivery System, we might have BFFs to browse restaurants and view their menus, sign up and maintain account preferences, place orders, view the delivery status, and view order history. These BFFs typically account for about 40% of the services in a subsystem.

A listener function consumes domain events from the event hub and caches entities in materialized views that support queries. The synchronous API provides command and query operations that support the specific user interface. A trigger function reacts to the mutations caused by commands and produces domain events to the event hub.
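
As a minimal sketch, a listener for the restaurant-browsing BFF might cache upstream menu events in a DynamoDB materialized view like this; the table name, key schema, and event shape are assumptions:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Listener: cache an upstream menu event as a lean materialized view
export const listener = async (event: { detail: { menuId: string; restaurantId: string; items: object[] } }) =>
  doc.send(new PutCommand({
    TableName: process.env.ENTITIES_TABLE_NAME, // assumed environment variable
    Item: {
      pk: event.detail.menuId,
      sk: 'menu',
      restaurantId: event.detail.restaurantId,
      items: event.detail.items,
    },
  }));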

We will cover this pattern in detail in Chapter 6, A Best Friend for the Frontend.

External Service Gateways

The External Service Gateway (ESG) pattern works at the boundary of the system to provide an anti-corruption layer that encapsulates the details of interacting with other systems, such as third-party systems, legacy systems, and sister subsystems. These services act as a bridge to exchange events between the systems.

In the Customer subsystem of our example Food Delivery System, we might have ESGs to receive menus from the Restaurant subsystem, forward orders to the Order subsystem, and receive the delivery status from the Delivery subsystem. The Order subsystem would have ESGs that integrate with the various order systems used by restaurants. The Delivery subsystem would have an ESG to integrate with a push notifications provider. These ESGs typically account for upwards of 50% of the services in a subsystem.

An egress function consumes internal events from the event hub and then transforms and forwards the events out to the other system. An ingress function reacts to external events in another system and then transforms and forwards those events to the event hub.
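
A minimal sketch of the two halves, reduced to pure transformation functions with hypothetical event shapes (the connectors that actually move the events between buses are omitted):

// Hypothetical internal and external event shapes
type InternalEvent = { type: string; timestamp: number; order: { id: string; status: string } };
type ExternalEvent = { type: string; timestamp: number; order: { id: string; status: string } };

// Egress: share only whitelisted internal events, transformed to the external contract
export const egress = (internal: InternalEvent): ExternalEvent | undefined =>
  internal.type === 'order-submitted'
    ? { type: 'order-placed', timestamp: internal.timestamp, order: internal.order }
    : undefined; // hide everything else

// Ingress: the anti-corruption layer adapts the upstream contract to the internal model
export const ingress = (external: ExternalEvent): InternalEvent => ({
  type: 'delivery-status-changed',
  timestamp: external.timestamp,
  order: { id: external.order.id, status: external.order.status },
});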

We will cover this pattern in detail in Chapter 7, Bridging Intersystem Gaps.

Control services

The Control Service pattern helps minimize coupling between services by mediating the collaboration between boundary services. These services encapsulate the policies and rules that are governed by the business owners. They are completely asynchronous. They consume events, perform logic, and produce new events to record the results and trigger downstream processing.

We use these services to perform complex event processing and orchestrate business processes. They leverage the systemwide event sourcing pattern and rely on the ACID 2.0 properties (Associative, Commutative, Idempotent, and Distributed). In the Delivery subsystem of our example Food Delivery System, we might have a control service that implements a state machine to orchestrate the delivery process under many different circumstances. Control services typically account for about 10% of the services in a subsystem.

A listener function consumes lower-order events from the event hub and correlates and collates them in a micro events store. A trigger function applies rules to the correlated events and publishes higher-order events back to the event hub.
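
As a rough sketch, the trigger side of the delivery control service might apply a rule like this; the event types and correlation structure are hypothetical:

// Hypothetical shape of the correlated, collated events in the micro events store
interface CorrelatedEvents {
  correlationKey: string; // e.g., the order id
  events: { type: string; timestamp: number }[];
}

// Trigger: apply a rule and emit a higher-order event when the conditions are met
export const evaluate = (correlated: CorrelatedEvents) => {
  const types = correlated.events.map((e) => e.type);
  return types.includes('order-placed') && types.includes('driver-assigned')
    ? { type: 'delivery-ready', timestamp: Date.now(), orderId: correlated.correlationKey }
    : undefined; // keep waiting for more lower-order events
};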

We will cover this pattern in detail in Chapter 8, Reacting to Events with More Events.

Now, let’s look at the anatomy of an autonomous service.

Dissecting an autonomous service

Up to this point, we have discussed using the SRP as a guide for defining architectural boundaries that help ensure the system has the flexibility to change with the needs of the different actors. We have covered dividing a system into autonomous subsystems and decomposing an autonomous subsystem into autonomous services. Now we move on to the anatomy of an individual autonomous service.

Each autonomous team has the ultimate responsibility for making the decisions that are best for the services they own. Embracing a polyglot-everything mindset and empowering the teams to make these decisions gives them the freedom they need to maximize innovation. Still, every service needs a starting point to jump-start the process of discovery and continuous improvement. The following sections cover the common elements that go into the implementation of autonomous services.

You can find a service template here: https://github.com/jgilbert01/templates/tree/master/template-bff-service.

One of the most interesting things to note is that there is much more to a service than just its runtime code.

Repository

Each service has its own source code repository. This is due, in part, to the fact that modern distributed source control tools, such as Git, make it very easy to create and distribute new repositories. In addition, hosted offerings drive this point home by making the repository the focal point of their user experience. Furthermore, modern CI/CD pipeline tools assume that the repository is the unit of deployment. All these factors steer us toward this best practice.

Yet, the most important reason that each service has its own repository is autonomy. We want to drive down our lead times, and sharing a repository with other teams will certainly cause friction and slow teams down. Separate repositories also act as bulkheads and shield teams from mistakes made by other teams.

They also protect us from ourselves in that we cannot accidentally create a dependency on the source code owned by another team, just because it is in the same repository. Instead, we must purposefully and explicitly create shared libraries, as discussed below, that will have their own repositories and release cycles.

CI/CD pipeline and GitOps

Each service has its own CI/CD pipeline, as defined by a configuration file in the root of its repository. Modern CI/CD pipelines embrace the concept of GitOps, which is the practice of using Git pull requests to orchestrate the deployment of infrastructure.

The pipeline hooks into the state changes of the repository and pull requests to trigger and coordinate CI/CD activities. Each push to a repository triggers CI tests to ensure that the code is behaving as expected. The creation of a pull request triggers deployment to the non-production environment and signals that the code is ready for review. Approval of a pull request triggers a production deployment.

This is a very powerful approach that becomes even stronger when combined with the concepts and practices of decoupling deployment from release, multiple levels of planning, task branch flow, and regional canary deployments. We will cover this in detail in Chapter 11, Choreographing Deployment and Delivery.

Tests

Automated testing plays a vital role in giving teams the confidence to continuously deploy. To this end, test cases make up the majority of the code base for a feature, and unit tests make up the majority of the test cases. We execute unit tests, integration tests, contract tests, and transitive end-to-end tests in the CI/CD pipeline in isolation from all external resources. We will cover testing in Chapter 11, Choreographing Deployment and Delivery.
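
For example, a unit test for the control service rule sketched earlier might look like this, assuming a Jest-style test runner and a hypothetical module path:

import { evaluate } from '../src/delivery-rules'; // hypothetical module under test

// Unit test: exercise the model in isolation from all cloud resources
test('emits the higher-order delivery-ready event', () => {
  const result = evaluate({
    correlationKey: 'order-1',
    events: [
      { type: 'order-placed', timestamp: 1 },
      { type: 'driver-assigned', timestamp: 2 },
    ],
  });
  expect(result?.type).toBe('delivery-ready');
});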

Stack

We deploy each service as a set of cloud resources that we will refer to as a stack. We declaratively define the resources of a stack in a serverless.yml configuration file in the root of the repository. We use the Serverless Framework to initiate deployments and execute the cloud provider’s deployment management service, such as AWS CloudFormation. The deployment management service manages the life cycle of the resources as a group. It compares the current state of the stack to the latest declarations and applies any changes. It adds new resources, updates existing resources, and deletes removed resources. And it deletes all resources when we delete the stack to ensure there are no orphaned resources. We use diagrams such as Figure 2.10 to depict the main resources in a service (that is, a stack):

Figure 2.10: A typical service stack

The gray box logically equates to the service and physically equates to both the repository and the stack. The icons represent the cloud resources within the stack. We place icons next to each other to indicate communication. This results in tighter diagrams. The nearest arrow implies the flow of communication. A legend for all the icons is available in the preface.

We will cover more details of creating a stack in Chapter 6, A Best Friend for the Frontend, and Chapter 11, Choreographing Deployment and Delivery.

Persistence

Each service will own and manage its own data. Following polyglot persistence practices, each service will use the type of database that best supports its needs. These serverless resources are managed as part of the stack.

Services will consume domain events from upstream services and cache the necessary data as lean materialized views. The high availability of these serverless data stores creates an inbound bulkhead that ensures necessary data is available even when upstream services are not. This also greatly improves data access latency.

We will leverage the Change Data Capture (CDC) mechanism of a data store to trigger the publishing of domain events when the state of the data changes. We can also use CDC to control the flow of data within a service.
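
On AWS, a minimal sketch of a CDC-driven trigger might consume a DynamoDB stream like this; the event type mapping is hypothetical, and publishing to the event hub is elided:

import { unmarshall } from '@aws-sdk/util-dynamodb';
import type { DynamoDBStreamEvent } from 'aws-lambda';

// Trigger: adapt CDC records from the entities table into domain events
export const trigger = async (event: DynamoDBStreamEvent) => {
  const domainEvents = event.Records
    .filter((record) => record.eventName === 'INSERT')
    .map((record) => ({
      type: 'order-submitted', // hypothetical event type
      timestamp: Date.now(),
      order: unmarshall(record.dynamodb!.NewImage as any), // the new state of the row
    }));
  // in a full implementation, we would publish these to the event hub here
  return domainEvents;
};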

We will cover the details of the persistence layer in Chapter 5, Turning the Cloud into the Database.

Trilateral API

Each service will have up to three APIs: one for the events it consumes, another for the events it produces, and one for its synchronous interface. Not all these interfaces are required. Most services will consume events, but not all will publish events. For example, a BFF service that provides a read-only view of data would only consume events, cache the data, and provide a synchronous API to access the data. Control services and most ESGs do not have a synchronous interface.

Events

Following our event-first approach, the APIs for events are the most important, because they dictate how a service will interact with other services. A service should document the events it consumes and those that it produces. This could be as simple as a listing in the README file at the root of the repository. We can document the JSON structure of internal domain events using TypeScript interface notation. For external domain events, a standard such as OpenAPI (https://www.openapis.org) or JSON Schema (https://json-schema.org) may be helpful. The cloud infrastructure may provide a registry service to capture the schemas of all the event types in the subsystem. You can also use a tool like Event Catalog (https://www.eventcatalog.dev) to make it easier to explore the producers and consumers in your subsystem. We will cover events in detail in Chapter 4, Trusting Facts and Eventual Consistency.
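
For example, a README might document a hypothetical internal domain event like this:

// README excerpt: an internal domain event in TypeScript interface notation
interface DeliveryStatusChanged {
  id: string; // unique event id
  type: 'delivery-status-changed';
  timestamp: number; // epoch milliseconds
  partitionKey: string; // e.g., the order id
  delivery: {
    orderId: string;
    driverId: string;
    status: 'assigned' | 'picked-up' | 'delivered';
  };
}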

API Gateway

Services operating at the boundaries of the system, such as BFFs, will have a synchronous interface. We will implement these using an API Gateway.

We design the API of a BFF specifically for a single frontend micro-app. One team owns the frontend and the BFF, so official documentation for the API may not be necessary. However, we may use a self-documenting API, such as GraphQL. We will cover BFFs in detail in Chapter 6, A Best Friend for the Frontend.

Some ESG services will also require an API Gateway, such as implementing a webhook to receive events from a third-party system or providing an Open API for your own SaaS system. We will cover the ESG pattern in detail in Chapter 7, Bridging Intersystem Gaps.

Functions

We will implement the business logic of an autonomous service as serverless functions using the cloud provider’s Function-as-a-Service (FaaS) offering, such as AWS Lambda. It is important to point out that while each function is independent, we will manage functions as a group within the stack that owns all the resources of the service. To account for this distinction, we will use the hexagonal architecture that we introduced in the Building on proven concepts section.

We will architect our functions at two levels, the nano or function level and the micro or service level. In other words, we need to architect the code within each function and architect the functions within each service. Now let’s dig further into the nano and micro architecture of our serverless functions.

Nano architecture

At the function or nano level, we scale our hexagonal architecture down so that each serverless function cleanly executes a fragment of the business logic within an autonomous service. Figure 2.11 depicts the structure and purpose of the code within an individual serverless function.

Figure 2.11: Function-level – nano hexagonal architecture

The FaaS service (that is, lambda) invokes a serverless function and dictates the signature of the input parameters. We do not want this signature to pollute our business logic. In turn, our business logic makes the outbound calls to cloud services, such as a bus or datastore. Again, we do not want the signatures of these cloud services to pollute our business logic. So, we will separate the business logic into a model and isolate these dependencies in adapters.

We will implement the business logic (that is, the Model) as classes and functions that expose cloud-agnostic interfaces (that is, ports). We will implement a handler function that adapts the lambda signature to the model. The handler will inject the model with a connector class that adapts the model’s outbound calls to the signature of the cloud service, such as the bus.
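
A minimal sketch of this arrangement, with hypothetical names and AWS EventBridge standing in for the bus:

import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

// Port: the cloud-agnostic interface that the model depends on
interface DomainEvent { type: string; [key: string]: unknown }
interface Publisher { publish(event: DomainEvent): Promise<void>; }

// Model: pure business logic with no cloud-specific signatures
class OrderModel {
  constructor(private publisher: Publisher) {}
  async submit(order: { id: string }) {
    // ...domain rules go here...
    await this.publisher.publish({ type: 'order-submitted', order });
  }
}

// Connector: adapts the model's outbound calls to the bus signature
class BusConnector implements Publisher {
  private client = new EventBridgeClient({});
  async publish(event: DomainEvent) {
    await this.client.send(new PutEventsCommand({
      Entries: [{
        EventBusName: process.env.BUS_NAME, // assumed environment variable
        Source: 'custom',
        DetailType: event.type,
        Detail: JSON.stringify(event),
      }],
    }));
  }
}

// Handler: adapts the lambda signature to the model
export const handle = async (lambdaEvent: { body: string }) =>
  new OrderModel(new BusConnector()).submit(JSON.parse(lambdaEvent.body));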

This nano architecture will allow us to easily move the model to a different runtime environment or even another cloud provider by substituting different implementations for the handlers and connectors. In Chapter 11, Choreographing Deployment and Delivery, we will see how this architecture facilitates a serverless testing honeycomb.

Micro architecture

At the service or micro level, we scale our hexagonal architecture to show how multiple serverless functions work together to implement the business logic of the autonomous service. Figure 2.12 depicts the structure and purpose of the resources within an autonomous service.

Figure 2.12: Service-level – micro hexagonal architecture

This diagram presents an expanded format of the same BFF service depicted in condensed format in Figure 2.10 to highlight the different roles played by the various resources.

We store the internal domain model in a dedicated datastore (entities) so that all the functions can share the data. The format of this data defines the service’s interface (that is, its ports), and the serverless functions act as adapters that map between the internal model and the external model.

An autonomous service will typically have two to five serverless functions. Each function has its own nano architecture with the appropriate handler and connector based on the cloud services it interacts with, such as AWS Kinesis, AWS API Gateway, AWS DynamoDB, AWS DynamoDB Streams, or AWS EventBridge.

The listener function sits on the driving side because it consumes domain events from a channel and collects the data that is a prerequisite for the capabilities of the service. It adapts the data from the incoming domain events to the internal domain model and saves the results in the local entities datastore.

The command/query function sits on the driving side because it supports the user actions that drive work through the subsystem. The nano model implements the business logic for the queries and commands and the function adapts the model for frontend communication.

The trigger sits on the driven side because it reacts to the state changes within the service. It consumes internal change events from the datastore’s CDC stream. It adapts the data from the change events to the domain event model and publishes the results.

The nano and micro levels work together to keep our architecture clean and decoupled. They enable the Command, Publish, Consume, Query (CPCQ) pattern that we use to connect services together to carry out the work of an autonomous subsystem. Work starts upstream. A user performs an action, and a command function updates the state of the domain model. Then a trigger function publishes a domain event to record the fact that the action occurred and drive further processing. Downstream, one or more listener functions take responsibility for consuming the domain event and cache the needed data so that their users can query for the information they need to perform the next action. We repeat this simple pattern as many times as necessary to deliver the desired functionality of an autonomous subsystem.
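
The following purely illustrative, in-memory sketch compresses the four CPCQ steps into four functions; in a real subsystem, each step lives in its own function and, usually, its own service:

// CPCQ in miniature: four steps compressed into four pure functions
type DomainEvent = { type: string; timestamp: number; orderId: string };

const command = (orderId: string) =>                                  // 1. Command mutates state
  ({ pk: orderId, status: 'SUBMITTED' });
const publish = (row: { pk: string }): DomainEvent =>                 // 2. Trigger publishes the fact
  ({ type: 'order-submitted', timestamp: Date.now(), orderId: row.pk });
const consume = (view: Map<string, DomainEvent>, e: DomainEvent) =>   // 3. Listener caches the data
  view.set(e.orderId, e);
const query = (view: Map<string, DomainEvent>, orderId: string) =>    // 4. Query serves the next action
  view.get(orderId);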

We will see examples of stream processor functions in Chapter 4, Trusting Facts and Eventual Consistency. We will also see examples of REST and GraphQL functions in Chapter 6, A Best Friend for the Frontend.

Shared libraries

We will tend to shy away from using shared libraries for reusing business logic between services. This will help us avoid the false reuse that creates coupling and complexity. We will use open-source libraries for crosscutting concerns. Over time, duplicated logic can be refactored into libraries when it proves to not be false reuse. We will cover false reuse in Chapter 13, Don’t Delay, Start Experimenting.

Now that we have defined our architectural boundaries, let’s see how we can let go and govern without impeding innovation.

Governing without impeding

As architects, once we have defined the architectural boundaries of the system, we need to let go and get out of the way, unless we want to become an impediment to innovation. But letting go is difficult. It goes against our nature; we like to be hands-on. And it flies in the face of traditional governance techniques. But we must let go for the sake of the business, whether the business realizes this or not.

Governance has an understandable reputation for getting in the way of progress and innovation. Although it has good intentions, the traditional manual approach to governance actually increases risk, instead of reducing it, because it increases lead time, which diminishes an organization’s ability to react to challenges in a modern dynamic environment. But it doesn’t have to be this way.

We have already taken major strides to mitigate the risks of continuous innovation. We define architectural boundaries that limit the scope of any given change, and we fortify these boundaries to control the blast radius when honest human errors happen. We do this because we know that to err is human. We know that mistakes are inevitable, no matter how rigorous a governance process we follow.

Instead of impeding innovation, we must empower teams with a culture and a platform that embrace continuous governance. This safety net gives teams and management the confidence to move forward, knowing that we can catch mistakes and make corrections in real time. Automation and observability are the key elements of continuous governance. Let’s see how we can put this safety net in place and foster a culture of robustness.

Providing automation and cross-cutting concerns

A major objective of governance is to ensure that a system is compliant with regulations and best practices. These include the typical -ilities, such as scalability and reliability, and of course security, along with standards and regulations such as NIST, PCI, GDPR, and HIPAA. The traditional approach includes manual audits of the architecture. These gates are the reason governance has a reputation for impeding progress. They are labor intensive and, worse yet, error prone.

Fortunately, we now have a better option. Our deployments are fully automated by our CI/CD pipelines. This is already a significant improvement in quality because Infrastructure as Code reduces human error and enables us to quickly fail forward. We still have some manual gates for each deployment.

The first gate is code review and approval of a pull request. We perform this gate quickly because each task branch has a small batch size. The second gate is the certification of a regional canary deployment. We deploy to one region for continuous smoke testing before deploying to other regions. We will cover CI/CD pipelines in detail in Chapter 11, Choreographing Deployment and Delivery.

We also have observability, which provides timely, actionable information so that we know when to jump into action and we can recover quickly. We will cover this in Chapter 12, Optimizing Observability. We will take automation further and harden our build processes by adding continuous auditing and securing the perimeter of our subsystems and our cloud accounts. We will cover these topics in Chapter 10, Securing Autonomous Subsystems in Depth.

However, these are all cross-cutting concerns, and we don’t want teams to reinvent these capabilities for each autonomous subsystem. We need a dedicated team with the knowledge and specialized skills to manage an integrated suite of SaaS tools, stamp out accounts with a standard set of capabilities, and maintain these cross-cutting concerns for use across the accounts. Yet, the owners of each autonomous subsystem must have control over when to apply changes to their accounts and have the flexibility to override and/or enhance features as their circumstances dictate.

Even with these cross-cutting concerns in place, the reality is that many aspects of the approach and architecture are new and unfamiliar, so the next part of the governance equation is promoting a culture of robustness.

Promoting a culture of robustness

Our goal of increasing the pace of innovation leads us to a rapid feedback loop with small batch sizes and short lead times. We are deploying code much more frequently and these deployments must result in zero downtime. To eliminate downtime, we must uphold the contracts we have defined within the system. However, traditional versioning techniques fall apart in a dynamic environment with a high rate of change. Instead, we will apply the Robustness principle.

The Robustness principle states, “Be conservative in what you send, be liberal in what you receive.” This principle is well suited to continuous deployment, where we can perform a successive set of deployments to make a conforming change on one side of a contract, followed by an upgrade on the other side, and then another on the first side to remove the old code. The trick is to develop a culture of robustness where this three-step dance is committed to team muscle memory and becomes second nature.
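
For example, suppose we want to rename a hypothetical name field to customerName in a domain event. The consumer goes first:

// Step 1 of the three-step dance: the consumer becomes liberal first,
// accepting both the old field (name) and the new field (customerName).
// The producer then switches to the new field (step 2), and finally the
// fallback is removed from the consumer (step 3).
interface LoosePayload { customerName?: string; name?: string }

export const customerNameOf = (detail: LoosePayload) =>
  detail.customerName ?? detail.name;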

In Chapter 11, Choreographing Deployment and Delivery, we will cover a lightweight continuous delivery process that is geared for robustness. It includes three levels of planning, GitOps, CI/CD pipelines, regional canary deployment, and more. It forms a simple automated bureaucracy that governs each deployment but leaves the order of deployments completely flexible.

In my experience, autonomous teams are eager to adopt a culture of robustness, especially once they get a feel for how much more productive and effective they can become. But this is a paradigm shift, and it is unfamiliar from a traditional governance perspective. Everyone must have the confidence to move at this pace. As architects, we need to be evangelists and promote this cultural change, both upstream and downstream. We need to educate everyone on how everything we are doing comes together to provide a safety net for continuous discovery.

Finally, let’s see how metrics can guide governance.

Harnessing the four key team metrics

Observability metrics are an indispensable tool in modern software development. We cover this topic in detail in Chapter 12, Optimizing Observability. Autonomous teams are responsible for leveraging the observability metrics of their apps and services as a tool for self-governance and self-improvement. In my experience, teams truly value these insights and thrive on the continuous feedback.

From a systemwide governance perspective, we should focus our energy on helping teams that are struggling. In Measure Software Delivery Performance with Four Key Metrics (https://itrevolution.com/articles/measure-software-delivery-performance-four-key-metrics), Nicole Forsgren, Gene Kim, and Jez Humble put forth four metrics that we can harness to help us identify which teams may need more assistance and mentoring:

  • Lead time: How long does it take a team to complete a task and push the change to production?
  • Deployment rate: How many times a day is a team deploying changes to production?
  • Failure rate: How often does a deployment result in a failure that impacts a generally available feature?
  • Mean Time to Recovery (MTTR): When a failure does occur, how long does it take the team to fail forward with a fix?

The answers to these questions clearly indicate the maturity of a specific team. We certainly prefer lead time, failure rate, and MTTR to be low and deployment rate to be high. Teams that are having trouble with these metrics are usually going through their own digital transformation and are eager to receive mentoring and coaching. We can collect metrics from our issue-tracking software and CI/CD tool and track them alongside all the others in our observability tool.

Summary

In this chapter, we learned how to define architectural boundaries that enable change. The key is understanding that people (that is, actors) are the source of change and then aligning our boundaries along these axes of change. We found that event-first thinking, with its focus on verbs and facts, naturally helps us with this alignment.

We brought this all together to divide a system into autonomous subsystems, and these into autonomous services. We saw how we can create arbitrarily complex systems following this simple fractal pattern of autonomous components. Then, we dissected the anatomy of autonomous services and learned how to leverage observability and automation to govern without impeding innovation.

In the next chapter, we will start digging into the details of our architecture. We will learn how the micro frontend approach, along with other new techniques, helps to bring the seemingly endless churn at the presentation layer under control.



