Problems with distributed systems
The distributed design approach is by no means a silver bullet: it introduces new problems and magnifies existing ones. I am going to discuss some higher-level problems and leave out low-level issues, such as transport issues (packet loss, network latency), to focus on the stack of a typical software engineer.
Complexity in design
As such systems consist of many endpoints, we have new challenges to worry about.
A broader set of skills
Bringing a distributed system to life requires extensive skills within the development team as well as the operations team. Adding new dependencies to a single service also requires an understanding of how the components involved are distributed, to keep the system healthy and able to respond to requests in a reasonable time.
Testing
Before you ship a system it needs to be tested. Testing does not stop with a single service, but is done for the complete environment, which becomes challenging when you need to ensure the consistency of the environment for manual and automated testing. Differences between staging systems and live systems, such as different framework versions, can also be a problem.
A pragmatic approach in the long run is to incorporate monitoring to easily spot anomalies in the flow of operations, as bugs can have amplified repercussions on the system.
Rollout
The rollout of such a dynamic environment should be handled by a fully automated deployment process, leaving as little room as possible for manual mistakes.
Operating overhead
Splitting a monolith into multiple processes may start with a certain number of service instances. Once you add failover protection, load balancing, and messaging, the number of instances can easily increase, and keeping such a system running becomes a really challenging task.
Tracing
In a distributed system, you cannot simply solve an issue by inspecting a single process. Log files are spread over multiple places, so each entry needs a correlation identifier that lets you track down a request and its problems.
There are many solutions out there to help you to manage and centralize logging.
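The idea of a correlation identifier can be sketched in a few lines. The example below is a minimal illustration, not a recommendation for a particular logging product: the service names orders and payments, the functions handle_order and handle_payment, and the message fields are all hypothetical. The entry point creates an identifier, every later service reuses it, and each log line carries it.

```python
import io
import logging
import uuid


def make_logger(service_name, stream):
    """Return a logger whose lines include the service name and a correlation ID."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s [%(correlation_id)s] %(message)s")
    )
    logger.handlers = [handler]
    return logger


def handle_order(request, log):
    # Hypothetical entry point: reuse the caller's ID or create a new one.
    correlation_id = request.get("correlation_id") or uuid.uuid4().hex
    log.info("order received", extra={"correlation_id": correlation_id})
    # The ID travels with the message to the next service.
    return {"correlation_id": correlation_id, "amount": request["amount"]}


def handle_payment(request, log):
    # Hypothetical downstream service: logs with the ID it was handed.
    log.info(
        "charging %s",
        request["amount"],
        extra={"correlation_id": request["correlation_id"]},
    )
    return request["correlation_id"]
```

Once the logs of both services are collected in one place, searching for a single identifier returns every line that belongs to the same request.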
Contracts
To ensure valid communication between two services, you need a contract for a message format and a basic understanding of it. Any one-sided change to this contract will break the communication; therefore, changes need to be released in a coordinated way.
A basic solution to this problem is to add versions to the messages, which is basically a way to introduce backward compatibility to the system. As we all know, business sometimes calls for partial rollouts that leave components out of sync, and versions are no longer the magic bullet in such a case.
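One way to picture message versioning is a receiver that upgrades older messages to the shape it currently understands. The sketch below assumes a JSON envelope with version and body fields and a made-up change between version 1 and version 2 (a single name field split into first_name and last_name); none of these names come from a real contract.

```python
import json


def upgrade_v1_to_v2(body):
    # Hypothetical contract change: v2 splits "name" into two fields.
    first, _, last = body["name"].partition(" ")
    return {"first_name": first, "last_name": last}


# Upgrade functions keyed by the version they upgrade from.
UPGRADERS = {1: upgrade_v1_to_v2}
CURRENT_VERSION = 2


def decode(raw):
    """Parse a message and upgrade its body step by step to CURRENT_VERSION."""
    message = json.loads(raw)
    version, body = message["version"], message["body"]
    if version > CURRENT_VERSION:
        # A newer sender than receiver: versions alone cannot fix this.
        raise ValueError(f"unsupported message version {version}")
    while version < CURRENT_VERSION:
        body = UPGRADERS[version](body)
        version += 1
    return body
```

Old senders keep working because the receiver accepts their format, but a message from a newer version is still rejected, which is exactly the out-of-sync situation a partial rollout can produce.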
Issues at runtime
While running our system, we might come across many issues, and we need to learn from these hurdles to perfect our system. Here are some of the problems we might face:
(Un)atomicity of operations
An operation in a distributed system is by no means guaranteed to be atomic, as it might be split into several subtasks that can be executed in parallel or sequentially across service borders.
This calls for some mechanism of distributed transactions to revoke preceding actions when an essential subtask fails. The same effect can be achieved by queuing entities in a staged pool and releasing them to the live system once all operations have been applied successfully, or invalidating the changes otherwise.
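Revoking preceding actions is often implemented with compensating actions: every subtask is paired with a function that undoes it. The following is a minimal single-process sketch of that idea, not a full distributed transaction protocol. The reserve, release, charge, and refund functions and the stock dictionary are hypothetical stand-ins for calls to other services.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order; undo completed ones on failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Revoke the preceding actions in reverse order.
        for compensate in reversed(completed):
            compensate()
        return False
    return True


# Hypothetical subtasks standing in for calls to other services.
stock = {"widget": 5}


def reserve():
    stock["widget"] -= 1


def release():
    stock["widget"] += 1


def charge():
    raise RuntimeError("payment service unavailable")


def refund():
    pass  # Never reached in this example, because charge() fails.
```

Calling run_with_compensation([(reserve, release), (charge, refund)]) returns False and leaves the stock unchanged, because the reservation is released once the charge fails. In a real system the compensations can fail as well, so they usually need retries of their own.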
A shared register
When multiple components share the same entity, such as credentials, the register needs to be synchronized so that the same data is available in multiple processes and hard faults, which fall back to a common database, are minimized. Another issue originates in the asynchronous behavior of such systems, which makes them vulnerable to lost updates: when component A and component B update the same entity at the same time, one of the updates can silently overwrite the other.
If the components do not have a shared register but rather solve this issue by implementing synchronization, there is a need to introduce a notification upon changes.
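A common guard against lost updates is an optimistic version check: a write is accepted only if the entity still has the version the writer originally read. The sketch below uses an in-memory class, VersionedStore, as a stand-in for the shared register; the class, its methods, and the StaleWriteError exception are illustrative names. A real store would also have to perform the compare-and-write step atomically.

```python
class StaleWriteError(Exception):
    """Raised when a write is based on an outdated version of the entity."""


class VersionedStore:
    """In-memory stand-in for a shared register that keeps a version per key."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else updated the entity since the caller read it.
            raise StaleWriteError(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1
```

If components A and B both read version 1 and A writes first, B's write is rejected instead of silently overwriting A's change. B can then read the entity again and decide how to proceed.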
Performance
Besides the performance of a single service, there's a natural overhead in the communication when you have to marshal the request and response instead of just working on a reference in the same process.
When trying to resolve a bottleneck, it is important not to rely on blindfolded guessing, which is wrong most of the time. It is better to base the decision on investigation, even if that is hard to do in a distributed system.
Methods for inspecting performance are covered in Chapter 4, Analyzing and Tuning a Distributed System.
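Even the marshalling overhead mentioned above can be measured rather than guessed. The following sketch compares an in-process call with the same call wrapped in a JSON round trip, which only simulates the encoding cost of a service hop without any network. The order structure and the function names are assumptions made for this example.

```python
import json
import timeit


def total_in_process(order):
    # Works directly on the reference; nothing is copied or encoded.
    return sum(item["price"] * item["quantity"] for item in order["items"])


def total_via_marshalling(order):
    # Simulated service hop: encode and decode the request and the response.
    request = json.loads(json.dumps(order))
    response = json.dumps({"total": total_in_process(request)})
    return json.loads(response)["total"]


def compare(order, runs=1000):
    """Return (in_process_seconds, marshalled_seconds) for `runs` calls each."""
    direct = timeit.timeit(lambda: total_in_process(order), number=runs)
    marshalled = timeit.timeit(lambda: total_via_marshalling(order), number=runs)
    return direct, marshalled
```

The absolute numbers depend on the machine and the payload size, so they are only meaningful relative to each other and to measurements taken on the real system.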