Long chains of synchronous calls – the retry storm
When microservices call each other synchronously in a long chain, any one of them may take longer to respond than expected, causing its callers to time out. Each timeout triggers retry requests in the hope that the operation will succeed on a later attempt, and the resulting flood of retries can eventually make the system unusable. This scenario is known as a retry storm, as depicted in the following diagram:
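Apart from the diagram, the amplification behind a retry storm can be put into rough numbers. The following is a minimal sketch, not a model of any real system: it assumes a hypothetical three-service chain (`order`, `payment`, `invoice`; only the invoice service is named in this section), assumes that every caller, including the client, makes up to a fixed number of attempts, and assumes the worst case in which the deepest service times out on every request.

```python
from collections import Counter


def call(chain, level, attempts, calls):
    """Simulate one inbound request to chain[level]; return True on success."""
    calls[chain[level]] += 1
    if level == len(chain) - 1:
        # Assumption: the deepest service is overloaded, so every request times out.
        return False
    # The service calls the next one in the chain, retrying on each timeout.
    for _ in range(attempts):
        if call(chain, level + 1, attempts, calls):
            return True
    return False


def simulate(chain, attempts):
    """Count the requests each service receives for one client operation."""
    calls = Counter()
    # Assumption: the client also makes up to `attempts` attempts.
    for _ in range(attempts):
        if call(chain, 0, attempts, calls):
            break
    return dict(calls)


if __name__ == "__main__":
    # Hypothetical chain; 3 attempts means 1 original call plus 2 retries.
    print(simulate(["order", "payment", "invoice"], attempts=3))
```

Under these assumptions, a service at depth d in the chain receives up to attempts^d requests for a single client operation, so the service that is already struggling is the one that absorbs the most additional load.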
In the preceding diagram, the checkout operation is performed by calling a series of microservices. Each microservice call has its own ETA, with enough buffer to absorb adverse conditions. However, for some reason, the invoice microservice is under heavy load and can no longer meet its ETA. The other services are unaware of this change and are expecting the same ETA from...