Reliability is an increasingly popular topic in the world of distributed systems. Job postings for Site Reliability Engineers (SREs) and chaos engineers are becoming common, and as more organizations move toward cloud-native technologies, it is becoming impossible to ignore that system failure is always a reality. Networks will experience congestion, switches and other hardware components will fail, and a whole host of unanticipated failure modes will surprise us in production. It is impossible to prevent failures completely, so we should design our systems to be as tolerant of failure as possible.
Microservices provide interesting and useful opportunities to design for reliability. Because microservices encourage us to break our systems into services encapsulating single responsibilities, we can use a number of useful reliability...