Fault tolerance and fault injection
The concept of fault tolerance – that is, the ability of an application, platform, or runtime to tolerate a systemic fault – by itself seems a simple enough concept to grasp. After all, you would expect an application to be able to gracefully recover if certain services were not available. In many cases, though, applications have been written with an understanding that the underlying infrastructure that hosts it is always available unless a catastrophic event occurs. While this reliability may be built into on-premises data centers and rarely challenged, the same assumption does not hold for cloud platforms, services, and components.
Though cloud platforms will offer certain Service-Level Agreements (SLAs) for uptime on some cloud services, there is always the possibility of a service-level or region-level outage that can come with no warning and vary widely in impact. Therefore, it is important to keep these types of outages in...