Detecting failures and healing
Most software systems fail at some point, however much effort goes into testing them, which suggests that testing has limits. These limits stem from a few facts about non-trivial systems. Any non-trivial system interacts with its environment, and it is not practical (and in many cases not possible) to enumerate every environment in which the system will run. It is also usually possible to test that a system behaves as expected, but much harder to write tests showing that it never behaves unexpectedly. Concurrency adds further complexity: a program that passed its tests for a particular scenario may still fail for that same scenario in production, because the interleaving of operations can differ from one run to the next.
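The concurrency point can be made concrete with a small, deterministic sketch. It is only an illustration under stated assumptions: the two workers "A" and "B", the shared counter, and the helper names run_schedule and valid_schedules are all hypothetical and do not come from any particular system. Each worker performs a non-atomic increment (a read followed by a write), and the code enumerates every order in which those steps could interleave.

```python
from itertools import permutations


def run_schedule(schedule):
    """Run two workers that each do a non-atomic increment (read, then write)
    on a shared counter, in the step order given by `schedule`."""
    counter = 0
    local = {}  # each worker's private copy of the value it read
    for worker, step in schedule:
        if step == "read":
            local[worker] = counter
        else:  # "write": store the stale-or-fresh local value plus one
            counter = local[worker] + 1
    return counter


def valid_schedules():
    """Yield every interleaving in which each worker reads before it writes."""
    steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
    for order in permutations(steps):
        if (order.index(("A", "read")) < order.index(("A", "write"))
                and order.index(("B", "read")) < order.index(("B", "write"))):
            yield order


results = [run_schedule(s) for s in valid_schedules()]
correct = results.count(2)      # both increments were applied
lost_update = results.count(1)  # one increment overwrote the other
print(f"{correct} schedules give 2, {lost_update} schedules lose an update")
# prints "2 schedules give 2, 4 schedules lose an update"
```

A test that happens to exercise only one of the two well-behaved schedules would pass every time, even though four of the six possible schedules produce the wrong result. Real schedulers choose the interleaving for you, which is why the same scenario can pass in testing and fail in production.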
In other words, no matter how thoroughly you test, any sufficiently complex program will eventually fail. It therefore makes sense to architect systems for graceful failure and quick recovery. Part of this architecture...