High reliability
Reliability is a measure of the confidence that a system will perform as intended: the higher the probability of failure, the lower the reliability.
Reliability is measured using several metrics:
- Mean time between failures (MTBF): Total uptime divided by the number of failures
- Mean time to repair (MTTR): The average time it takes the team to fix a failure and bring the system back online
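As a rough illustration, both metrics can be computed from an incident log. The sketch below assumes a hypothetical list of outages, each with a start time and a time at which service was restored, plus an observation window. The names `Incident`, `mtbf`, and `mttr` are illustrative and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage (hypothetical record): when it began and when service was restored."""
    started: datetime
    resolved: datetime

    @property
    def downtime(self) -> timedelta:
        return self.resolved - self.started


def total_downtime(incidents: list[Incident]) -> timedelta:
    # sum() needs an explicit timedelta start value; the default start of 0 is an int.
    return sum((i.downtime for i in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to repair: total downtime divided by the number of failures."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no failures")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window_start: datetime, window_end: datetime) -> timedelta:
    """Mean time between failures: total uptime divided by the number of failures."""
    if not incidents:
        raise ValueError("MTBF is undefined when there are no failures")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


# Example: a 10-day (240-hour) window with two outages lasting 1 hour and 3 hours.
incidents = [
    Incident(datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 1, 0)),
    Incident(datetime(2024, 1, 5, 0, 0), datetime(2024, 1, 5, 3, 0)),
]
print(mttr(incidents))                                              # 2:00:00
print(mtbf(incidents, datetime(2024, 1, 1), datetime(2024, 1, 11)))  # 4 days, 22:00:00
```

In this example the system was up for 236 of the 240 hours, so MTBF is 236 / 2 = 118 hours, and MTTR is 4 / 2 = 2 hours.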
Testing for reliability
One of the most direct ways to increase reliability is to increase the system's test coverage, provided that the tests are meaningful.
Tests increase reliability by:
- Increasing MTBF: The more thorough your tests, the more likely you are to catch bugs before the system is deployed.
- Reducing MTTR: Historical test results identify the last version that passed all tests. If the application is experiencing a high rate of failures, the team can quickly roll back to that last-known-good version.
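The rollback idea can be sketched as a lookup over historical test results. This is a minimal illustration that assumes a hypothetical record of test runs, ordered oldest to newest; the `BuildResult` type and `last_known_good` function are made-up names, not part of any CI system's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildResult:
    """Test outcome for one deployed version (hypothetical record)."""
    version: str
    passed: int
    failed: int


def last_known_good(history: list[BuildResult]) -> Optional[str]:
    """Return the most recent version whose test run had no failures.

    Assumes `history` is ordered oldest to newest. Runs with zero passing
    tests are skipped, since an empty test run proves nothing.
    """
    for result in reversed(history):
        if result.failed == 0 and result.passed > 0:
            return result.version
    return None


history = [
    BuildResult("1.4.0", passed=120, failed=0),
    BuildResult("1.4.1", passed=121, failed=0),
    BuildResult("1.5.0", passed=119, failed=4),  # current, failing release
]
print(last_known_good(history))  # 1.4.1
```

Because the answer is already in the test history, the team does not have to investigate which release to roll back to during an incident, which is where the reduction in repair time comes from.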