System design
IT systems that offer high availability of services must follow a specific system design approach by which they can provide the most available continuous operation. The fundamental rule of a high-availability system design approach is to avoid single points of failure. A single point of failure is a component of a system that could lead to system and service downtime if it fails. The design should avoid single points of failure, which makes the system more robust and automatically increases system and service availability.
A complex IT system providing application services can have a large number of single points of failure at different levels, but how can we eliminate all of them? The solution is redundancy. Redundancy means duplication of the system's critical components. Duplication of devices allows continuous system operation even if one of the duplicated devices fails. There are two types of redundancy: passive and active. Passive redundancy means using two or more devices while only one of them provides its service at certain point in time. The other devices wait to take over in the case of an unrecoverable failure of the operating device. Active redundancy means using two or more devices, all providing their service at all times. Even if one of the devices fails, other devices are continuously providing the service.
Let me try to explain single points of failure and redundancy with a practical example. Let's say you are hosting a simple website on your personal home computer. The computer is located at your home, hidden in your storage closet. It is happily providing a website service for end users. It is always on and users can access the website any time of the day. If you and I were ignorant, we could say that you are running a perfect solution with a perfect system design, especially since you are saving a fair amount of money, not paying for expensive hosting solutions at your local hosting service. But stop to think for a second and try to count the single points of failure in the system's design. Running a website on a personal computer that is not a dedicated server machine has a number of single points of failure to begin with. Personal computers are not designed to run continuously, mostly due to the fact that the hardware components of a personal computer are not duplicated and the redundancy requirements for high availability are not met.
If the hard drive on your personal computer fails, the system will crash and the website will experience serious downtime. The same goes for the computer's power supply. Unexpected failure of any of these components will bring the website down for anything ranging from an hour to days. The period of the downtime depends on the availability of the replacement component and the backup solution implemented. Another major issue with the system design in the provided example is the Internet Service Provider and the connection to the World Wide Web. Your personal computer is relying only on a single source to provide its Internet service and connection. If the Internet Service Provider, for some reason, suddenly experiences huge network problems and your Internet service goes down, the website will also experience serious downtime and you will be losing visitors—and possibly money—with every minute the website is unreachable. Again, the same goes for the electricity supply. You need to provide redundant components in every possible aspect, not just hardware. Redundancy must be provided at all layers, including the networking layer, power supply layer, and also the yet unmentioned application layer.
Nowadays the majority of modern server systems eliminate hardware single points of failure by duplicating hardware components, but this solution still falls short of eliminating single points of failure in applications, which is one of the main reasons for using computer cluster implementation. Application-layer redundancy is achieved with computer clusters. A computer cluster is a group of computers running cluster software that enables continuous two-way communication, also called a heartbeat, between cluster members. A heartbeat provides cluster members with information on the exact status of any cluster member at any given time. Practically, this means that any member of the cluster knows the exact number of the members in the cluster it is joined to and also knows which cluster members are active or online, in maintenance mode, offline, and many more aspects.