A computer cluster is a group of computers joined to work together to provide high availability of one or more services. These services are usually built from a number of applications operating in the so-called application layer. As shown in the example from the previous section, single points of failure are spread across all layers of a system design, and the application layer is one of the critical layers. It is common for an application to encounter an error, or bug, and stop responding or crash. Such a situation leads to service downtime and probably financial loss as well, so redundancy is also necessary on the application layer. This is the reason we need to include a computer cluster solution in our high-availability system design.
A computer cluster consists of two or more computers connected to a local area network. The maximum number of computers in a cluster is limited by the cluster software solution implemented but, in general, common cluster solutions support at least 16 cluster members. It is good practice for the cluster members to have the same hardware, which means that the cluster computers are built from components from the same manufacturer and, ideally, have the same resource specifications.
There are two common types of computer clusters:
- Load balancing computer clusters
- High-availability computer clusters
Load balancing computer clusters are used to provide higher performance of services, and are typically used in science for complex measurements and calculations. The same kind of cluster is also used for websites and web servers facing extremely high load, where distributing the load across different cluster nodes improves the overall response time of the website.
High-availability computer clusters strive to minimize the downtime of the service provided rather than to improve its overall performance. The focus of this book is on high-availability clusters. There are many different cluster configurations, depending mainly on the number of cluster members and on the level of availability you want to achieve.
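The idea of distributing load across cluster nodes can be illustrated with a minimal round-robin sketch. The node names and the `distribute` function are hypothetical and chosen only for illustration; real load balancers also weigh node health and current load.

```python
from itertools import cycle


def distribute(requests, nodes):
    """Assign each request to the next node in round-robin order."""
    assignment = {node: [] for node in nodes}
    node_cycle = cycle(nodes)  # endlessly repeats node1, node2, node3, node1, ...
    for request in requests:
        assignment[next(node_cycle)].append(request)
    return assignment


if __name__ == "__main__":
    # Seven requests spread over three hypothetical nodes.
    for node, handled in distribute(range(7), ["node1", "node2", "node3"]).items():
        print(node, handled)
```

Round-robin is only one possible policy; it is used here because it is the simplest way to show requests being spread evenly over the nodes.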
Some of the different cluster configurations are as follows:
- Active/Active: The Active/Active cluster configuration can be used with two or more cluster members. The service provided by the cluster is simultaneously active on all cluster nodes at any given time. The traffic can be passed to any of the existing cluster nodes if a suitable load balancing solution has been implemented. If no load balancing solution has been implemented, the Active/Active configuration can be used to reduce the time it takes to fail over applications and services from the failed cluster node.
- Active/Passive: The Active/Passive cluster configuration can be used with two or more cluster members. At any given time, the service is provided only by the current master cluster node. If the master node fails, automatic reconfiguration of the cluster is triggered and the traffic is switched to one of the remaining operational cluster nodes.
- N + 1: The N + 1 cluster configuration can be used with two or more cluster members. If only two cluster members are available, the configuration degenerates to the Active/Passive configuration. The N + 1 configuration implies the presence of N cluster members in an Active/Active configuration with one cluster member in backup or hot standby. The standby cluster member is ready to take over the responsibilities of any failed cluster node at any given time.
- N + M: The N + M cluster configuration can only be used with more than two cluster members. This configuration extends the N + 1 cluster configuration: N cluster members are in the Active/Active state and M cluster members are in backup or hot standby mode. It is often used in situations where the active cluster members manage many services and two or more backup cluster members are required to fulfill the cluster failover requirements.
- N-to-1: The N-to-1 cluster configuration is similar to the N + 1 configuration and can be used with two or more cluster members. If there are only two cluster nodes, this configuration degenerates to Active/Passive. In the N-to-1 configuration, the backup or hot standby cluster member becomes active only temporarily, while the failed cluster node is being recovered. When the failed cluster node is recovered, services are failed back to the original cluster node.
- N-to-N: The N-to-N cluster configuration is similar to the N + M configuration and can be used with more than two cluster nodes. This configuration extends the N-to-1 configuration and is used in situations where extra redundancy is required on all active nodes.
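The failover behavior described for the standby-based configurations can be sketched as a toy model. This is not how any particular cluster software works; the `Cluster` class, its method names, and the node and service names are hypothetical and exist only to illustrate the idea of a standby node taking over services from a failed node.

```python
class Cluster:
    """Toy model of an N + M cluster: active nodes run services, standbys wait."""

    def __init__(self, active, standby):
        # Map each active node to the list of services it currently runs.
        self.services = {node: list(svcs) for node, svcs in active.items()}
        # Standby nodes stay idle until an active node fails.
        self.standby = list(standby)

    def fail(self, node):
        """Move the failed node's services to the first free standby node."""
        if node not in self.services:
            raise KeyError(f"{node} is not an active node")
        if not self.standby:
            raise RuntimeError(f"no standby node left to take over for {node}")
        replacement = self.standby.pop(0)
        self.services[replacement] = self.services.pop(node)
        return replacement


if __name__ == "__main__":
    # Two active nodes and one standby: an N + 1 layout with N = 2.
    cluster = Cluster({"node1": ["web"], "node2": ["db"]}, ["node3"])
    print(cluster.fail("node1"), cluster.services)
```

With one active node and one standby, the same model behaves like Active/Passive. Adding more standby nodes gives the N + M layout, in which several failures can be absorbed before the cluster runs out of spare capacity.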
The objective of a high-availability computer cluster is to provide uninterrupted and continuous availability of the service provided by the applications running in the cluster. There are many applications available that provide all sorts of services, but not every application can be managed and configured to work in a cluster. You need to make sure that the applications you choose to run in your computer cluster are easy to start, stop, and monitor. This allows the cluster software to check the status of an application quickly and to start or stop it on demand.
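To make the start, stop, and monitor requirement concrete, here is a minimal sketch of a wrapper that exposes those three operations for an application run as a child process. The `ManagedApplication` class is hypothetical and is not the interface of any real cluster software; it only shows why an application with simple, scriptable lifecycle operations is easy for a cluster to manage.

```python
import subprocess
import sys


class ManagedApplication:
    """Hypothetical wrapper exposing start, stop, and monitor for one process."""

    def __init__(self, command):
        self.command = command  # argument list, as accepted by subprocess.Popen
        self.process = None

    def monitor(self):
        """Return True while the application process is running."""
        return self.process is not None and self.process.poll() is None

    def start(self):
        """Start the application unless it is already running."""
        if not self.monitor():
            self.process = subprocess.Popen(self.command)

    def stop(self):
        """Ask the application to terminate; kill it if it does not exit in time."""
        if self.monitor():
            self.process.terminate()
            try:
                self.process.wait(timeout=10)
            except subprocess.TimeoutExpired:
                self.process.kill()
                self.process.wait()
        self.process = None


if __name__ == "__main__":
    # A placeholder "application": a Python process that just sleeps.
    app = ManagedApplication([sys.executable, "-c", "import time; time.sleep(60)"])
    app.start()
    print("running:", app.monitor())
    app.stop()
    print("running:", app.monitor())
```

An application that needs manual steps to start, or that gives no reliable way to tell whether it is healthy, cannot be wrapped this simply, which is what makes it a poor fit for a cluster.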
A computer cluster cannot function properly without a shared storage solution. A cluster is of little use if its members have access only to their own local storage and data, because a node that takes over a service must see the same data as the node that failed. The cluster members must therefore have access to shared storage, which provides data consistency throughout the computer cluster. There are many shared storage solutions available, of which the following are the most commonly used in computer clusters today:
- Storage Area Network (SAN): This is a type of block-level data storage. It provides high-speed data transfers and Redundant Array of Inexpensive Disks (RAID) disk redundancy. The most commonly used SAN solutions are big storage racks of disk arrays with hundreds of disks and terabytes of disk space. These storage racks are connected to computers via optical fibers to achieve the highest possible performance. Computers can see the storage as locally attached storage devices and are free to create the desired file system on these devices.
- Network-attached Storage (NAS): This is a type of file-level data storage. With a NAS solution, the file system is predefined and forced on the client computers. NAS also provides RAID disk redundancy and the capability to expand to hundreds of disks and terabytes of disk space. As the name suggests, NAS storage is connected to computers through the network and provides access to data using file-sharing protocols such as Network File System (NFS), Server Message Block (SMB), and Apple Filing Protocol (AFP). Computers see NAS storage as network drives that they can map locally.
- Distributed Replicated Block Device (DRBD): This is probably the least expensive and most interesting solution. DRBD provides network-based disk mirroring. It is installed and configured on computers as a service, and those computers can use their existing local storage to mirror and replicate data through the network. DRBD is a software solution that runs on the Linux platform. Its authors submitted it for inclusion in the Linux kernel in July 2007, and it was merged in December 2009 for kernel version 2.6.33. It is probably the simplest and cheapest solution to implement in any system design, as long as that design is Linux-based.
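The idea of mirroring block-level writes to a second node can be illustrated with a small in-memory sketch. This is not DRBD's implementation and involves no networking. The `MirroredDevice` class and the block size are assumptions made for illustration, and the "peer" is simply a second buffer standing in for the other node's disk.

```python
BLOCK_SIZE = 4096  # assumed block size for this illustration


class MirroredDevice:
    """Toy mirror: every block write is applied to a local and a peer copy."""

    def __init__(self, blocks):
        self.local = bytearray(blocks * BLOCK_SIZE)
        # Stands in for the disk of the second cluster node.
        self.peer = bytearray(blocks * BLOCK_SIZE)

    def _offset(self, index):
        start = index * BLOCK_SIZE
        if index < 0 or start + BLOCK_SIZE > len(self.local):
            raise IndexError(f"block {index} is out of range")
        return start

    def write_block(self, index, data):
        """Write one full block to both copies before returning."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block long")
        start = self._offset(index)
        self.local[start:start + BLOCK_SIZE] = data
        # In a real mirror this write would travel over the network.
        self.peer[start:start + BLOCK_SIZE] = data

    def read_block(self, index, from_peer=False):
        """Read one block from the local copy, or from the peer copy."""
        start = self._offset(index)
        source = self.peer if from_peer else self.local
        return bytes(source[start:start + BLOCK_SIZE])


if __name__ == "__main__":
    device = MirroredDevice(blocks=4)
    device.write_block(2, b"x" * BLOCK_SIZE)
    print(device.read_block(2) == device.read_block(2, from_peer=True))
```

Because both copies receive every write, either node can serve the data if the other fails, which is the property a cluster needs from its storage.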