Clustering in a Microsoft environment
In computing, the generic term clustering refers to any method that groups multiple computers together to provide a particular service. The usual reasons for doing so are high availability, distribution of resources, or both. For the purposes of this book, clustering leverages multiple physical computers to provide a hosting service for virtual machines. All of this is transparent to the consumers, both technological and human; the machines themselves and the clients that rely on them operate as though the cluster and virtualization components did not exist. Users employ the exposed services exactly as they would if those services were installed directly on a traditional physical deployment. An example of a user accessing a website hosted on a virtual machine is shown in the following image:
If you're coming to Hyper-V Server clustering with experience in another hypervisor technology, there are substantial differences right from the start. Chief among these is that a Hyper-V Server cluster is composed of two major technologies. The first technology is, obviously, Hyper-V Server. The second technology is Microsoft Failover Clustering. There is significant interplay and cooperation between the two, but they are distinct. This duality of technologies can lead unaware users to draw false conclusions and fall into traps based on incorrect assumptions. It can also cause confusion for newcomers to Hyper-V Server and Microsoft Failover Clustering.
Microsoft clusters are always considered failover clusters. A single virtual machine is never spread across cluster nodes. All of the resources belonging to any given virtual machine are contained in or accessed through only one node at a time. The cluster handles system failures by automatically moving (failing over) virtual machines from an ailing node to others that are still running. This does not necessarily mean that the other cluster nodes sit idle until a failure occurs; they can run virtual machines of their own.
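The failover idea can be illustrated with a small model. The following Python sketch is only a toy: the function name `fail_over`, the node and virtual machine names, and the "least loaded node" placement policy are all assumptions made for illustration, and they do not describe how Microsoft Failover Clustering actually selects a target node. What the sketch does show is the rule stated above: every virtual machine is placed on exactly one node at a time, and the virtual machines of a failed node are moved to nodes that are still running.

```python
def fail_over(placement, failed_node, surviving_nodes):
    """Return a new VM -> node mapping after failed_node stops responding.

    placement maps each virtual machine name to the single node that hosts it.
    Placement policy (an assumption for this sketch): each displaced VM goes
    to the surviving node that currently hosts the fewest VMs.
    """
    if not surviving_nodes:
        raise ValueError("no surviving nodes to fail over to")

    new_placement = dict(placement)  # leave the caller's mapping untouched

    # Count how many VMs each surviving node already hosts.
    load = {node: 0 for node in surviving_nodes}
    for node in placement.values():
        if node in load:
            load[node] += 1

    # Move the failed node's VMs one at a time, in a stable order.
    displaced = sorted(vm for vm, node in placement.items() if node == failed_node)
    for vm in displaced:
        target = min(surviving_nodes, key=lambda n: (load[n], n))
        new_placement[vm] = target
        load[target] += 1

    return new_placement


# Hypothetical three-node cluster in which node1 fails.
placement = {"vm-web": "node1", "vm-sql": "node1", "vm-dc": "node2"}
after = fail_over(placement, "node1", ["node2", "node3"])
```

Note that `node2` was already running a virtual machine before the failure and simply takes on more work afterward, which matches the point that surviving nodes are not idle standbys.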
The basic process by which Microsoft Failover Clustering operates is somewhat node-centric. Each node is responsible for three basic resource types: roles, storage, and networks. A clustered role is a service being presented and protected by the cluster. Each virtual machine (and accompanying resources) is considered a role. A virtual machine and its details must be stored in a location common to all nodes; each node is responsible for maintaining connectivity to that storage. Finally, each node must have access to the same networks as the other nodes.
Because of the failover nature of the cluster, roles and storage have owners. At any given point in time, exactly one node is responsible for each individual instance of these resource types. A virtual machine's owner is the physical host it is currently running on (or, if the virtual machine is offline, the host that would be responsible for starting it). A storage location's owner is the physical host that is currently responsible for I/O to that location. A special storage type that will be discussed in much more detail later is the Cluster Shared Volume (CSV). Multiple nodes can communicate with a CSV simultaneously; however, it is still owned by only one node at a time (called the coordinator node). Networks do not have owners.
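The ownership rules can be summarized as a small data model. The class and field names below (`ClusteredRole`, `ClusterStorage`, `ClusterNetwork`, `can_do_io`) are hypothetical and exist only for this sketch; they are not part of any Microsoft API. The model also assumes that every node passed to `can_do_io` is a healthy member of the cluster.

```python
from dataclasses import dataclass


@dataclass
class ClusteredRole:
    """A virtual machine (and its accompanying resources) protected by the cluster."""
    name: str
    owner: str  # the one node running the VM, or the node that would start it


@dataclass
class ClusterStorage:
    """A storage location used by the cluster."""
    name: str
    owner: str            # for a CSV, this is the coordinator node
    is_csv: bool = False  # Cluster Shared Volume?

    def can_do_io(self, node: str) -> bool:
        # A CSV accepts I/O from several nodes at once, although it still has
        # a single owner. Other storage is used only through its owner.
        return self.is_csv or node == self.owner


@dataclass
class ClusterNetwork:
    """A cluster network. There is deliberately no owner field."""
    name: str


vm = ClusteredRole(name="vm-web", owner="node1")
disk = ClusterStorage(name="disk1", owner="node1")
csv = ClusterStorage(name="csv1", owner="node1", is_csv=True)
net = ClusterNetwork(name="management")
```

With these objects, `disk.can_do_io("node2")` is `False` because `node2` is not the owner, while `csv.can_do_io("node2")` is `True` even though `node1` remains the coordinator node.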
A failure does not always automatically result in a failover event. If a node has difficulty accessing storage or networks, there are various mitigation strategies it can apply. If the node loses its connectivity to a CSV, it can reroute I/O through the coordinator node. If the node is itself the coordinator node, it can transfer ownership to another node that can still reach the CSV. If a node loses connectivity on one cluster network but can still use others, it may be able to use those for cluster-related traffic.
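The CSV mitigations just described amount to a short decision sequence. The sketch below is a simplified model under stated assumptions: the function name `csv_mitigation`, the action labels, and the rule of picking the alphabetically first candidate as the new coordinator are all invented for illustration and are not taken from the product.

```python
def csv_mitigation(node, coordinator, nodes_with_direct_access):
    """Choose how `node` should respond with respect to one CSV.

    Returns a tuple of (action, node_involved). The action names are
    hypothetical labels used only in this sketch.
    """
    reachable = set(nodes_with_direct_access)

    # No problem: the node still reaches the CSV itself.
    if node in reachable:
        return ("direct_io", node)

    # The node lost its own path, but the coordinator can still reach the CSV:
    # send I/O through the coordinator instead.
    if node != coordinator and coordinator in reachable:
        return ("redirect_io", coordinator)

    # The node is the coordinator and lost its path: hand ownership to a node
    # that can still reach the CSV (first by name, an arbitrary choice).
    candidates = sorted(reachable - {node})
    if node == coordinator and candidates:
        return ("transfer_ownership", candidates[0])

    # Nothing in this simplified model helps.
    return ("no_path", None)


healthy = csv_mitigation("node1", "node2", {"node1", "node2"})
rerouted = csv_mitigation("node1", "node2", {"node2", "node3"})
handed_over = csv_mitigation("node2", "node2", {"node1", "node3"})
```

In each of these three cases the virtual machines keep running where they are; none of the outcomes is a failover of a role.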
If a failure occurs that requires a node to stop participating in the cluster, several things happen. First, if the node is still functional but detects a problem, it determines whether or not it can continue participating in the cluster. The primary failure that triggers this condition is loss of communications with the other nodes. If a node can no longer communicate with enough other nodes to maintain quorum (a concept that will be thoroughly discussed in Chapter 11, High Availability), it attempts to gracefully shut down its virtual machines so that their files can be accessed by other nodes. Ordinarily, quorum requires that 50 percent of the nodes plus one tiebreaker be active; in other words, a majority of the voting members must be present. The nodes that still have quorum may not be aware of why the node is missing, but they will notice that it is no longer reachable. They will begin attempting to start the virtual machines of the missing node almost immediately upon loss of connectivity. If there is no way for sufficient nodes to form a quorum, the entire cluster will stop and all clustered virtual machines will shut down.
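The quorum rule of thumb can be checked with simple arithmetic. The following sketch assumes a plain majority-vote model in which each node and the tiebreaker hold one vote each; the function name `has_quorum` is hypothetical, and the sketch does not model any finer details of the real quorum configurations covered in Chapter 11.

```python
def has_quorum(total_votes, reachable_votes):
    """Return True if the reachable voters form a strict majority.

    total_votes counts every voter configured in the cluster (nodes plus a
    tiebreaker, if there is one). reachable_votes counts the voters that one
    group of nodes can currently communicate with, including themselves.
    """
    if not 0 <= reachable_votes <= total_votes:
        raise ValueError("reachable_votes must be between 0 and total_votes")
    return 2 * reachable_votes > total_votes


# Four nodes plus one tiebreaker gives five votes in total.
# The cluster splits into two groups of two nodes each.
side_with_tiebreaker = has_quorum(total_votes=5, reachable_votes=3)
side_without_tiebreaker = has_quorum(total_votes=5, reachable_votes=2)
```

Here the two nodes that can still reach the tiebreaker hold three of five votes and keep running, while the other two nodes hold only two votes and must stop participating. Without the tiebreaker, a two-against-two split of a four-node cluster would leave neither side with a majority, which is why the extra vote matters.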