Recovering a failed cluster node
Try as we might to architect and design a system that is resilient, always on, and always available, we must also plan for recovery from inevitable system failure. Cluster failures typically fall into two types: an irreparable failure, such as a hardware component failure, and a reparable failure, such as a temporary system fault, an operating system error, or a hardware failure that can be repaired. Every environment is different, and some are vastly more complex than others, so your mileage and requirements may vary.
For a general approach to recovering from failures that applies to most workloads, follow these high-level steps:
- Identify the failed node and validate that the cluster roles have been moved to another available node.
- Locate the failed node, then pause and evict the node from the failover cluster configuration from a currently...