When a node goes down
In a cluster of any significant size, nodes are bound to become unresponsive for a variety of reasons. Fortunately, Cassandra has a sophisticated mechanism called the failure detector that is designed to determine when this has occurred, then mark the node as down.
Most node failures result from temporary conditions, such as network issues. Therefore, Cassandra assumes the node will eventually come back online, and that permanent cluster changes will be executed explicitly using nodetool.
Marking a downed node
Each node keeps track of the state of other nodes in the cluster by means of an accrual failure detector (or phi failure detector). This detector evaluates the health of other nodes based on a sliding window of gossip message arrival times. It computes the statistical distribution of those arrival times per node, thus taking into account the current state of the network rather than using naïve thresholds or timeouts.
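To make the accrual idea concrete, here is a minimal sketch of a phi-style detector for a single peer. It is not Cassandra's implementation: the class name, method names, and the assumption that inter-arrival times follow an exponential distribution are all illustrative choices, and the window size and threshold are example values. Under that assumption, the probability that a heartbeat arrives later than t seconds is exp(-t / mean), and phi is the negative base-10 logarithm of that probability.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy accrual failure detector for one peer (illustrative, not Cassandra's code)."""

    def __init__(self, window_size=1000, threshold=8.0):
        # Sliding window of recent inter-arrival times; old samples fall off the left.
        self.intervals = deque(maxlen=window_size)
        self.threshold = threshold  # example conviction threshold (assumption)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival of a gossip message at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: grows the longer the peer stays silent."""
        if not self.intervals:
            return 0.0  # not enough history to judge yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Exponential model: P(later than elapsed) = exp(-elapsed / mean),
        # so phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

    def is_down(self, now):
        """True once the suspicion level exceeds the threshold."""
        return self.phi(now) > self.threshold
```

With heartbeats recorded once per second at times 0 through 9, `phi(10.0)` is about 0.43, so the peer is still considered up, while `phi(30.0)` is about 9.1 and `is_down(30.0)` returns `True`. Because the mean is recomputed from the window, a peer on a slower network accumulates suspicion more slowly than one whose messages normally arrive quickly, which is the advantage over a fixed timeout.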
The ultimate result of the failure detection algorithm...