Hardware failures in Kubernetes can be divided into two groups:
- The node is unresponsive
- The node is responsive
When the node is not responsive, it can be difficult to determine whether the cause is a networking issue, a configuration issue, or an actual hardware failure. You can't read logs or run diagnostics on the node itself. What can you do? First, consider whether the node was ever responsive. If it was just added to the cluster, a configuration issue is more likely. If it was already part of the cluster, you can look at its historical data in Heapster or central logging and check for errors in the logs or performance degradation that may indicate failing hardware.
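The triage above can be sketched as a small Python helper. This is only an illustration under stated assumptions: it takes a node object in the shape returned by the Kubernetes API (a `Ready` entry in `status.conditions` and a `metadata.creationTimestamp`), and it uses the node's age as a rough stand-in for "was it ever responsive". The `NEW_NODE_WINDOW` threshold and the `triage` function are hypothetical names, not part of any Kubernetes tooling.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: nodes younger than this count as "just added".
NEW_NODE_WINDOW = timedelta(hours=1)


def parse_time(ts):
    """Parse a Kubernetes timestamp such as 2024-01-01T00:00:00Z."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def ready_condition(node):
    """Return the node's Ready condition dict, or None if it is missing."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def triage(node, now):
    """Suggest a first line of investigation for one node object."""
    cond = ready_condition(node)
    if cond is not None and cond.get("status") == "True":
        return "responsive"
    # Assumption: a very young node that is not Ready was probably never
    # responsive, which points to configuration rather than hardware.
    created = parse_time(node["metadata"]["creationTimestamp"])
    if now - created < NEW_NODE_WINDOW:
        return "check configuration"
    # An established node went quiet: review its historical data.
    return "check historical metrics and logs"
```

One way to feed it is to load the output of `kubectl get nodes -o json` with `json.load` and call `triage(item, datetime.now(timezone.utc))` for each entry in `items`. The result is only a hint about where to look first; it does not diagnose the failure.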
When the node is responsive, it may still suffer from the failure of redundant hardware, such as non...