Troubleshooting Kubernetes clusters
Because Kubernetes is a distributed system designed to tolerate failures on the Nodes where applications run, most (but not all) cluster issues tend to center on the control plane and the API. In most scenarios, a failed worker Node simply results in its Pods being rescheduled to another Node, though compounding factors can introduce issues.
To walk through common Kubernetes cluster issue scenarios, we will use a case study approach. This should give you the tools you need to investigate real-world cluster issues. Our first case study centers on the failure of the API server itself.
Important note
For the purposes of this tutorial, we will assume a self-managed cluster. Managed Kubernetes services such as EKS, AKS, and GKE generally remove some of the failure domains (by autoscaling and managing master Nodes, for instance). A good rule is to check your managed service documentation first, as any issues may be...