Troubleshooting Kubernetes
Troubleshooting Kubernetes involves diagnosing and resolving issues that affect the functionality and stability of your cluster and applications. Common errors may include problems with Pod scheduling, container crashes, image pull issues, networking issues, or resource constraints. Identifying and addressing these errors efficiently is crucial for maintaining a healthy Kubernetes environment.
In the upcoming sections, we’ll cover the essential skills you need to get started with Kubernetes troubleshooting.
Getting details about resources
When troubleshooting issues in Kubernetes, the kubectl get and kubectl describe commands are indispensable tools for diagnosing and understanding the state of resources within your cluster. You have already used these commands multiple times in the previous chapters; let us revisit them here.
The kubectl get command provides a high-level overview of various resources in your cluster, such as pods, services, deployments, and nodes. For instance, if you suspect that a pod is not running as expected, you can use kubectl get pods to list all pods and their current statuses. This command will show you whether pods are running, pending, or encountering errors, helping you quickly identify potential issues.
On the other hand, kubectl describe dives deeper into the details of a specific resource. This command provides a comprehensive description of a resource, including its configuration, events, and recent changes. For example, if a Pod from the previous command is failing, you can use kubectl describe pod todo-app to get detailed information about why it might be failing.
This output includes the Pod’s events, such as failed container startup attempts or issues with pulling images. It also displays detailed configuration data, such as resource limits and environment variables, which can help pinpoint misconfigurations or other issues.
To illustrate, suppose you’re troubleshooting a deployment issue. Using kubectl get deployments can show you the deployment’s status and number of replicas. If a deployment is stuck or not updating correctly, kubectl describe deployment webapp will provide detailed information about the deployment’s conditions, its ReplicaSets, and the events recorded during updates.
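As a rough sketch of how the two commands can be combined, the following shell helpers narrow a pod listing to the pods that are not in the Running phase and then print only the Events section of a describe output. The function names are hypothetical and are not part of the book's example files; the sketch assumes kubectl is already configured for your cluster.

```shell
# Hypothetical helper: list pods in a namespace whose phase is not Running.
# Completed (Succeeded) pods are listed too, because the filter is on the phase.
not_running_pods() {
  # $1 = namespace
  kubectl get pods --namespace "$1" --field-selector='status.phase!=Running' -o wide
}

# Hypothetical helper: print only the Events section of a pod description.
pod_events() {
  # $1 = namespace, $2 = pod name
  kubectl describe pod "$2" --namespace "$1" | sed -n '/^Events:/,$p'
}
```

Calling not_running_pods trouble-ns and then pod_events trouble-ns <pod_name> for each suspicious pod is one way to move from the overview to the details.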
In the next section, we will learn the main ways to find logs and events in Kubernetes to make troubleshooting easier.
Kubernetes Logs and Events for troubleshooting
Kubernetes offers powerful tools like Events and Audit Logs to monitor and secure your cluster effectively. Events, which are namespaced resources of the Event kind, provide a real-time overview of key actions, such as pod scheduling, container restarts, and errors. These events help in diagnosing issues quickly and understanding the state of your cluster. You can view events using the kubectl get events command:
$ kubectl get events
This command outputs a timeline of events, helping you identify and troubleshoot problems. To focus on specific events, you can filter them by resource type, namespace, or time period. For example, to view events related to a specific pod, you can use the following:
$ kubectl get events --field-selector involvedObject.name=todo-pod
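The filters above can be combined with sorting. The following is a minimal sketch with hypothetical function names: one helper shows only Warning events in a namespace, and the other shows the events of a single pod, both ordered by creation time.

```shell
# Hypothetical helper: show only Warning events in a namespace, oldest first.
warning_events() {
  # $1 = namespace
  kubectl get events --namespace "$1" \
    --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp
}

# Hypothetical helper: show the events recorded for one pod, oldest first.
pod_event_timeline() {
  # $1 = namespace, $2 = pod name
  kubectl get events --namespace "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.metadata.creationTimestamp
}
```

Sorting by creation time is a design choice for readability; it makes the output read as a timeline, which the default ordering does not guarantee.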
Audit Logs, which are configured through an audit policy (the Policy kind), are vital for ensuring compliance and security within your Kubernetes environment. These logs capture detailed records of API requests made to the Kubernetes API server, including the user, action performed, and outcome. This information is crucial for auditing activities like login attempts or privilege escalations. To enable audit logging, you need to configure the API server with an audit policy. Refer to the Auditing documentation (https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/) to learn more.
When debugging Kubernetes applications, the kubectl logs command is an essential tool for retrieving and analyzing logs from specific containers within a pod. This helps in diagnosing and troubleshooting issues effectively.
To fetch logs from a pod, the basic command is as follows:
$ kubectl logs todo-app
This retrieves logs from the first container in the pod. If the pod contains multiple containers, specify the container name:
$ kubectl logs todo-app -c app-container
For real-time log streaming, akin to tail -f in Linux, use the -f flag:
$ kubectl logs -f todo-app
This is useful for monitoring live processes. If a pod has restarted, you can access logs from its previous instance using the following:
$ kubectl logs todo-app --previous
To filter logs based on labels, combine kubectl with tools like jq:
$ kubectl get pods -l todo -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl logs {}
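As an alternative sketch, kubectl logs also accepts a label selector directly, so the same result can often be reached without jq and xargs. The function name and the app=todo label in the usage note are assumptions made for illustration; use whichever labels your pods actually carry.

```shell
# Hypothetical helper: fetch recent logs from every pod matching a label selector.
logs_by_label() {
  # $1 = label selector (for example app=todo)
  # $2 = how far back to look (for example 1h)
  # --prefix adds the pod and container name to each line, which helps
  # when the output of several pods is interleaved.
  kubectl logs --selector "$1" --all-containers --prefix --since="$2" --tail=50
}
```

For instance, logs_by_label app=todo 1h would print up to 50 recent lines per container from the last hour for all matching pods.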
To effectively manage logs in Kubernetes, it’s crucial to implement log rotation to prevent excessive disk usage, ensuring that old logs are archived or deleted as new ones are generated. Utilizing structured logging, such as JSON format, makes it easier to parse and analyze logs using tools like jq.
Additionally, setting up a centralized logging system, like the Elasticsearch, Fluentd, Kibana (EFK) stack, allows you to aggregate and efficiently search logs across your entire Kubernetes cluster, providing a comprehensive view of your application’s behavior.
Together, Kubernetes Events and Audit Logs provide comprehensive monitoring and security capabilities. Events offer insights into the state and behavior of your applications, while Audit Logs ensure that all actions within the cluster are tracked, helping you maintain a secure and compliant environment.
kubectl explain – the inline helper
The kubectl explain command is a powerful tool in Kubernetes that helps you understand the structure and fields of Kubernetes resources. By providing detailed information about a specific resource type, it allows you to explore the API schema directly from the command line. This is especially useful when writing or debugging YAML manifests, as it helps ensure that you’re using the correct fields and structure.
For example, to learn about the Pod resource, you can use the following command:
$ kubectl explain pod
This command will display a high-level overview of the Pod resource, including a brief description. To dive deeper into specific fields, such as the spec field, you can extend the command like this:
$ kubectl explain pod.spec
This will provide a detailed explanation of the spec field, including its nested fields and the expected data types, helping you better understand how to configure your Kubernetes resources properly.
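When you only remember part of a field name, the --recursive flag of kubectl explain can help, because it prints the whole field tree below a path. The following is a small sketch with a hypothetical helper name that searches that tree for a keyword.

```shell
# Hypothetical helper: search the field tree below a resource path for a keyword.
find_field() {
  # $1 = resource path (for example pod.spec.containers)
  # $2 = case-insensitive keyword to look for
  kubectl explain "$1" --recursive | grep -i -- "$2"
}
```

For example, find_field pod.spec.containers probe would list the probe-related fields of a container, which you can then inspect one by one with kubectl explain.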
Interactive troubleshooting using kubectl exec
Using kubectl exec is a powerful way to troubleshoot and interact with your running containers in Kubernetes. This command allows you to execute commands directly inside a container, making it invaluable for debugging, inspecting the container’s environment, and performing quick fixes. Whether you need to check logs, inspect configuration files, or even diagnose network issues, kubectl exec provides a direct way to interact with your applications in real time.
To use kubectl exec, you can start with a simple command execution inside the container (you may use kubectl apply -f trouble/blog-portal.yaml for testing):
$ kubectl get po -n trouble-ns
NAME READY STATUS RESTARTS AGE
blog-675df44d5-gkrt2 1/1 Running 0 29m
For example, to list the environment variables of a container, you can use the following:
$ kubectl exec blog-675df44d5-gkrt2 -n trouble-ns -- env
If the pod has multiple containers, you can specify which one to interact with using the -c flag:
$ kubectl exec blog-675df44d5-gkrt2 -n trouble-ns -c blog -- env
One of the most common uses of kubectl exec is to open an interactive shell session within a container. This allows you to run diagnostic commands on the fly, such as inspecting log files or modifying configuration files. You can start an interactive shell (/bin/sh, /bin/bash, etc.), as demonstrated here:
$ kubectl exec -it blog-675df44d5-gkrt2 -n trouble-ns -- /bin/bash
root@blog-675df44d5-gkrt2:/app# whoami;hostname;uptime
root
blog-675df44d5-gkrt2
14:36:03 up 10:19, 0 user, load average: 0.17, 0.07, 0.69
root@blog-675df44d5-gkrt2:/app#
Here, the following applies:
-i: This keeps stdin open, making the session interactive.
-t: This allocates a pseudo-TTY.
This interactive session is particularly useful when you need to explore the container’s environment or troubleshoot issues that require running multiple commands in sequence.
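When the same checks need to be repeated across several pods, the commands can also be passed to a shell non-interactively in a single kubectl exec call. The sketch below uses a hypothetical helper name and assumes that the container image ships sh together with the whoami, hostname, and uptime utilities.

```shell
# Hypothetical helper: run several diagnostic commands in one non-interactive exec.
quick_check() {
  # $1 = namespace, $2 = pod name
  # The quoted string is interpreted by sh inside the container, not locally.
  kubectl exec "$2" --namespace "$1" -- sh -c 'whoami; hostname; uptime'
}
```

Calling quick_check trouble-ns blog-675df44d5-gkrt2 would run the same three commands as the interactive session shown earlier, without opening a shell prompt.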
In addition to executing commands with kubectl exec, you can copy files to and from containers using the separate kubectl cp command (which requires the tar binary to be present in the container image). This can be particularly handy when you need to bring in a script or retrieve a log file for further analysis. For instance, here’s how to copy a file from your local machine into a container:
$ kubectl cp troubles/test.txt blog-675df44d5-gkrt2:/app/test.txt -n trouble-ns
$ kubectl exec -it blog-675df44d5-gkrt2 -n trouble-ns -- ls -l /app
total 8
-rw-r--r-- 1 root root 902 Aug 20 16:52 app.py
-rw-r--r-- 1 1000 1000 20 Aug 31 14:42 test.txt
And to copy a file from a container to your local machine, you’d need the following:
$ kubectl cp blog-675df44d5-gkrt2:/app/app.py /tmp/app.py -n trouble-ns
This capability simplifies the process of transferring files between your local environment and the containers running in your Kubernetes cluster, making troubleshooting and debugging more efficient.
In the next section, we will learn about ephemeral containers, which are very useful in Kubernetes troubleshooting tasks.
Ephemeral Containers in Kubernetes
Ephemeral containers are a special type of container in Kubernetes designed for temporary, on-the-fly tasks like debugging. Unlike regular containers, which are intended for long-term use within Pods, ephemeral containers are used for inspection and troubleshooting and are not automatically restarted or guaranteed to have specific resources.
These containers can be added to an existing Pod to help diagnose issues, making them especially useful when traditional methods like kubectl exec fall short. For example, if a Pod is running a distroless image with no debugging tools, an ephemeral container can be introduced to provide a shell and other utilities (e.g., nslookup, curl, the mysql client, etc.) for inspection. Ephemeral containers are managed via a specific API handler and can’t be added through kubectl edit or modified once set.
For example, in Chapter 8, Exposing Your Pods with Services, we used k8sutils (quay.io/iamgini/k8sutils:debian12) as a separate Pod to test the services and other tasks. With ephemeral containers, we can use the same container image but insert the container inside the application Pod to troubleshoot.
Assume we have the Pod and Service called video-service running in the ingress-demo namespace (refer to the ingress/video-portal.yaml file for deployment details). It is possible to start debugging utilizing the k8sutils container image as follows:
$ kubectl debug -it pod/video-7d945d8c9f-wkxc5 --image=quay.io/iamgini/k8sutils:debian12 -c k8sutils -n ingress-demo
root@video-7d945d8c9f-wkxc5:/# nslookup video-service
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: video-service.ingress-demo.svc.cluster.local
Address: 10.109.3.177
root@video-7d945d8c9f-wkxc5:/# curl http://video-service:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome</title>
<style>
body {
background-color: yellow;
text-align: center;
...<removed for brevity>...
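Two related operations can be sketched as shell helpers: attaching a debug container that targets one specific container of the Pod (using the --target flag of kubectl debug, which depends on support from the container runtime), and listing the ephemeral containers that have already been added. The function names are hypothetical, and the target container name is something you need to look up in your own Pod spec.

```shell
# Hypothetical helper: attach a debug container that targets the process
# namespace of one container in the pod.
debug_target() {
  # $1 = namespace, $2 = pod name, $3 = target container, $4 = debug image
  kubectl debug -it "pod/$2" --namespace "$1" --image="$4" --target="$3"
}

# Hypothetical helper: list the ephemeral containers already added to a pod.
list_ephemeral() {
  # $1 = namespace, $2 = pod name
  kubectl get pod "$2" --namespace "$1" \
    -o jsonpath='{.spec.ephemeralContainers[*].name}'
}
```

Because ephemeral containers cannot be removed from a running Pod, listing them first is a simple way to check whether a debug container from an earlier session is still present.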
In summary, ephemeral containers offer a flexible way to investigate running Pods without altering the existing setup or relying on the base container’s limitations.
In the following section, we will demonstrate some of the common Kubernetes troubleshooting tasks and methods.
Common troubleshooting tasks in Kubernetes
Troubleshooting Kubernetes can be complex and highly specific to your cluster setup and operations, and the list of potential issues is extensive. Rather than attempting to cover everything, let’s focus on some of the most common Kubernetes problems and their troubleshooting methods to provide a practical starting point:
- Pods are in Pending state: The Pending status indicates that the pod is waiting to be scheduled onto a node. This can be caused by insufficient resources or misconfigurations. To troubleshoot, use kubectl describe pod <pod_name> to check for events that describe why the pod is pending, such as resource constraints or node conditions. If the cluster doesn’t have enough resources, the pod will remain in the pending state. You can adjust resource requests or add more nodes. (Try using troubles/app-with-high-resource.yaml to test this.)
- CrashLoopBackOff or container errors: The CrashLoopBackOff status occurs when a container repeatedly crashes or fails to start, possibly due to misconfigurations, missing files, or application errors. To troubleshoot, view the logs using kubectl logs <pod_name> and check the events and exit code using kubectl describe pod <pod_name> to identify the cause. Look for error messages or stack traces that can help diagnose the problem. If a container has an incorrect startup command, it will fail to start, leading to this error. Reviewing the container’s exit code and logs will help fix any issues. (Apply troubles/failing-pod.yaml and test this scenario.)
- Networking issues: These errors often indicate that network policies are blocking traffic to or from the pod. To troubleshoot, check the pod’s labels using kubectl describe pod <pod_name>, list the network policies in the namespace using kubectl get networkpolicy, and verify the Service with kubectl get svc. If network policies are too restrictive, necessary traffic might be blocked. For example, a policy that selects a pod but defines no ingress rules blocks all incoming traffic to that pod, and adjusting the policy will allow the required services to communicate. (Use troubles/networkpolicy.yaml to test this scenario.)
- Node not ready or unreachable: The NotReady status indicates that a node is not in a ready state due to conditions like network issues. To troubleshoot, check the node status with kubectl get nodes and kubectl describe node <node_name>. Scheduling problems on a node may also be caused by taints: if a node has a taint with the NoSchedule effect, it won’t accept new pods that do not tolerate the taint until the issue is resolved or the taint is removed.
- Storage issues: The Pending status on a PersistentVolumeClaim (PVC) occurs when the claim is waiting for a matching persistent volume (PV) to be bound. To troubleshoot, check the status of PVs and PVCs with kubectl get pv and kubectl get pvc. For CSI, ensure the storageClass is configured properly and requested in the PVC definition accordingly. (Check troubles/pvc.yaml to explore this scenario.)
- Service unavailability: The Service Unavailable error means that a service is not accessible, potentially due to misconfigurations or networking issues. To troubleshoot, check the service details using kubectl describe svc <service_name>. Verify that the service is correctly configured and that its selector matches the labels of the intended pods. If the service is misconfigured, it may not route traffic to the intended endpoints, leading to unavailability. You can verify the Service endpoints (Pods) in the Endpoints field of the same command’s output.
- API server or control plane issues: These errors typically point to connectivity problems with the API server, often due to issues within the control plane or network. Since kubectl commands won’t work if the API server is down, you need to log in directly to the control plane server where the API server pods are running. Once logged in, you can check the status of the control plane components using commands like crictl ps (if you are using containerd) or docker ps (if you are using Docker) to ensure the API server Pod is up and running. Additionally, review logs and check the network connections to verify that all control plane components are functioning correctly.
- Authentication and authorization problems: The Unauthorized error indicates issues with user permissions or authentication. To troubleshoot, verify user permissions with kubectl auth can-i <verb> <resource>. For example, if a user lacks the required role or role binding, they will encounter authorization errors. Adjust roles and role bindings as needed to grant the necessary permissions.
- Resource exhaustion: The ResourceQuota Exceeded error occurs when a resource quota is exceeded, preventing the allocation of additional resources. To troubleshoot and monitor resource usage, use kubectl get quota, kubectl top nodes, and kubectl top pods. If a quota is too low, it may block new resource allocations. Adjusting resource quotas or reducing resource usage can alleviate this issue.
- Ingress or load balancer issues: The IngressController Failed error suggests that the ingress controller is not functioning correctly, impacting traffic routing. To troubleshoot, check the Ingress details using kubectl describe ingress <ingress_name>. Ensure that the ingress controller is properly installed and configured and that ingress rules correctly map to services. Misconfigurations in ingress rules can prevent proper traffic routing. Also, ensure the hostname DNS resolution is in place if you are using the optional host field in the Ingress configuration.
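Part of the list above can be condensed into a small lookup that maps a symptom to a first command to try. This is only an illustrative sketch: the function name and the symptom keywords (such as PVCPending and QuotaExceeded) are hypothetical labels chosen for this example, not values reported by Kubernetes, and the fallback is simply to look at recent events.

```shell
# Hypothetical helper: suggest a first troubleshooting command for a symptom.
first_step() {
  # $1 = symptom keyword (labels chosen for this sketch)
  case "$1" in
    Pending)          echo 'kubectl describe pod <pod_name>' ;;
    CrashLoopBackOff) echo 'kubectl logs <pod_name> --previous' ;;
    NotReady)         echo 'kubectl describe node <node_name>' ;;
    PVCPending)       echo 'kubectl get pvc' ;;
    Unauthorized)     echo 'kubectl auth can-i <verb> <resource>' ;;
    QuotaExceeded)    echo 'kubectl get quota' ;;
    *)                echo 'kubectl get events' ;;
  esac
}
```

For example, first_step CrashLoopBackOff prints kubectl logs <pod_name> --previous, pointing you at the logs of the container instance that crashed.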
This was the last practical demonstration in this book, so let’s now summarize what you have learned.