
Kubernetes Troubleshooting
- Author: Ram Simran G
- Twitter: @rgarimella0124
As Kubernetes environments grow in complexity, having a systematic approach to troubleshooting becomes essential. This comprehensive guide catalogs every critical kubectl command you’ll need to diagnose and resolve issues in your Kubernetes clusters, organized by functional area.
Introduction
When production issues arise in Kubernetes environments, time is critical. This guide serves as a complete reference for DevOps engineers, platform teams, and SREs who need to quickly identify and resolve problems in their Kubernetes clusters. We’ll cover everything from basic cluster health checks to advanced debugging techniques, with practical examples for each command.
Checking Cluster Status
Begin your troubleshooting journey by assessing the overall health of your cluster:
# Show cluster details and endpoints
kubectl cluster-info
# List all nodes in the cluster
kubectl get nodes
# Show nodes with additional details
kubectl get nodes -o wide
# Get detailed information about a node
kubectl describe node nodename
# Show CPU & memory usage of nodes
kubectl top node
The cluster-info command provides a quick overview of your control plane and core services. When nodes report "NotReady" status, use describe node to investigate the conditions section for clues. The top node command helps identify resource constraints that might affect pod scheduling.
Practical Example:
$ kubectl top node
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-node1   856m         42%    7892Mi          52%
worker-node2   1532m        76%    13948Mi         93%    # High memory usage detected
worker-node3   215m         10%    2957Mi          19%
In this example, worker-node2 shows high memory utilization (93%), which could cause pod evictions or prevent new pods from being scheduled.
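Readings like these can also be triaged programmatically. The sketch below parses the sample output above (hard-coded for illustration; on a live cluster you would pipe the real command into the same awk filter) and flags any node above a memory threshold:

```shell
# Sample `kubectl top node` output; on a live cluster replace the
# variable with: kubectl top node | awk 'NR>1 { ... }'
top_output='NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-node1   856m         42%    7892Mi          52%
worker-node2   1532m        76%    13948Mi         93%
worker-node3   215m         10%    2957Mi          19%'

# Print the names of nodes whose memory usage exceeds 80%
hot_nodes=$(echo "$top_output" | awk 'NR>1 { gsub(/%/, "", $5); if ($5+0 > 80) print $1 }')
echo "$hot_nodes"   # → worker-node2
```

The same pattern (strip the `%`, compare numerically) works for the CPU% column by switching `$5` to `$3`.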
Checking Pod Status
Once you’ve verified cluster health, examine your application pods:
# List all pods in the current namespace
kubectl get pods
# List all pods in all namespaces
kubectl get pods -A
# Show pods with node assignments
kubectl get pods -o wide
# Get detailed information about a pod
kubectl describe pod podname
# Show CPU & memory usage of pods
kubectl top pod
# Get full YAML definition of a pod
kubectl get pod podname -o yaml
# Show recent events for troubleshooting
kubectl get events --sort-by=.metadata.creationTimestamp
The get pods command quickly surfaces problematic pods by status (CrashLoopBackOff, Error, Pending). The -o wide flag is invaluable because it shows node assignments, which helps correlate pod issues with potential node problems. The --sort-by flag orders events chronologically, so the most recent events appear at the bottom of the output.
Pod Status Troubleshooting Matrix:
| Status | Common Causes | First Commands to Run |
|---|---|---|
| Pending | Resource constraints, PVC binding issues | kubectl describe pod, kubectl get events |
| CrashLoopBackOff | Application errors, liveness probe failures | kubectl logs, kubectl describe pod |
| Error | Image pull failures, init container errors | kubectl describe pod, kubectl get events |
| Completed | Job or cronjob ran successfully | kubectl logs to verify expected output |
| Running but unhealthy | Readiness probe failures, application issues | kubectl describe pod, kubectl logs, kubectl exec |
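The statuses in the matrix can be surfaced mechanically. Note that `--field-selector status.phase=...` only matches true pod phases (Pending, Running, Succeeded, Failed, Unknown); reasons like CrashLoopBackOff appear only in the STATUS column, so a text filter is needed for those. A sketch over captured sample output (hypothetical pod names):

```shell
# Captured `kubectl get pods` output (hypothetical pods); on a live
# cluster, pipe the command itself:
#   kubectl get pods --field-selector=status.phase=Pending   # true phases only
#   kubectl get pods | awk 'NR==1 || $3 != "Running"'        # STATUS column filter
pods='NAME        READY   STATUS             RESTARTS   AGE
web-7d4b9   0/1     CrashLoopBackOff   12         1h
api-5c6f8   1/1     Running            0          3d
job-1a2b3   0/1     Pending            0          5m'

# Keep the header plus any pod whose STATUS is not Running
unhealthy=$(echo "$pods" | awk 'NR==1 || $3 != "Running"')
echo "$unhealthy"
```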
Checking Logs
Logs are essential for diagnosing application-specific issues:
# Show logs for a pod's main container
kubectl logs podname
# Show logs for a specific container in a pod
kubectl logs podname -c containername
# Show logs from a previous crashed container
kubectl logs podname --previous
# Stream live logs from a pod
kubectl logs -f podname
# Show the last 100 log lines for all pods labeled "app=myapp"
kubectl logs -l app=myapp --tail=100
The --previous flag is particularly valuable when pods have restarted after a crash, letting you see what happened before the failure. For multi-container pods, the -c flag targets a specific container. The -f flag streams logs in real-time, which is useful for observing behavior during testing.
Advanced Log Analysis:
# Combine logs from multiple pods with the same label
kubectl logs -l app=frontend --tail=50 | grep ERROR
# Check logs within a specific time window
kubectl logs podname --since=1h
# Save logs to a file for analysis
kubectl logs deployment/api-service --all-containers=true > api-logs.txt
Troubleshooting Pods
For deeper investigation of problematic pods:
# View full YAML definition of a pod
kubectl get pod podname -o yaml
# Get detailed information about a pod
kubectl describe pod podname
# Show cluster events to check failures
kubectl get events
# Show detailed pod conditions
kubectl get pod podname -o jsonpath='{.status.conditions[*]}'
# Open a bash shell inside a running pod's container
kubectl exec -it podname -c containername -- bash
# Use sh when the container image doesn't include bash
kubectl exec -it podname -c containername -- sh
# Forward local port 8080 to a pod's port 80
kubectl port-forward podname 8080:80
# Delete a problematic pod to force a restart
kubectl delete pod podname
The jsonpath option extracts specific fields from complex pod definitions, allowing targeted analysis. The exec command provides shell access within a container, essential for verifying environment variables, checking file systems, or testing connectivity from inside the pod.
Practical Debugging Example:
# Check if a pod can resolve DNS
kubectl exec -it frontend-pod -- nslookup backend-service
# Verify connectivity to another service
kubectl exec -it frontend-pod -- curl -v backend-service:8080
# Inspect file system permissions
kubectl exec -it database-pod -- ls -la /data/db
Checking Deployments & ReplicaSets
Many issues originate at the deployment level:
# List all deployments
kubectl get deployments
# Get detailed deployment information
kubectl describe deployment deploymentname
# List all ReplicaSets
kubectl get rs
# Scale deployment to 3 replicas
kubectl scale deployment deploymentname --replicas=3
# Check deployment rollout status
kubectl rollout status deployment deploymentname
# Rollback to the previous deployment
kubectl rollout undo deployment deploymentname
# Restart a deployment
kubectl rollout restart deployment deploymentname
Deployments manage ReplicaSets, which in turn manage Pods. Understanding this relationship is crucial for troubleshooting. The rollout commands help manage deployment lifecycle, with undo being particularly useful for quickly reverting problematic deployments.
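The rollout commands above operate on a Deployment spec like the minimal sketch below (names and values are illustrative); the strategy block controls how aggressively pods are replaced during a rollout:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service          # illustrative name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # at most one pod down during a rollout
      maxSurge: 1            # at most one extra pod above the replica count
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service     # must match spec.selector
    spec:
      containers:
        - name: api
          image: example/api:1.0   # illustrative image
```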
Deployment Strategy Analysis:
# Check the deployment strategy
kubectl get deployment deploymentname -o jsonpath='{.spec.strategy}'
# See the history of rollouts
kubectl rollout history deployment deploymentname
# Compare specific revisions
kubectl rollout history deployment deploymentname --revision=2
Checking Services & Networking
Network connectivity issues are common in Kubernetes:
# List all services
kubectl get svc
# Get details of a specific service
kubectl describe svc servicename
# Show the actual pod IP behind a service
kubectl get endpoints servicename
# List all ingress resources
kubectl get ingress
# Get detailed ingress configuration
kubectl describe ingress ingressname
# List network policies
kubectl get networkpolicy
# Describe a specific network policy
kubectl describe networkpolicy policyname
Services provide stable network identities for pods through abstract endpoints. When debugging connectivity, check if endpoints exist for your service (kubectl get endpoints) and verify that selectors match pod labels. Ingress resources configure HTTP/HTTPS routing, and network policies define allowed traffic flows.
Network Debugging Process:
- Verify service definition: kubectl describe svc servicename
- Check if endpoints exist: kubectl get endpoints servicename
- Verify pod labels match service selector: kubectl get pods --show-labels
- Test connectivity from within the cluster: kubectl exec -it debug-pod -- curl servicename:port
- Check network policies: kubectl get networkpolicy
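If the last check turns up network policies in the namespace, remember that once any policy selects a pod, all traffic to that pod not explicitly allowed is denied. A minimal illustrative policy (hypothetical labels and ports) that admits only frontend-to-backend traffic on port 8080:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # illustrative name
spec:
  podSelector:
    matchLabels:
      app: backend          # policy applies to backend pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```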
Checking Persistent Volumes & Storage
Storage issues can cause persistent application problems:
# List all Persistent Volume Claims (PVCs)
kubectl get pvc
# Describe a specific PVC
kubectl describe pvc pvcname
# List all Persistent Volumes (PVs)
kubectl get pv
# Get details of a specific PV
kubectl describe pv pvname
Storage troubleshooting centers around the relationship between PVs (the actual storage) and PVCs (the request for storage). Common issues include mismatched access modes, capacity issues, or reclaim policy problems. When a pod is stuck in "ContainerCreating" with a mounted volume, investigate the PVC status first.
Storage Troubleshooting Checklist:
- PVC stuck in “Pending”: Check for available PVs, StorageClass issues, or capacity problems
- Pod can’t mount volume: Verify node has access to the storage backend
- Data persistence issues: Check reclaim policy, access modes, and filesystem permissions
- Performance problems: Investigate storage class parameters and underlying infrastructure
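For a PVC to bind, its storage class, access modes, and requested capacity must all be satisfiable by an existing PV or a provisioner. A minimal illustrative claim (hypothetical names); these three fields are the usual suspects when a claim sticks in Pending:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim             # illustrative name
spec:
  storageClassName: standard   # must name an existing StorageClass (or match a PV)
  accessModes:
    - ReadWriteOnce            # must be offered by the PV / storage backend
  resources:
    requests:
      storage: 10Gi            # must fit within an available PV's capacity
```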
Debugging with Ephemeral Containers (Kubernetes 1.23+)
For advanced debugging in newer Kubernetes versions:
# Add a debugging container to a running pod
kubectl debug podname -it --image=busybox
# Debug a node using a container
kubectl debug node/nodename --image=busybox
Ephemeral containers allow you to attach debugging tools to running pods without modifying their definition. This is particularly useful for minimal containers that don't include a shell or debugging utilities.
Example Debugging Session:
# Add a debugging container with network troubleshooting tools
kubectl debug nginx-pod -it --image=nicolaka/netshoot --target=nginx
# Now inside the debug container, you can run network diagnostics
$ ping google.com
$ tcpdump -i eth0
$ netstat -tuln
Restarting Resources
Sometimes, the quickest fix is a restart:
# Delete a pod to force a restart
kubectl delete pod podname
# Restart a deployment
kubectl rollout restart deployment deploymentname
# Delete all pods in the namespace
kubectl delete pod --all
While not always the most elegant solution, restarting resources can quickly resolve many issues, particularly those related to transient state or memory leaks. The rollout restart command performs a graceful rolling restart of all pods in a deployment.
Selective Restart Strategies:
# Restart only pods with a specific label
kubectl delete pods -l app=frontend
# Force deletion of stuck pods
kubectl delete pod stuck-pod --grace-period=0 --force
# Restart all deployments in a namespace
kubectl rollout restart deployment --all
Checking and Debugging Kubernetes DNS
DNS issues are a common source of application connectivity problems:
# Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS resolution inside a pod
kubectl exec -it podname -- nslookup myservice
# Use dig to test DNS resolution
kubectl exec -it podname -- dig myservice
Kubernetes DNS (typically CoreDNS) provides service discovery. When applications can't find services by name, check if CoreDNS pods are healthy, then test DNS resolution from within application pods.
DNS Troubleshooting Steps:
- Verify CoreDNS is running: kubectl get pods -n kube-system -l k8s-app=kube-dns
- Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns
- Test basic DNS resolution: kubectl exec -it test-pod -- nslookup kubernetes.default
- Test service resolution: kubectl exec -it test-pod -- nslookup myservice.namespace.svc.cluster.local
- Check stubDomains and upstream nameservers in the CoreDNS ConfigMap: kubectl get configmap -n kube-system coredns -o yaml
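The fully qualified names used in these lookups follow a fixed pattern, `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain defaults to cluster.local. A tiny helper (hypothetical function name) makes the construction explicit:

```shell
# Build the fully qualified DNS name for a Kubernetes service.
# Assumes the default cluster domain "cluster.local".
svc_fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc.cluster.local"
}

svc_fqdn myservice default   # → myservice.default.svc.cluster.local
```

Within a pod, shorter forms also resolve: `myservice` works for services in the same namespace, and `myservice.namespace` works across namespaces.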
Checking Role-Based Access Control (RBAC)
Authorization issues can prevent legitimate operations:
# Check if current user can list pods
kubectl auth can-i list pods
# Test permissions as another user
kubectl auth can-i delete deployments --as=admin
# List roles in the current namespace
kubectl get roles
# List cluster-wide roles
kubectl get clusterroles
# Get details about a RoleBinding
kubectl describe rolebinding myrolebinding
# Get details about a ClusterRoleBinding
kubectl describe clusterrolebinding myclusterrolebinding
RBAC troubleshooting focuses on verifying that users and service accounts have appropriate permissions. The auth can-i command is particularly useful for testing specific permissions, while describe rolebinding shows the association between roles and subjects.
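When auth can-i reports "no" for a service account, the usual fix is a Role plus RoleBinding pair. A minimal illustrative example (hypothetical names) granting read access to pods in a single namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader            # illustrative name
  namespace: default
rules:
  - apiGroups: [""]           # "" means the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods             # illustrative name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default             # the service account being granted access
    namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```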
Common RBAC Patterns:
# Check what API resources exist in the cluster
kubectl api-resources
# Check what verbs (actions) can be performed on pods
kubectl api-resources --verbs=list --namespaced -o name | grep pods
# Verify service account permissions
kubectl auth can-i list pods --as=system:serviceaccount:default:default
Troubleshooting API Server & Certificates
API server issues affect all Kubernetes operations:
# List certificate signing requests
kubectl get csr
# Check the API server health
kubectl get --raw /healthz
# Show the health of cluster components (componentstatus is deprecated since v1.19)
kubectl get componentstatus
# Start a proxy to access API server directly
kubectl proxy
The API server is the central management point for Kubernetes. Certificate issues often manifest as authentication failures, while API server health problems can cause widespread cluster disruption. The /healthz endpoint provides a quick way to verify API server status.
Certificate Management:
# Approve a certificate signing request
kubectl certificate approve csr-name
# Deny a certificate signing request
kubectl certificate deny csr-name
# View details of a CSR
kubectl get csr csr-name -o yaml
Using K9s for Interactive Troubleshooting (If Installed)
For a more interactive troubleshooting experience:
# Start K9s CLI tool for Kubernetes management
k9s
K9s provides a terminal-based UI that simplifies many Kubernetes operations. It offers real-time views of resources, logs, and metrics, with keyboard shortcuts for common actions.
K9s Key Features:
- Real-time resource monitoring
- Log streaming and searching
- Quick access to shell and describe operations
- Resource utilization views
- Port-forwarding capabilities
Advanced Troubleshooting Techniques
Beyond the basic commands, advanced troubleshooting may require specialized approaches:
Analyzing Resource Utilization:
# Get resource usage of all pods in a namespace
kubectl top pods
# Sort pods by CPU usage
kubectl top pods --sort-by=cpu
# Check resource requests and limits
kubectl describe pods | grep -E -A 3 "Requests|Limits"
Auditing Cluster Events:
# Get all recent events, sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp
# Filter events by reason
kubectl get events --field-selector reason=Failed
# Watch events in real-time
kubectl get events --watch
Inspecting Control Plane Logs (for self-managed clusters):
# SSH to control plane node and check component logs
sudo journalctl -u kubelet
sudo journalctl -u kube-apiserver
sudo journalctl -u kube-scheduler
sudo journalctl -u kube-controller-manager
Comprehensive Troubleshooting Workflows
Let’s put these commands together into workflows for common scenarios:
Workflow 1: Application Pod Won’t Start
- Check pod status: kubectl get pods
- Examine pod details: kubectl describe pod problematic-pod
- Check events: kubectl get events --sort-by=.metadata.creationTimestamp
- If "Pending": Check node resources with kubectl describe nodes
- If "ImagePullBackOff": Verify image name, registry access, and pull secrets
- If "CrashLoopBackOff": Check logs with kubectl logs problematic-pod --previous
Workflow 2: Service Connectivity Issues
- Verify service exists: kubectl get svc service-name
- Check service details: kubectl describe svc service-name
- Verify endpoints exist: kubectl get endpoints service-name
- Check pod labels match service selector: kubectl get pods --show-labels
- Test DNS resolution: kubectl exec -it test-pod -- nslookup service-name
- Test direct connectivity: kubectl exec -it test-pod -- curl pod-ip:port
- Check network policies: kubectl get networkpolicy
Workflow 3: Node Problems
- Check node status: kubectl get nodes
- Examine node details: kubectl describe node problematic-node
- Check node resource usage: kubectl top node
- Verify kubelet is running (on the node): systemctl status kubelet
- Check kubelet logs (on the node): journalctl -u kubelet
- Review pods on the node: kubectl get pods --field-selector spec.nodeName=problematic-node
Workflow 4: Deployment Rollout Issues
- Check deployment status: kubectl get deployment deployment-name
- Examine rollout status: kubectl rollout status deployment deployment-name
- Check replica sets: kubectl get rs | grep deployment-name
- Examine pod status: kubectl get pods | grep deployment-name
- Check events: kubectl get events --sort-by=.metadata.creationTimestamp
- If needed, rollback: kubectl rollout undo deployment deployment-name
Best Practices for Kubernetes Troubleshooting
- Start Broad, Then Narrow: Begin with high-level resource checks, then drill down to specifics.
- Check Events Early and Often: Events provide crucial chronological information about what’s happening in your cluster.
- Use Labels for Filtering: Leverage Kubernetes labels to filter resources and focus your troubleshooting.
- Create Dedicated Debug Pods: Deploy utility pods with debugging tools in affected namespaces.
- Understand Resource Hierarchies: Problems often cascade from parent resources (Deployments) to children (ReplicaSets, Pods).
- Maintain Historical Logs: Implement cluster-wide logging to retain historical context for troubleshooting.
- Document Common Issues: Create runbooks for recurring problems with step-by-step resolution procedures.
- Use Monitoring and Alerting: Proactively detect issues through comprehensive monitoring.
- Check Multiple Levels: Some issues appear in application logs, others in node journals or control plane components.
- Practice Chaos Engineering: Regularly test failure scenarios to improve troubleshooting skills.
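For the "Create Dedicated Debug Pods" practice above, a throwaway pod with networking tools is often all you need. Imperatively, `kubectl run debug --rm -it --image=nicolaka/netshoot -- bash` works; declaratively, a minimal manifest might look like this (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug                    # illustrative name
  labels:
    purpose: debugging
spec:
  restartPolicy: Never           # a one-off tool pod, not a workload
  containers:
    - name: tools
      image: nicolaka/netshoot   # bundles curl, dig, tcpdump, and similar tools
      command: ["sleep", "infinity"]   # keep the pod alive for exec sessions
```

Attach with kubectl exec -it debug -- bash, and delete the pod when finished.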
Conclusion
Kubernetes troubleshooting is both an art and a science, requiring systematic investigation, deep platform knowledge, and practical experience. This comprehensive command reference provides the tools needed to diagnose and resolve most Kubernetes issues you’ll encounter in production environments.
By organizing your troubleshooting approach around resource types (clusters, nodes, pods, deployments, services, etc.) and following the workflows outlined here, you’ll be able to quickly identify the root cause of problems and implement effective solutions.
Remember that effective Kubernetes troubleshooting combines command-line expertise with a thorough understanding of how Kubernetes components interact. Use this guide as your companion in the journey toward Kubernetes operational excellence.
Cheers,
Sim