
Kubernetes Troubleshooting
- Author: Ram Simran G
- Twitter: @rgarimella0124
As Kubernetes environments grow in complexity, having a systematic approach to troubleshooting becomes essential. This comprehensive guide catalogs every critical kubectl command you’ll need to diagnose and resolve issues in your Kubernetes clusters, organized by functional area.
Introduction
When production issues arise in Kubernetes environments, time is critical. This guide serves as a complete reference for DevOps engineers, platform teams, and SREs who need to quickly identify and resolve problems in their Kubernetes clusters. We’ll cover everything from basic cluster health checks to advanced debugging techniques, with practical examples for each command.
Checking Cluster Status
Begin your troubleshooting journey by assessing the overall health of your cluster:
# Show cluster details and endpoints
kubectl cluster-info
# List all nodes in the cluster
kubectl get nodes
# Show nodes with additional details
kubectl get nodes -o wide
# Get detailed information about a node
kubectl describe node nodename
# Show CPU & memory usage of nodes
kubectl top node
The cluster-info command provides a quick overview of your control plane and core services. When nodes report "NotReady" status, use describe node to investigate the conditions section for clues. The top node command helps identify resource constraints that might affect pod scheduling.
Practical Example:
$ kubectl top node
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-node1   856m         42%    7892Mi          52%
worker-node2   1532m        76%    13948Mi         93%    # High memory usage detected
worker-node3   215m         10%    2957Mi          19%
In this example, worker-node2 shows high memory utilization (93%), which could cause pod evictions or prevent new pods from being scheduled.
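Readings like these can also be triaged programmatically. The sketch below parses the sample output above (hard-coded for illustration; on a live cluster you would pipe the real command into the same awk filter) and flags any node above a memory threshold:

```shell
# Sample `kubectl top node` output; on a live cluster replace the
# variable with: kubectl top node | awk 'NR>1 { ... }'
top_output='NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-node1   856m         42%    7892Mi          52%
worker-node2   1532m        76%    13948Mi         93%
worker-node3   215m         10%    2957Mi          19%'

# Print the names of nodes whose memory usage exceeds 80%
hot_nodes=$(echo "$top_output" | awk 'NR>1 { gsub(/%/, "", $5); if ($5+0 > 80) print $1 }')
echo "$hot_nodes"   # → worker-node2
```

The same pattern (strip the `%`, compare numerically) works for the CPU% column by switching `$5` to `$3`.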
Checking Pod Status
Once you’ve verified cluster health, examine your application pods:
# List all pods in the current namespace
kubectl get pods
# List all pods in all namespaces
kubectl get pods -A
# Show pods with node assignments
kubectl get pods -o wide
# Get detailed information about a pod
kubectl describe pod podname
# Show CPU & memory usage of pods
kubectl top pod
# Get full YAML definition of a pod
kubectl get pod podname -o yaml
# Show recent events for troubleshooting
kubectl get events --sort-by=.metadata.creationTimestamp
The get pods command quickly surfaces problematic pods by status (CrashLoopBackOff, Error, Pending). The -o wide flag is invaluable because it shows node assignments, which helps correlate pod issues with potential node problems. The --sort-by flag orders events chronologically, so the most recent events appear at the bottom of the output.
Pod Status Troubleshooting Matrix:
| Status | Common Causes | First Commands to Run |
|---|---|---|
| Pending | Resource constraints, PVC binding issues | kubectl describe pod, kubectl get events |
| CrashLoopBackOff | Application errors, liveness probe failures | kubectl logs, kubectl describe pod |
| Error | Image pull failures, init container errors | kubectl describe pod, kubectl get events |
| Completed | Job or cronjob ran successfully | kubectl logs to verify expected output |
| Running but unhealthy | Readiness probe failures, application issues | kubectl describe pod, kubectl logs, kubectl exec |
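The statuses in the matrix can be surfaced mechanically. Note that `--field-selector status.phase=...` only matches true pod phases (Pending, Running, Succeeded, Failed, Unknown); reasons like CrashLoopBackOff appear only in the STATUS column, so a text filter is needed for those. A sketch over captured sample output (hypothetical pod names):

```shell
# Captured `kubectl get pods` output (hypothetical pods); on a live
# cluster, pipe the command itself:
#   kubectl get pods --field-selector=status.phase=Pending   # true phases only
#   kubectl get pods | awk 'NR==1 || $3 != "Running"'        # STATUS column filter
pods='NAME        READY   STATUS             RESTARTS   AGE
web-7d4b9   0/1     CrashLoopBackOff   12         1h
api-5c6f8   1/1     Running            0          3d
job-1a2b3   0/1     Pending            0          5m'

# Keep the header plus any pod whose STATUS is not Running
unhealthy=$(echo "$pods" | awk 'NR==1 || $3 != "Running"')
echo "$unhealthy"
```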
Checking Logs
Logs are essential for diagnosing application-specific issues:
# Show logs for a pod's main container
kubectl logs podname
# Show logs for a specific container in a pod
kubectl logs podname -c containername
# Show logs from a previous crashed container
kubectl logs podname --previous
# Stream live logs from a pod
kubectl logs -f podname
# Show the last 100 log lines for all pods labeled "app=myapp"
kubectl logs -l app=myapp --tail=100
The --previous flag is particularly valuable when pods have restarted after a crash, letting you see what happened before the failure. For multi-container pods, the -c flag targets a specific container. The -f flag streams logs in real-time, which is useful for observing behavior during testing.
Advanced Log Analysis:
# Combine logs from multiple pods with the same label
kubectl logs -l app=frontend --tail=50 | grep ERROR
# Check logs within a specific time window
kubectl logs podname --since=1h
# Save logs to a file for analysis
kubectl logs deployment/api-service --all-containers=true > api-logs.txt
Troubleshooting Pods
For deeper investigation of problematic pods:
# View full YAML definition of a pod
kubectl get pod podname -o yaml
# Get detailed information about a pod
kubectl describe pod podname
# Show cluster events to check failures
kubectl get events
# Show detailed pod conditions
kubectl get pod podname -o jsonpath='{.status.conditions[*]}'
# Open a bash shell inside a running pod's container
kubectl exec -it podname -c containername -- bash
# Use sh when the container image doesn't include bash
kubectl exec -it podname -c containername -- sh
# Forward local port 8080 to a pod's port 80
kubectl port-forward podname 8080:80
# Delete a problematic pod to force a restart
kubectl delete pod podname
The jsonpath option extracts specific fields from complex pod definitions, allowing targeted analysis. The exec command provides shell access within a container, essential for verifying environment variables, checking file systems, or testing connectivity from inside the pod.
Practical Debugging Example:
# Check if a pod can resolve DNS
kubectl exec -it frontend-pod -- nslookup backend-service
# Verify connectivity to another service
kubectl exec -it frontend-pod -- curl -v backend-service:8080
# Inspect file system permissions
kubectl exec -it database-pod -- ls -la /data/db
Checking Deployments & ReplicaSets
Many issues originate at the deployment level:
# List all deployments
kubectl get deployments
# Get detailed deployment information
kubectl describe deployment deploymentname
# List all ReplicaSets
kubectl get rs
# Scale deployment to 3 replicas
kubectl scale deployment deploymentname --replicas=3
# Check deployment rollout status
kubectl rollout status deployment deploymentname
# Rollback to the previous deployment
kubectl rollout undo deployment deploymentname
# Restart a deployment
kubectl rollout restart deployment deploymentname
Deployments manage ReplicaSets, which in turn manage Pods. Understanding this relationship is crucial for troubleshooting. The rollout commands help manage deployment lifecycle, with undo being particularly useful for quickly reverting problematic deployments.
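The rollout commands above operate on a Deployment spec like the minimal sketch below (names and values are illustrative); the strategy block controls how aggressively pods are replaced during a rollout:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service          # illustrative name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # at most one pod down during a rollout
      maxSurge: 1            # at most one extra pod above the replica count
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service     # must match spec.selector
    spec:
      containers:
        - name: api
          image: example/api:1.0   # illustrative image
```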
Deployment Strategy Analysis:
# Check the deployment strategy
kubectl get deployment deploymentname -o jsonpath='{.spec.strategy}'
# See the history of rollouts
kubectl rollout history deployment deploymentname
# Compare specific revisions
kubectl rollout history deployment deploymentname --revision=2
Checking Services & Networking
Network connectivity issues are common in Kubernetes:
# List all services
kubectl get svc
# Get details of a specific service
kubectl describe svc servicename
# Show the actual pod IP behind a service
kubectl get endpoints servicename
# List all ingress resources
kubectl get ingress
# Get detailed ingress configuration
kubectl describe ingress ingressname
# List network policies
kubectl get networkpolicy
# Describe a specific network policy
kubectl describe networkpolicy policyname
Services provide stable network identities for pods through abstract endpoints. When debugging connectivity, check if endpoints exist for your service (kubectl get endpoints) and verify that selectors match pod labels. Ingress resources configure HTTP/HTTPS routing, and network policies define allowed traffic flows.
Network Debugging Process:
- Verify service definition: kubectl describe svc servicename
- Check if endpoints exist: kubectl get endpoints servicename
- Verify pod labels match service selector: kubectl get pods --show-labels
- Test connectivity from within the cluster: kubectl exec -it debug-pod -- curl servicename:port
- Check network policies: kubectl get networkpolicy
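If the last check turns up network policies in the namespace, remember that once any policy selects a pod, all traffic to that pod not explicitly allowed is denied. A minimal illustrative policy (hypothetical labels and ports) that admits only frontend-to-backend traffic on port 8080:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # illustrative name
spec:
  podSelector:
    matchLabels:
      app: backend          # policy applies to backend pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```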
Checking Persistent Volumes & Storage
Storage issues can cause persistent application problems:
# List all Persistent Volume Claims (PVCs)
kubectl get pvc
# Describe a specific PVC
kubectl describe pvc pvcname
# List all Persistent Volumes (PVs)
kubectl get pv
# Get details of a specific PV
kubectl describe pv pvname
Storage troubleshooting centers around the relationship between PVs (the actual storage) and PVCs (the request for storage). Common issues include mismatched access modes, capacity issues, or reclaim policy problems. When a pod is stuck in "ContainerCreating" with a mounted volume, investigate the PVC status first.
Storage Troubleshooting Checklist:
- PVC stuck in “Pending”: Check for available PVs, StorageClass issues, or capacity problems
- Pod can’t mount volume: Verify node has access to the storage backend
- Data persistence issues: Check reclaim policy, access modes, and filesystem permissions
- Performance problems: Investigate storage class parameters and underlying infrastructure
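For a PVC to bind, its storage class, access modes, and requested capacity must all be satisfiable by an existing PV or a provisioner. A minimal illustrative claim (hypothetical names); these three fields are the usual suspects when a claim sticks in Pending:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim             # illustrative name
spec:
  storageClassName: standard   # must name an existing StorageClass (or match a PV)
  accessModes:
    - ReadWriteOnce            # must be offered by the PV / storage backend
  resources:
    requests:
      storage: 10Gi            # must fit within an available PV's capacity
```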
Debugging with Ephemeral Containers (Kubernetes 1.23+)
For advanced debugging in newer Kubernetes versions:
# Add a debugging container to a running pod
kubectl debug podname -it --image=busybox
# Debug a node using a container
kubectl debug node/nodename --image=busybox
Ephemeral containers allow you to attach debugging tools to running pods without modifying their definition. This is particularly useful for minimal containers that don't include a shell or debugging utilities.
Example Debugging Session:
# Add a debugging container with network troubleshooting tools
kubectl debug nginx-pod -it --image=nicolaka/netshoot --target=nginx
# Now inside the debug container, you can run network diagnostics
$ ping google.com
$ tcpdump -i eth0
$ netstat -tuln
Restarting Resources
Sometimes, the quickest fix is a restart:
# Delete a pod to force a restart
kubectl delete pod podname
# Restart a deployment
kubectl rollout restart deployment deploymentname
# Delete all pods in the namespace
kubectl delete pod --all
While not always the most elegant solution, restarting resources can quickly resolve many issues, particularly those related to transient state or memory leaks. The rollout restart command performs a graceful rolling restart of all pods in a deployment.
Selective Restart Strategies:
# Restart only pods with a specific label
kubectl delete pods -l app=frontend
# Force deletion of stuck pods
kubectl delete pod stuck-pod --grace-period=0 --force
# Restart all deployments in a namespace
kubectl rollout restart deployment --all
Checking and Debugging Kubernetes DNS
DNS issues are a common source of application connectivity problems:
# Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS resolution inside a pod
kubectl exec -it podname -- nslookup myservice
# Use dig to test DNS resolution
kubectl exec -it podname -- dig myservice
Kubernetes DNS (typically CoreDNS) provides service discovery. When applications can't find services by name, check if CoreDNS pods are healthy, then test DNS resolution from within application pods.
DNS Troubleshooting Steps:
- Verify CoreDNS is running: kubectl get pods -n kube-system -l k8s-app=kube-dns
- Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns
- Test basic DNS resolution: kubectl exec -it test-pod -- nslookup kubernetes.default
- Test service resolution: kubectl exec -it test-pod -- nslookup myservice.namespace.svc.cluster.local
- Check stubDomains and upstream nameservers in the CoreDNS ConfigMap: kubectl get configmap -n kube-system coredns -o yaml
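The fully qualified names used in these lookups follow a fixed pattern, `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain defaults to cluster.local. A tiny helper (hypothetical function name) makes the construction explicit:

```shell
# Build the fully qualified DNS name for a Kubernetes service.
# Assumes the default cluster domain "cluster.local".
svc_fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc.cluster.local"
}

svc_fqdn myservice default   # → myservice.default.svc.cluster.local
```

Within a pod, shorter forms also resolve: `myservice` works for services in the same namespace, and `myservice.namespace` works across namespaces.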
Checking Role-Based Access Control (RBAC)
Authorization issues can prevent legitimate operations:
# Check if current user can list pods
kubectl auth can-i list pods
# Test permissions as another user
kubectl auth can-i delete deployments --as=admin
# List roles in the current namespace
kubectl get roles
# List cluster-wide roles
kubectl get clusterroles
# Get details about a RoleBinding
kubectl describe rolebinding myrolebinding
# Get details about a ClusterRoleBinding
kubectl describe clusterrolebinding myclusterrolebinding
RBAC troubleshooting focuses on verifying that users and service accounts have appropriate permissions. The auth can-i command is particularly useful for testing specific permissions, while describe rolebinding shows the association between roles and subjects.
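When auth can-i reports "no" for a service account, the usual fix is a Role plus RoleBinding pair. A minimal illustrative example (hypothetical names) granting read access to pods in a single namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader            # illustrative name
  namespace: default
rules:
  - apiGroups: [""]           # "" means the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods             # illustrative name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default             # the service account being granted access
    namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```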
Common RBAC Patterns:
# Check what API resources exist in the cluster
kubectl api-resources
# Check what verbs (actions) can be performed on pods
kubectl api-resources --verbs=list --namespaced -o name | grep pods
# Verify service account permissions
kubectl auth can-i list pods --as=system:serviceaccount:default:default
Troubleshooting API Server & Certificates
API server issues affect all Kubernetes operations:
# List certificate signing requests
kubectl get csr
# Check the API server health
kubectl get --raw /healthz
# Show the health of cluster components (componentstatus is deprecated since v1.19)
kubectl get componentstatus
# Start a proxy to access API server directly
kubectl proxy
The API server is the central management point for Kubernetes. Certificate issues often manifest as authentication failures, while API server health problems can cause widespread cluster disruption. The /healthz endpoint provides a quick way to verify API server status.
Certificate Management:
# Approve a certificate signing request
kubectl certificate approve csr-name
# Deny a certificate signing request
kubectl certificate deny csr-name
# View details of a CSR
kubectl get csr csr-name -o yaml
Using K9s for Interactive Troubleshooting (If Installed)
For a more interactive troubleshooting experience:
# Start K9s CLI tool for Kubernetes management
k9s
K9s provides a terminal-based UI that simplifies many Kubernetes operations. It offers real-time views of resources, logs, and metrics, with keyboard shortcuts for common actions.
K9s Key Features:
- Real-time resource monitoring
- Log streaming and searching
- Quick access to shell and describe operations
- Resource utilization views
- Port-forwarding capabilities
Advanced Troubleshooting Techniques
Beyond the basic commands, advanced troubleshooting may require specialized approaches:
Analyzing Resource Utilization:
# Get resource usage of all pods in a namespace
kubectl top pods
# Sort pods by CPU usage
kubectl top pods --sort-by=cpu
# Check resource requests and limits
kubectl describe pods | grep -E -A 3 "Requests|Limits"
Auditing Cluster Events:
# Get all recent events, sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp
# Filter events by reason
kubectl get events --field-selector reason=Failed
# Watch events in real-time
kubectl get events --watch
Inspecting Control Plane Logs (for self-managed clusters):
# SSH to control plane node and check component logs
sudo journalctl -u kubelet
sudo journalctl -u kube-apiserver
sudo journalctl -u kube-scheduler
sudo journalctl -u kube-controller-manager
Comprehensive Troubleshooting Workflows
Let’s put these commands together into workflows for common scenarios:
Workflow 1: Application Pod Won’t Start
- Check pod status: kubectl get pods
- Examine pod details: kubectl describe pod problematic-pod
- Check events: kubectl get events --sort-by=.metadata.creationTimestamp
- If "Pending": Check node resources with kubectl describe nodes
- If "ImagePullBackOff": Verify image name, registry access, and pull secrets
- If "CrashLoopBackOff": Check logs with kubectl logs problematic-pod --previous
Workflow 2: Service Connectivity Issues
- Verify service exists: kubectl get svc service-name
- Check service details: kubectl describe svc service-name
- Verify endpoints exist: kubectl get endpoints service-name
- Check pod labels match service selector: kubectl get pods --show-labels
- Test DNS resolution: kubectl exec -it test-pod -- nslookup service-name
- Test direct connectivity: kubectl exec -it test-pod -- curl pod-ip:port
- Check network policies: kubectl get networkpolicy
Workflow 3: Node Problems
- Check node status: kubectl get nodes
- Examine node details: kubectl describe node problematic-node
- Check node resource usage: kubectl top node
- Verify kubelet is running (on the node): systemctl status kubelet
- Check kubelet logs (on the node): journalctl -u kubelet
- Review pods on the node: kubectl get pods --field-selector spec.nodeName=problematic-node
Workflow 4: Deployment Rollout Issues
- Check deployment status: kubectl get deployment deployment-name
- Examine rollout status: kubectl rollout status deployment deployment-name
- Check replica sets: kubectl get rs | grep deployment-name
- Examine pod status: kubectl get pods | grep deployment-name
- Check events: kubectl get events --sort-by=.metadata.creationTimestamp
- If needed, rollback: kubectl rollout undo deployment deployment-name
Best Practices for Kubernetes Troubleshooting
- Start Broad, Then Narrow: Begin with high-level resource checks, then drill down to specifics.
- Check Events Early and Often: Events provide crucial chronological information about what’s happening in your cluster.
- Use Labels for Filtering: Leverage Kubernetes labels to filter resources and focus your troubleshooting.
- Create Dedicated Debug Pods: Deploy utility pods with debugging tools in affected namespaces.
- Understand Resource Hierarchies: Problems often cascade from parent resources (Deployments) to children (ReplicaSets, Pods).
- Maintain Historical Logs: Implement cluster-wide logging to retain historical context for troubleshooting.
- Document Common Issues: Create runbooks for recurring problems with step-by-step resolution procedures.
- Use Monitoring and Alerting: Proactively detect issues through comprehensive monitoring.
- Check Multiple Levels: Some issues appear in application logs, others in node journals or control plane components.
- Practice Chaos Engineering: Regularly test failure scenarios to improve troubleshooting skills.
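For the "Create Dedicated Debug Pods" practice above, a throwaway pod with networking tools is often all you need. Imperatively, `kubectl run debug --rm -it --image=nicolaka/netshoot -- bash` works; declaratively, a minimal manifest might look like this (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug                    # illustrative name
  labels:
    purpose: debugging
spec:
  restartPolicy: Never           # a one-off tool pod, not a workload
  containers:
    - name: tools
      image: nicolaka/netshoot   # bundles curl, dig, tcpdump, and similar tools
      command: ["sleep", "infinity"]   # keep the pod alive for exec sessions
```

Attach with kubectl exec -it debug -- bash, and delete the pod when finished.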
Conclusion
Kubernetes troubleshooting is both an art and a science, requiring systematic investigation, deep platform knowledge, and practical experience. This comprehensive command reference provides the tools needed to diagnose and resolve most Kubernetes issues you’ll encounter in production environments.
By organizing your troubleshooting approach around resource types (clusters, nodes, pods, deployments, services, etc.) and following the workflows outlined here, you’ll be able to quickly identify the root cause of problems and implement effective solutions.
Remember that effective Kubernetes troubleshooting combines command-line expertise with a thorough understanding of how Kubernetes components interact. Use this guide as your companion in the journey toward Kubernetes operational excellence.
Cheers,
Sim