Mastering Kubernetes Logs

Kubernetes has revolutionized container orchestration, but with great power comes great complexity. When issues arise in your cluster, logs become your best friend for troubleshooting and understanding what’s happening under the hood. However, Kubernetes generates logs at multiple levels, each serving different purposes and containing different types of information.

Understanding where to look and what each log type tells you can mean the difference between quickly resolving an issue and spending hours debugging in the dark. This comprehensive guide will walk you through every type of Kubernetes log, where to find them, and how to interpret what they’re telling you.

The Kubernetes Logging Landscape

Kubernetes logging operates on multiple layers, from individual containers to the entire cluster infrastructure. Each layer provides unique insights into different aspects of your system’s behavior. Let’s explore each type systematically.

Container and Pod Level Logging

Container Logs: The Foundation of Troubleshooting

Location: /var/log/containers/*.log (on each node; these files are symlinks into the per-pod directories under /var/log/pods/)

Container logs are often your first stop when debugging application issues. These logs capture everything your application writes to stdout and stderr, making them essential for understanding application behavior, errors, and performance issues.

What you’ll find here:

  • Application error messages and stack traces
  • Runtime exceptions and crashes
  • Configuration problems
  • Performance bottlenecks
  • Custom application logging output

Common scenarios for checking container logs:

  • Your application is crashing repeatedly
  • Users report errors or unexpected behavior
  • Performance issues like slow response times
  • Configuration validation problems

Pro tip: Use kubectl logs <pod-name> -c <container-name> to access these logs directly, or kubectl logs <pod-name> --previous to see logs from a crashed container.
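
A few variations of kubectl logs cover most day-to-day situations; the pod and container names below are placeholders:

# Stream logs from a running pod
kubectl logs my-pod -f

# Logs from a specific container in a multi-container pod
kubectl logs my-pod -c my-container

# Logs from the previous (crashed) instance of a container
kubectl logs my-pod --previous

# Limit output to the last hour or the last 100 lines
kubectl logs my-pod --since=1h --tail=100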

Pod Logs: Understanding Multi-Container Interactions

Location: /var/log/pods/&lt;namespace&gt;_&lt;pod-name&gt;_&lt;pod-uid&gt;/&lt;container-name&gt;/*.log

Pod logs provide a broader view than individual container logs, especially valuable when dealing with multi-container pods. These logs help you understand how containers within a pod interact with each other and can reveal issues that aren’t apparent when looking at containers in isolation.

What you’ll find here:

  • Inter-container communication issues
  • Shared volume problems
  • Network connectivity issues between containers
  • Resource contention between containers in the same pod
  • Init container problems affecting main containers

When to check pod logs:

  • Sidecar containers aren’t working correctly
  • Service mesh issues (like Istio proxy problems)
  • Init containers are failing
  • Containers can’t communicate within the pod
  • Shared volume mounting issues
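
When debugging multi-container pods, it often helps to pull logs for every container at once or to target an init container directly. A rough sketch, with pod, container, and label names as placeholders:

# Logs from all containers in a pod
kubectl logs my-pod --all-containers=true

# Logs from a specific init container
kubectl logs my-pod -c init-db-migration

# Logs from all pods matching a label selector, prefixed with pod/container names
kubectl logs -l app=my-app --all-containers=true --prefix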

Kubelet: The Node Agent’s Perspective

Location: /var/log/kubelet.log (on systemd-based nodes the kubelet usually logs to the journal instead; use journalctl -u kubelet)

The kubelet is the primary node agent responsible for managing pods and containers on each node. Its logs provide crucial insights into pod lifecycle management, resource allocation, and communication with the control plane.

What you’ll find here:

  • Pod scheduling and startup issues
  • Resource allocation problems (CPU, memory, storage)
  • Image pulling failures
  • Volume mounting problems
  • Node health and capacity issues
  • Communication problems with the API server

Critical scenarios for kubelet logs:

  • Pods are stuck in “Pending” state
  • Containers fail to start
  • Image pull errors
  • Resource constraints causing pod evictions
  • Node becoming unresponsive or marked as “NotReady”

Example issues you might see:

Failed to pull image "nginx:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied
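
Where the kubelet's own logs live depends on how the node is set up; on systemd-based nodes a sketch like the following usually works, and many kubelet-level problems also surface as events you can read without node access (pod name is a placeholder):

# Kubelet logs on a systemd-based node
journalctl -u kubelet --since "1 hour ago"

# Pod events (image pull errors, failed mounts, evictions) via the API
kubectl describe pod my-pod
kubectl get events --sort-by=.metadata.creationTimestamp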

Control Plane Components: The Brain of Your Cluster

The control plane consists of several critical components, each with its own logging that provides insights into cluster-wide operations.

API Server Logs: The Central Nervous System

Location: /var/log/kube-apiserver.log (or the API server's container logs in the kube-system namespace when it runs as a static pod)

The API server is the central hub of your Kubernetes cluster. Every request to create, modify, or delete resources goes through the API server, making its logs invaluable for understanding cluster-wide operations and access patterns.

What you’ll find here:

  • Authentication and authorization failures
  • API request patterns and performance
  • Resource validation errors
  • Webhook failures
  • Client connection issues
  • Rate limiting events

Key debugging scenarios:

  • Users can’t access cluster resources
  • kubectl commands are failing
  • Custom resource definitions aren’t working
  • Admission controllers are rejecting requests
  • Performance issues with API operations
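
On kubeadm-style clusters the API server runs as a static pod, so its logs are reachable with kubectl as long as the API server is still answering; otherwise you will need to read the log files on the control plane node directly. A sketch, with the node name as a placeholder:

# API server logs when it runs as a static pod (kubeadm-style clusters)
kubectl logs -n kube-system kube-apiserver-control-plane-1

# Or select it by label
kubectl logs -n kube-system -l component=kube-apiserver --tail=200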

Controller Manager Logs: Resource Lifecycle Management

Location: /var/log/kube-controller-manager.log

The controller manager runs various controllers that manage the lifecycle of Kubernetes resources. These logs are essential for understanding why resources aren’t reaching their desired state.

What you’ll find here:

  • ReplicaSet scaling issues
  • Deployment rollout problems
  • Service endpoint updates
  • Node controller issues
  • Namespace deletion problems
  • Resource quota enforcement

Common troubleshooting scenarios:

  • Deployments are stuck and not progressing
  • ReplicaSets aren’t creating the right number of pods
  • Services aren’t updating their endpoints
  • Nodes aren’t being properly managed
  • Resource cleanup issues
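
Before diving into controller manager logs, it is usually worth confirming what the affected workload itself reports; a rough sequence, with names as placeholders:

# See why a rollout is stuck
kubectl rollout status deployment/my-deployment
kubectl describe deployment my-deployment

# Controller manager logs on a kubeadm-style cluster
kubectl logs -n kube-system -l component=kube-controller-manager --tail=200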

Scheduler Logs: Pod Placement Decisions

Location: /var/log/kube-scheduler.log

The scheduler is responsible for placing pods on appropriate nodes based on resource requirements, constraints, and policies. Scheduler logs help you understand why pods aren’t being scheduled or are placed on unexpected nodes.

What you’ll find here:

  • Pod scheduling decisions and rationale
  • Resource constraint violations
  • Node affinity and anti-affinity rule evaluations
  • Taints and tolerations processing
  • Priority and preemption events

When to check scheduler logs:

  • Pods remain in “Pending” state
  • Pods are scheduled on inappropriate nodes
  • Resource requests aren’t being honored
  • Affinity rules aren’t working as expected
  • Priority classes aren’t being respected
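
For a pod stuck in Pending, the scheduler's reasoning is usually surfaced as events on the pod itself, so kubectl describe is a faster first step than the raw scheduler logs (pod name is a placeholder):

# FailedScheduling events explain unsatisfied resources, taints, or affinity rules
kubectl describe pod my-pending-pod

# Scheduler logs on a kubeadm-style cluster
kubectl logs -n kube-system -l component=kube-scheduler --tail=200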

etcd Logs: The Cluster’s Memory

Location: Varies based on etcd deployment method (commonly the etcd static pod's container logs, or journalctl -u etcd when etcd runs as a systemd service)

etcd is Kubernetes’ distributed key-value store that maintains all cluster state. Its logs are crucial for understanding data consistency issues and cluster stability problems.

What you’ll find here:

  • Data consistency problems
  • Leader election issues
  • Network partition recovery
  • Backup and restore operations
  • Performance and latency issues

Critical scenarios:

  • Cluster state inconsistencies
  • etcd cluster member failures
  • Split-brain scenarios
  • Data corruption issues
  • Performance degradation
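
If etcd runs as a static pod, as on kubeadm clusters, its logs and health are reachable roughly as sketched below; the certificate paths are the kubeadm defaults and may differ in your environment:

# etcd logs when it runs as a static pod
kubectl logs -n kube-system -l component=etcd --tail=200

# Check member health from a control plane node (kubeadm default cert paths)
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key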

Node-Level Insights

System Logs (Syslog): The Operating System Perspective

Location: Local to each Linux distribution (typically /var/log/syslog or /var/log/messages)

System logs provide the operating system’s view of what’s happening on each node, including hardware issues, kernel problems, and system-level resource constraints.

What you’ll find here:

  • Hardware failures and warnings
  • Kernel panics and crashes
  • Memory pressure and OOM kills
  • Disk space and I/O issues
  • Network interface problems
  • Security events and violations

When to investigate system logs:

  • Nodes become unresponsive
  • Mysterious pod crashes
  • Performance degradation
  • Hardware alerts
  • Security incidents
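
OOM kills in particular show up in the kernel log rather than in any Kubernetes component, so a quick check on a suspect node might look like this:

# Kernel messages, including OOM killer activity
journalctl -k --since "2 hours ago" | grep -i "out of memory"
dmesg -T | grep -i "killed process"

# General system log, depending on the distribution
tail -f /var/log/syslog      # Debian/Ubuntu
tail -f /var/log/messages    # RHEL/CentOS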

Application-Level Logging

Application Logs: Your Code’s Voice

Location: /var/log/app.log (or wherever your application writes logs)

Application logs contain the business logic output of your containerized applications. These logs are customized by your development team and contain application-specific information. In Kubernetes, the simplest way to expose them is to write to stdout/stderr so they are captured as container logs, though some applications also write to files inside the container or on mounted volumes.

What you’ll typically find:

  • Business logic errors
  • User interaction patterns
  • Performance metrics
  • Security events
  • Custom debugging information
  • Integration failures with external services

Custom Logs: Specialized Monitoring

Location: /var/log/custom-app.log

Custom logs are specialized logging outputs created for specific use cases, monitoring requirements, or compliance needs.

Examples include:

  • Audit logs for compliance
  • Security event logs
  • Performance monitoring data
  • Custom metrics and analytics
  • Integration logs with external systems

Best Practices for Kubernetes Logging

1. Implement Centralized Logging

Don’t rely on accessing logs directly on nodes. Implement a centralized logging solution using tools like:

  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • EFK Stack (Elasticsearch, Fluentd, Kibana)
  • Grafana Loki with Promtail
  • Cloud-native solutions like AWS CloudWatch, Google Cloud Logging, or Azure Monitor
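
As one illustrative option, a Loki-based stack can be installed with Helm roughly as follows; chart names and values change over time, so treat this as a sketch rather than a recipe:

# Add the Grafana chart repository and install a Loki + Promtail stack
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --namespace logging --create-namespace \
  --set promtail.enabled=true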

2. Structure Your Application Logs

Ensure your applications output structured logs (JSON format) with consistent fields:

{
  "timestamp": "2024-05-31T10:30:00Z",
  "level": "ERROR",
  "message": "Failed to connect to database",
  "service": "user-service",
  "traceId": "abc123",
  "error": "connection timeout"
}
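
One immediate payoff of structured logs is that they can be filtered mechanically. Assuming each log line is a JSON object and jq is installed locally, something like this pulls out only the error entries (pod name is a placeholder):

# Filter structured (NDJSON) logs down to errors
kubectl logs my-pod | jq 'select(.level == "ERROR")'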

3. Use Log Levels Appropriately

Implement proper log levels in your applications:

  • DEBUG: Detailed information for development
  • INFO: General operational information
  • WARN: Potentially harmful situations
  • ERROR: Error events that don’t stop the application
  • FATAL: Severe errors that cause application termination

4. Implement Log Rotation and Retention

Configure appropriate log rotation and retention policies to prevent disk space issues:

  • Set maximum log file sizes
  • Implement automated cleanup of old logs
  • Consider compliance requirements for log retention
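
On the node side, the kubelet itself rotates container log files. A sketch of the relevant settings, shown here as command-line flags with example values (newer versions prefer the containerLogMaxSize and containerLogMaxFiles fields in the kubelet configuration file):

# Kubelet flags controlling container log rotation (values are examples)
kubelet --container-log-max-size=10Mi --container-log-max-files=5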

5. Monitor Log Volume and Performance

Keep track of logging overhead:

  • Monitor log volume to prevent storage issues
  • Ensure logging doesn’t impact application performance
  • Use sampling for high-volume debug logs

Troubleshooting Workflows

Application Issues

  1. Start with container logs for the specific pod
  2. Check pod logs if multiple containers are involved
  3. Review kubelet logs if pod lifecycle issues are suspected
  4. Examine system logs for node-level problems
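
Put together as commands, that workflow might look roughly like this (names are placeholders; the journalctl steps run on the node):

kubectl logs my-pod -c my-container --previous   # 1. Container logs, including the last crash
kubectl logs my-pod --all-containers=true        # 2. All containers in the pod
journalctl -u kubelet | grep my-pod              # 3. The kubelet's view of the pod lifecycle
journalctl -k | grep -i oom                      # 4. Node-level kernel issues such as OOM kills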

Cluster-Wide Issues

  1. Begin with API server logs for authentication/authorization problems
  2. Check controller manager logs for resource management issues
  3. Review scheduler logs for pod placement problems
  4. Investigate etcd logs for data consistency issues

Performance Problems

  1. Start with application logs for business logic performance
  2. Check kubelet logs for resource allocation issues
  3. Review system logs for hardware and OS-level constraints
  4. Examine API server logs for request processing bottlenecks

Conclusion

Understanding Kubernetes logs is essential for maintaining healthy, performant clusters. Each log type provides unique insights into different aspects of your system, from individual application behavior to cluster-wide operations. By knowing where to look and what to look for, you can dramatically reduce troubleshooting time and improve your cluster’s reliability.

Remember that effective logging is not just about collecting data—it’s about having the right information available when you need it. Implement centralized logging, use structured formats, and establish clear troubleshooting workflows to make the most of your Kubernetes logging strategy.

The investment in understanding and properly implementing Kubernetes logging will pay dividends in reduced downtime, faster issue resolution, and better overall system visibility. Start with the basics, gradually implement more sophisticated logging practices, and always keep learning as your Kubernetes expertise grows.

Cheers,

Sim