How do you observe and troubleshoot Kubernetes clusters (logs, metrics, events, probes, debugging patterns)?

Let's categorize observability and troubleshooting into three distinct layers: Signal Collection (getting data out of the cluster), Health Automation (how the cluster self-corrects based on signals), and Active Debugging (how you intervene when automation fails).

Effective Kubernetes operations rely on correlating metrics, logs, and events to form a complete picture of cluster health.


1. The Observability Pipeline: Metrics and Logs

Kubernetes does not natively store observability data long term. It generates signals that must be collected and retained by external systems.

Metrics Architecture

Kubernetes splits metrics into two pipelines: the Resource Metrics Pipeline (core) and the Full Metrics Pipeline (monitoring).

  • Resource Metrics (The Core):
    • Component: Metrics Server. It collects resource metrics (CPU and Memory) from the kubelet on each node and exposes them via the metrics.k8s.io API.
    • Purpose: This data drives autoscaling (HPA/VPA) and the kubectl top command. It is lightweight, short-term, and in-memory.
    • Data Flow: cAdvisor (inside kubelet) → kubelet → metrics-server → API Server.
  • Full Metrics (Observability):
    • Components: Tools like Prometheus, Datadog, or Dynatrace.
    • Purpose: Long-term storage, alerting, and deep analysis. These systems access richer metrics from the kubelet's /metrics/resource and /metrics/probes endpoints, or via the Custom Metrics API (custom.metrics.k8s.io) for scaling on application-specific data (e.g., queue depth).
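
As a concrete illustration, here is how the two pipelines are typically consumed from the CLI. This is a minimal sketch: it assumes metrics-server is installed, and the Deployment name "web" is a placeholder.

# Resource Metrics Pipeline: kubectl top reads from the metrics.k8s.io API
# served by metrics-server.
kubectl top nodes
kubectl top pods --containers        # per-container CPU and memory

# The same pipeline drives the HPA. A CPU-based autoscaler for a
# hypothetical Deployment named "web":
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10

# Check which metrics APIs are registered in this cluster
# (metrics.k8s.io, custom.metrics.k8s.io, external.metrics.k8s.io).
kubectl get apiservices | grep metrics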

Logging Architecture

Kubernetes does not provide a native storage solution for log data, so logs should be given a lifecycle independent of the nodes and Pods that generate them.

  • Pod Logs: The container runtime redirects stdout and stderr streams to files (usually in /var/log/pods). The kubelet makes these available via kubectl logs.
  • System Logs: Components like the scheduler or kube-proxy running in containers write to /var/log. Kubelet and container runtime logs are typically found in journald on systemd systems.
  • Collection Patterns:
    1. Node-Level Agent (Recommended): A DaemonSet (e.g., Fluentd, Fluent Bit) runs on every node, mounts the /var/log directory, and ships logs to a central backend (Elasticsearch, Splunk, Loki).
    2. Sidecar Streaming: If an application writes to a file instead of stdout, a sidecar container running tail -f /path/to/log can redirect that file to its own stdout so the node-level agent can pick it up (see the sketch after this list).
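
A minimal sketch of the sidecar streaming pattern, assuming the main application writes to /var/log/app/app.log on a shared emptyDir volume; the image names and the log path are illustrative only:

# The app writes to a file on a shared emptyDir; a sidecar tails that file to
# its own stdout so the node-level agent collects it like any other container log.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  volumes:
    - name: app-logs
      emptyDir: {}
  containers:
    - name: app
      image: my-legacy-app:1.0          # hypothetical image that logs to a file
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-streamer
      image: busybox:1.36
      command: ["sh", "-c", "tail -n +1 -F /var/log/app/app.log"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
EOF

# The file's contents are now visible through the standard logging path:
kubectl logs app-with-log-sidecar -c log-streamer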

2. Health Automation: Probes

Probes are the mechanism by which the cluster monitors specific containers to perform automated self-healing or traffic management.

  • Liveness Probes:
    • Goal: Detect deadlocks where the application is running but unable to make progress.
    • Action: If failed, the kubelet restarts the container.
    • Best Practice: Do not use this for checking external dependencies (like a database); if the DB is down, restarting the web server won't fix it and causes cascading failures.
  • Readiness Probes:
    • Goal: Detect when an application is ready to accept traffic (e.g., after loading configuration or warming caches).
    • Action: If failed, the Pod's IP is removed from the EndpointSlices of matching Services, so it stops receiving traffic; the Pod keeps running and is not restarted.
  • Startup Probes:
    • Goal: Handle slow-starting legacy applications.
    • Action: Disables Liveness and Readiness checks until the Startup probe succeeds. This prevents the kubelet from killing a slow-starting app prematurely.
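
A minimal sketch combining all three probe types on one container. The /healthz and /ready paths, port 8080, and the thresholds are assumptions about a hypothetical application, not Kubernetes defaults.

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: web
      image: my-web-app:1.0             # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:                     # gives a slow starter up to 30 x 10s before other probes run
        httpGet: { path: /healthz, port: 8080 }
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:                    # kubelet restarts the container if this keeps failing
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 10
      readinessProbe:                   # failing pulls the Pod out of Service endpoints; no restart
        httpGet: { path: /ready, port: 8080 }
        periodSeconds: 5
EOF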

3. Troubleshooting Workflows & Patterns

When automation fails, you must switch to active debugging.

A. Analyzing Cluster State (Events)

Events are the first place to look when Pods are not starting. They provide a timeline of what the scheduler and kubelet are doing.

  • Command: kubectl describe pod <pod-name> or kubectl get events.
  • Common Insights:
    • FailedScheduling: Insufficient CPU/Memory or Taint/Toleration mismatches.
    • ImagePullBackOff: Registry authentication failure (e.g., missing imagePullSecrets) or an incorrect image name/tag.
    • Unhealthy: Probe failures.
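
A few event-oriented commands that help in practice; <pod-name> is a placeholder:

# Per-Pod view: the Events section at the bottom of describe shows the
# scheduler and kubelet timeline for this Pod.
kubectl describe pod <pod-name>

# Cluster-wide events, oldest first, so the failure sequence reads top to bottom.
kubectl get events --sort-by=.metadata.creationTimestamp

# Warnings only (FailedScheduling, ImagePullBackOff, Unhealthy, ...), all namespaces.
kubectl get events --field-selector type=Warning -A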

B. Debugging Running Pods

If a Pod is running but behaving incorrectly:

  1. Logs: Use kubectl logs <pod-name> -c <container-name>. For previously crashed containers, add --previous to see why it died.
  2. Exec: Use kubectl exec -it <pod-name> -- /bin/sh to enter the container namespace for file system inspection or network tests.
  3. Ephemeral Containers (Advanced):
    • Problem: Distroless images (images without shells/tools) cannot be debugged via exec.
    • Solution: Use kubectl debug -it <pod-name> --image=busybox --target=<container-name>. This injects a new container with tools into the existing Pod's process namespace, allowing you to debug the running process without restarting it.
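
The same workflow expressed as commands; Pod and container names are placeholders, and the health endpoint probed at the end is purely hypothetical:

# 1. Logs, including the previous instance of a crash-looping container.
kubectl logs <pod-name> -c <container-name> --previous

# 2. A shell inside the container (only possible if the image ships one).
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

# 3. Ephemeral debug container for distroless images. With --target it shares
#    the target container's process namespace, so its processes are visible.
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
# Inside the debug container, for example:
#   ps aux                                # see the target container's processes
#   wget -qO- http://localhost:8080/live  # hypothetical health endpoint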

C. Debugging Nodes

If a node is NotReady or suspect:

  • Node Problem Detector: A DaemonSet that monitors kernel logs and system stats, reporting them as NodeConditions or Events.
  • Crictl: Use the crictl CLI on the node to inspect the container runtime directly (bypassing the kubelet). This is vital if the kubelet is unresponsive.
  • Debug Pod: Use kubectl debug node/<node-name> -it --image=ubuntu to deploy a privileged pod that mounts the node's root filesystem at /host, allowing you to inspect system logs (/host/var/log) and configuration.
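
A sketch of the node-level workflow. The kubectl commands run from your workstation; the crictl commands assume you have SSH access to the node itself:

# Check the node's conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, ...).
kubectl describe node <node-name>

# Privileged debug Pod on the node; the node's root filesystem is mounted at /host.
kubectl debug node/<node-name> -it --image=ubuntu
#   chroot /host              # then operate as if logged in to the node
#   journalctl -u kubelet     # kubelet logs on systemd-based nodes

# Directly on the node, query the container runtime without going through the kubelet.
crictl ps -a                  # all containers, including exited ones
crictl logs <container-id>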

D. Debugging Networking (DNS & Services)

Service discovery issues are common.

  1. Verify DNS: Launch a debug pod and run nslookup kubernetes.default. If this fails, check the CoreDNS Pods in kube-system.
  2. Check Endpoints: A Service needs EndpointSlices to function. Run kubectl get endpointslices -l kubernetes.io/service-name=<svc-name>. If the list is empty, the Service selector does not match any Pod labels.
  3. Kube-Proxy: If endpoints exist but connection times out, check if kube-proxy is running on the nodes and inspect its logs for iptables or IPVS errors.
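
The same three checks as commands. The label selectors shown (k8s-app=kube-dns, k8s-app=kube-proxy) are the common defaults in kubeadm-style clusters and may differ in your distribution:

# 1. DNS: resolve the API server's Service name from a throwaway Pod.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default
# If this fails, inspect CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# 2. Endpoints: an empty list means the Service selector matches no Pod labels.
kubectl get endpointslices -l kubernetes.io/service-name=<svc-name>

# 3. kube-proxy: confirm it is running on every node and check for iptables/IPVS errors.
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
kubectl logs -n kube-system -l k8s-app=kube-proxy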