How do you observe and troubleshoot Kubernetes clusters (logs, metrics, events, probes, debugging patterns)?
Let's categorize observability and troubleshooting into three distinct layers: Signal Collection (getting data out of the cluster), Health Automation (how the cluster self-corrects based on signals), and Active Debugging (how you intervene when automation fails).
Effective Kubernetes operations rely on correlating metrics, logs, and events to form a complete picture of cluster health.
1. The Observability Pipeline: Metrics, Logs, and Traces
Kubernetes does not natively store long-term data. It generates signals that must be collected by external systems.
Metrics Architecture
Kubernetes splits metrics into two pipelines: the Resource Metrics Pipeline (core) and the Full Metrics Pipeline (monitoring).
- Resource Metrics (The Core):
  - Component: Metrics Server. It collects resource metrics (CPU and Memory) from the kubelet on each node and exposes them via the `metrics.k8s.io` API.
  - Purpose: This data drives autoscaling (HPA/VPA) and the `kubectl top` command. It is lightweight, short-term, and in-memory.
  - Data Flow: cAdvisor (inside the kubelet) $\rightarrow$ kubelet $\rightarrow$ metrics-server $\rightarrow$ API Server.
- Full Metrics (Observability):
  - Components: Tools like Prometheus, Datadog, or Dynatrace.
  - Purpose: Long-term storage, alerting, and deep analysis. These systems access richer metrics from the kubelet's `/metrics/resource` and `/metrics/probes` endpoints, or via the Custom Metrics API (`custom.metrics.k8s.io`) for scaling on application-specific data (e.g., queue depth).
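A quick way to confirm the resource metrics pipeline is healthy is to query it directly. A minimal sketch, assuming `metrics-server` is installed (`jq` is optional and only used for readability):

```bash
# Human-readable view served by metrics-server
kubectl top nodes
kubectl top pods -n kube-system --containers

# The same data via the aggregated metrics.k8s.io API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods | jq .
```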
Logging Architecture
Kubernetes does not provide a native storage solution for log data; logs should have a lifecycle independent of the Pods that generate them.
- Pod Logs: The container runtime redirects the `stdout` and `stderr` streams to files (usually in `/var/log/pods`). The kubelet makes these available via `kubectl logs`.
- System Logs: Components like the scheduler or kube-proxy running in containers write to `/var/log`. Kubelet and container runtime logs are typically found in `journald` on systemd systems.
- Collection Patterns:
  - Node-Level Agent (Recommended): A DaemonSet (e.g., Fluentd, Fluent Bit) runs on every node, mounts the `/var/log` directory, and ships logs to a central backend (Elasticsearch, Splunk, Loki).
  - Sidecar Streaming: If an application writes to a file instead of `stdout`, a sidecar container running `tail -f /path/to/log` can redirect that file to its own `stdout` so the node-level agent can pick it up (see the sketch below).
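Here is a minimal sketch of the sidecar streaming pattern. The Pod name, image, and log path are illustrative placeholders, not values from the original text:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-streamer          # hypothetical example Pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # assumed image; writes logs to a file, not stdout
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-streamer                 # sidecar: re-emits the file on its own stdout
    image: busybox:1.36
    command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: app-logs
    emptyDir: {}
```

With this in place, `kubectl logs app-with-log-streamer -c log-streamer` shows the file's contents, and the node-level agent collects them like any other container log.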
2. Health Automation: Probes
Probes are the mechanism by which the cluster monitors specific containers to perform automated self-healing or traffic management.
- Liveness Probes:
  - Goal: Detect deadlocks where the application is running but unable to make progress.
  - Action: If failed, the kubelet restarts the container.
  - Best Practice: Do not use this for checking external dependencies (like a database); if the DB is down, restarting the web server won't fix it and causes cascading failures.
- Readiness Probes:
  - Goal: Detect when an application is ready to accept traffic (e.g., after loading configuration or warming caches).
  - Action: If failed, the Pod is removed from Service load balancers. The Pod remains running.
- Startup Probes:
  - Goal: Handle slow-starting legacy applications.
  - Action: Disables liveness and readiness checks until the startup probe succeeds. This prevents the kubelet from killing a slow-starting app prematurely.
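A combined example helps make the division of labor concrete. This is an illustrative sketch only; the endpoint paths, port, and timings are assumptions, not values from the text above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: registry.example.com/web:1.0   # assumed image
    ports:
    - containerPort: 8080
    startupProbe:                 # gates liveness/readiness while the app boots
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30        # up to 30 * 10s = 5 minutes to start
      periodSeconds: 10
    livenessProbe:                # kubelet restarts the container on failure
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:               # failure removes the Pod from Service endpoints
      httpGet:
        path: /ready              # checks local readiness, not external dependencies
        port: 8080
      periodSeconds: 5
```

Sizing `failureThreshold * periodSeconds` on the startup probe to cover the worst-case boot time is what lets the liveness probe stay aggressive without killing slow starters.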
3. Troubleshooting Workflows & Patterns
When automation fails, you must switch to active debugging.
A. Analyzing Cluster State (Events)
Events are the first place to look when Pods are not starting. They provide a timeline of what the scheduler and kubelet are doing.
- Command: `kubectl describe pod <pod-name>` or `kubectl get events`.
- Common Insights:
  - `FailedScheduling`: Insufficient CPU/Memory or Taint/Toleration mismatches.
  - `ImagePullBackOff`: Authorization failure or incorrect image name.
  - `Unhealthy`: Probe failures.
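A few event-oriented commands worth running first (a sketch; `<pod-name>` is a placeholder):

```bash
# Events recorded for a single Pod
kubectl describe pod <pod-name>

# Cluster-wide event stream, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp

# Only warnings, across all namespaces
kubectl get events --field-selector type=Warning -A
```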
B. Debugging Running Pods
If a Pod is running but behaving incorrectly:
- Logs: Use `kubectl logs <pod-name> -c <container-name>`. For previously crashed containers, add `--previous` to see why it died.
- Exec: Use `kubectl exec -it <pod-name> -- /bin/sh` to enter the container namespace for filesystem inspection or network tests.
- Ephemeral Containers (Advanced):
  - Problem: Distroless images (images without shells/tools) cannot be debugged via `exec`.
  - Solution: Use `kubectl debug -it <pod-name> --image=busybox --target=<container-name>`. This injects a new container with tools into the existing Pod's process namespace, allowing you to debug the running process without restarting it.
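Put together, a typical investigation of a misbehaving Pod might look like this. Names are placeholders, and the `/proc/<pid>/root` tip assumes the ephemeral container shares the target's process namespace (which `--target` requests, subject to runtime support):

```bash
# Why did the container die last time?
kubectl logs <pod-name> -c <container-name> --previous

# Shell into the container (only works if the image ships a shell)
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

# Distroless image: attach an ephemeral debug container instead
kubectl debug -it <pod-name> --image=busybox:1.36 --target=<container-name> -- sh
# Inside the debug container, the target's filesystem is reachable
# via /proc/<pid>/root once you find its PID with `ps`.
```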
C. Debugging Nodes
If a node is NotReady or suspect:
- Node Problem Detector: A DaemonSet that monitors kernel logs and system stats, reporting them as NodeConditions or Events.
- Crictl: Use the `crictl` CLI on the node to inspect the container runtime directly (bypassing the kubelet). This is vital if the kubelet is unresponsive.
- Debug Pod: Use `kubectl debug node/<node-name> -it --image=ubuntu` to deploy a pod in the node's host namespaces that mounts the node's root filesystem at `/host`, allowing you to inspect system logs (`/host/var/log`) and configuration.
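A node-level triage sequence might look like the following; the first two commands run from a workstation with cluster access, the rest on the node itself (the unit name assumes a typical systemd setup):

```bash
# From a workstation with cluster access
kubectl describe node <node-name>                    # conditions, taints, pressure, allocatable
kubectl debug node/<node-name> -it --image=ubuntu    # shell with the node's filesystem at /host

# On the node (or via the /host mount), when the kubelet is the suspect
journalctl -u kubelet --since "30 min ago"
crictl ps -a                                         # containers as the runtime sees them
crictl logs <container-id>
```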
D. Debugging Networking (DNS & Services)
Service discovery issues are common.
- Verify DNS: Launch a debug pod and run `nslookup kubernetes.default`. If this fails, check the CoreDNS Pods in `kube-system`.
- Check Endpoints: A Service needs EndpointSlices to function. Run `kubectl get endpointslices -l kubernetes.io/service-name=<svc-name>`. If the list is empty, the Service selector does not match any Pod labels.
- Kube-Proxy: If endpoints exist but connections time out, check whether `kube-proxy` is running on the nodes and inspect its logs for `iptables` or `IPVS` errors.
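A short service-discovery triage sketch; `<svc-name>` and the label selectors are placeholders, and the CoreDNS/kube-proxy labels assume a standard kubeadm-style deployment:

```bash
# 1. Can Pods resolve cluster DNS?
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local

# 2. Is CoreDNS healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 3. Does the Service actually have endpoints?
kubectl get endpointslices -l kubernetes.io/service-name=<svc-name>
kubectl get svc <svc-name> -o wide          # shows the selector
kubectl get pods --show-labels              # compare labels against the selector

# 4. Endpoints exist but connections hang: check kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
```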