
Top-down troubleshooting methodology

During an active incident, randomly digging through container logs is an inefficient way to find the root cause. Because Kubernetes is a massively distributed system, a failing pod is usually just the symptom, not the underlying issue.

You need a structured, top-down approach. You start at the macroscopic control-plane level, isolate the failing domain, and then drill microscopically into the specific node or container runtime.

Here is a battle-tested sequence of kubectl commands you can use to interrogate a misbehaving cluster.

1. Is the Brain Alive? (Control Plane Failures)

Before you start blaming the application, you have to verify that the Kubernetes API server is actually talking to you. If the control plane is unstable, nothing else matters.

  • kubectl version If the API server is completely unresponsive, this command will immediately time out and return an i/o timeout error. If it returns the server version instantly, your control plane is reachable.
  • kubectl cluster-info dump If the API is responding but the cluster feels generally erratic, use this. It dumps the full cluster state: node details, system pod specs, recent events, and control-plane component logs. Redirect it somewhere with --output-directory rather than flooding your terminal.
  • kubectl get pods -n kube-system Crucially, check the system namespace. Components like etcd and the kube-scheduler run as static pods. If you see them crash-looping here, you're dealing with a fundamental infrastructure failure, not an application bug.
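The checks above can be strung together into a quick triage sketch. The kubectl commands and flags (`--request-timeout`, `--no-headers`) are standard; the function name, the 5-second timeout, and the awk filter are our own choices, not part of kubectl.

```shell
# Quick control-plane triage (sketch). Assumes kubectl is installed and your
# kubeconfig points at the suspect cluster.
control_plane_triage() {
  # A short timeout turns a hung API server into a fast, explicit failure.
  if ! kubectl version --request-timeout=5s >/dev/null 2>&1; then
    echo "API server unreachable: investigate the control plane first"
    return 1
  fi
  echo "API server reachable"

  # Surface any system pod that is not cleanly Running (e.g. CrashLoopBackOff).
  # STATUS is the third column of `kubectl get pods --no-headers`.
  kubectl get pods -n kube-system --no-headers 2>/dev/null \
    | awk '$3 != "Running" && $3 != "Completed"'
}

control_plane_triage || echo "stopping: the control plane itself is the problem"
```

If the first check fails, everything below it is noise; fix the control plane before reading another log line.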

2. Is Compute Healthy? (Worker Nodes and Partitions)

If the API server is good, you check the worker nodes next. Has an underlying compute instance failed? Is a node suffering a silent network partition?

  • kubectl get nodes This is your holistic capacity check. Look directly at the STATUS column: if a node has stopped sending heartbeats, it will report NotReady or Unknown.
  • kubectl describe node <node-name> Once you spot a struggling node, describe it to find out why. Scroll straight to the Conditions block. Have things like NetworkUnavailable, MemoryPressure, or DiskPressure transitioned to True? The chronologically ordered Events section at the bottom records exactly when the node began to fail.
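Scripted, the node sweep looks like this. The kubectl invocations are standard; the filter function and the sample output below are illustrative, not from a real cluster.

```shell
# Flag nodes whose heartbeat has dropped (sketch). STATUS is the second
# column of `kubectl get nodes --no-headers`.
flag_unhealthy_nodes() {
  awk '$2 != "Ready" {print $1 ": " $2}'
}

# Live usage (requires a reachable cluster):
#   kubectl get nodes --no-headers | flag_unhealthy_nodes
#   kubectl describe node <node-name>    # then read Conditions and Events

# The same filter run against captured sample output:
printf '%s\n' \
  'worker-1   Ready      <none>   12d   v1.31.0' \
  'worker-2   NotReady   <none>   12d   v1.31.0' \
  | flag_unhealthy_nodes
# prints: worker-2: NotReady
```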

3. Are We Starving? (OOMs and Evictions)

Usually, infrastructure is fine, but greedy applications are choking the node to death. You need to verify if the node is actively starving, or if the Linux kernel has already stepped in.

  • kubectl top nodes and kubectl top pods These query the Metrics API (so they need the metrics-server add-on installed). Run them to instantly find the "noisy neighbor" workloads currently monopolizing CPU and memory.
  • kubectl describe pod <pod-name> If a pod suddenly vanished and respawned, look at the Last State block. If you see Reason: OOMKilled and an exit code of 137, your application ran out of memory and the Linux kernel forcefully terminated it.
  • kubectl get events --all-namespaces Filtering global events will instantly flag indicators like FailedScheduling (usually due to "Insufficient cpu") or Evicted events, where the kubelet forcefully terminated workloads to stabilize the node.
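Exit codes follow the standard Linux 128 + signal convention, so a small decoder saves mental arithmetic mid-incident. A sketch: the helper name is ours, while the jsonpath expression is the standard pod status path.

```shell
# Decode a dead container's exit code (sketch). 128 + signal number is the
# standard Linux convention; the helper name is our own.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL (128+9): OOMKilled or forced termination" ;;
    143) echo "SIGTERM (128+15): graceful shutdown requested" ;;
    *)   echo "application error, exit code $1" ;;
  esac
}

# Live usage: read the code straight from the dead container's Last State:
#   kubectl get pod <pod-name> -o \
#     jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
explain_exit_code 137
# prints: SIGKILL (128+9): OOMKilled or forced termination
```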

4. Deep Node Inspection (Bypassing SSH)

Sometimes a node is heavily degraded but still partially talking to the control plane. In modern locked-down environments, you often have no direct SSH access to your worker nodes. Thankfully, Kubernetes provides native workarounds.

  • kubectl get --raw "/api/v1/nodes/<node-name>/proxy/logs/?query=kubelet" This is a massive time-saver. If the node proxy is still alive, this internal API call fetches the host's raw systemd journal for the kubelet service and streams it straight to your terminal, zero SSH required. (It depends on the NodeLogQuery feature being enabled on the kubelet.)
  • kubectl debug node/<node-name> -it --image=ubuntu If you desperately need a shell, this deploys an interactive pod directly onto the broken node. Most importantly, it mounts the host's root filesystem at /host. You can cat /host/var/log/containerd.log with standard Linux tools from inside your ephemeral container.
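Once you have the kubelet journal, a targeted grep beats scrolling. A sketch: the grep pattern is our assumption about which keywords matter, and the journal lines below are fabricated for illustration.

```shell
# Scan kubelet logs for starvation signals (sketch). The keyword pattern is
# an assumption, not an exhaustive list.
scan_kubelet_logs() {
  grep -Ei 'evict|pressure|oom'
}

# Live usage (NodeLogQuery enabled on the kubelet):
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/logs/?query=kubelet" \
#     | scan_kubelet_logs

# The same filter against fabricated sample journal output:
printf '%s\n' \
  'I0101 10:00:01 kubelet: image garbage collection succeeded' \
  'W0101 10:00:02 kubelet: eviction manager: attempting to reclaim memory' \
  | scan_kubelet_logs
# prints the eviction-manager line only
```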

5. Digging into the Workload

If the infrastructure, nodes, and kubelets are pristine, the application itself is failing (the classic CrashLoopBackOff).

  • kubectl logs <pod-name> -c <container-name> --previous The current logs of a crash-looping pod often tell you nothing, because the freshly restarted process has barely begun. --previous pulls the log buffer from the final moments of the previous, dead container instance, showing you the stack trace that killed it.
  • kubectl debug <pod-name> -it --image=busybox --target=<container-name> Modern "distroless" images are incredibly secure but difficult to debug because they lack tools like sh or curl. This command resolves that: it injects a fresh busybox container directly into the failing pod's running process namespace. You get a fully interactive shell where you can run ps or netstat directly against the application without modifying its locked-down image.
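The two workload commands combine naturally into one helper. A sketch: the flags are standard kubectl, but the function name and the 50-line tail depth are our own choices, and the pod and container names are placeholders.

```shell
# Crash-loop triage helper (sketch). Flags are standard kubectl; the
# function name and --tail depth are our own choices.
crashloop_triage() {  # usage: crashloop_triage <pod> <container>
  pod=$1; ctr=$2
  # Final output of the *previous*, dead container instance:
  kubectl logs "$pod" -c "$ctr" --previous --tail=50
  # Interactive shell sharing the target container's process namespace
  # (works even against distroless images that ship no sh of their own):
  kubectl debug "$pod" -it --image=busybox --target="$ctr"
}
```

From the injected busybox shell you can run ps or netstat against the application's processes without ever modifying its locked-down image.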

Based on Kubernetes v1.35 (Timbernetes).