Top-down troubleshooting methodology
During an active incident, randomly digging through container logs is an inefficient way to find the root cause. Because Kubernetes is a massively distributed system, a failing pod is usually just the symptom, not the underlying issue.
You need a structured, top-down approach. You start at the macroscopic control-plane level, isolate the failing domain, and then drill microscopically into the specific node or container runtime.
Here is a battle-tested sequence of kubectl commands you can use to interrogate a misbehaving cluster.
1. Is the Brain Alive? (Control Plane Failures)
Before you start blaming the application, you have to verify that the Kubernetes API server is actually talking to you. If the control plane is unstable, nothing else matters.
- `kubectl version`: If the API server is completely unresponsive, this command will immediately time out and return an `i/o timeout` error. If it returns the server version instantly, your control plane is reachable.
- `kubectl cluster-info dump`: If the API is responding but the cluster feels generally erratic, use this. It spits out a massive dump of cluster-wide state, including events and the logs of core system components.
- `kubectl get pods -n kube-system`: Crucially, check the system namespace. Components like `etcd` and the `kube-scheduler` run as static pods. If you see them crash-looping here, you're dealing with a fundamental infrastructure failure, not an application bug.
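During an incident you rarely read `kubectl get pods -n kube-system` output line by line; you filter it. A minimal sketch of that triage, run against saved sample output (all pod names and statuses below are illustrative, not from a real cluster):

```shell
# Saved output of `kubectl get pods -n kube-system` (illustrative sample data):
cat <<'EOF' > kube-system.txt
NAME                          READY   STATUS             RESTARTS   AGE
etcd-cp-1                     1/1     Running            0          90d
kube-apiserver-cp-1           1/1     Running            0          90d
kube-scheduler-cp-1           0/1     CrashLoopBackOff   12         45m
EOF

# Print any system pod whose STATUS is neither Running nor Completed:
awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }' kube-system.txt
```

Against a live cluster you would pipe `kubectl get pods -n kube-system --no-headers` straight into the same awk filter instead of reading from a file.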
2. Is Compute Healthy? (Worker Nodes and Partitions)
If the API server is good, you check the worker nodes next. Has an underlying compute instance failed? Is a node suffering a silent network partition?
- `kubectl get nodes`: This is your holistic capacity check. Look directly at the `STATUS` column: if a node has dropped its heartbeat, it will explicitly broadcast `NotReady` or `Unknown`.
- `kubectl describe node <node-name>`: Once you spot a struggling node, describe it to find out why. Scroll straight to the `Conditions` block. Have conditions like `NetworkUnavailable`, `MemoryPressure`, or `DiskPressure` transitioned to `True`? The chronologically ordered `Events` section at the bottom pinpoints the exact moment the machine started failing.
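The `STATUS` scan above is easy to script. A minimal sketch against saved sample output (node names and versions are illustrative):

```shell
# Saved output of `kubectl get nodes` (illustrative sample data):
cat <<'EOF' > nodes.txt
NAME     STATUS     ROLES    AGE   VERSION
node-a   Ready      <none>   90d   v1.29.3
node-b   NotReady   <none>   90d   v1.29.3
EOF

# Flag any node that has dropped out of Ready:
awk 'NR > 1 && $2 != "Ready" { print $1, "is", $2 }' nodes.txt
```

In a live cluster, each node this prints is a candidate for the `kubectl describe node` drill-down.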
3. Are We Starving? (OOMs and Evictions)
Usually, infrastructure is fine, but greedy applications are choking the node to death. You need to verify if the node is actively starving, or if the Linux kernel has already stepped in.
- `kubectl top nodes` and `kubectl top pods`: These query the Metrics API directly. Run them to instantly find the "noisy neighbor" workloads currently monopolizing CPU and memory.
- `kubectl describe pod <pod-name>`: If a pod suddenly vanished and respawned, look at the `Last State` block. If you see `Reason: OOMKilled` and an exit code of `137`, your application ran out of memory and the Linux kernel forcefully terminated it.
- `kubectl get events --all-namespaces`: This is highly valuable. Filtering through global events will instantly flag indicators like `FailedScheduling` (usually due to "Insufficient cpu") or `Evicted` events where the kubelet forcefully terminated workloads to stabilize the node.
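That `137` isn't arbitrary: container exit codes above 128 encode a fatal signal as `128 + signal number`, and SIGKILL, which the kernel OOM killer sends, is signal 9. A hypothetical helper sketching how to read the code shown under `Last State` (the mapping follows the standard 128+signal convention; the wording of each message is my own):

```shell
# Translate common container exit codes into likely causes.
# Codes > 128 follow the 128 + signal-number convention.
explain_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error" ;;
    137) echo "SIGKILL (128+9): OOMKilled or forced termination" ;;
    139) echo "SIGSEGV (128+11): segmentation fault" ;;
    143) echo "SIGTERM (128+15): graceful shutdown" ;;
    *)   echo "unknown; check the container logs" ;;
  esac
}

explain_exit 137
```

So an exit code of `143` after a rolling update is normal, while `137` outside of an eviction almost always means the OOM killer fired.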
4. Deep Node Inspection (Bypassing SSH)
Sometimes a node is heavily degraded but still partially talking to the control plane. In modern locked-down environments, you often don't have direct SSH access to your worker nodes. Thankfully, Kubernetes provides native hacks for this.
- `kubectl get --raw "/api/v1/nodes/<node-name>/proxy/logs/?query=kubelet"`: This is a massive time-saver. If the node proxy is still alive, this internal API call fetches the host's raw `systemd` journal logs for the kubelet service and streams them straight to your terminal, zero SSH required. Note that depending on your version, this endpoint requires the `NodeLogQuery` feature gate to be enabled on the kubelet.
- `kubectl debug node/<node-name> -it --image=ubuntu`: If you desperately need a shell, this deploys an interactive debugging pod directly onto the broken node. Most importantly, it mounts the actual host's filesystem at `/host`. You can literally `cat /host/var/log/containerd.log` using standard Linux tools from inside your ephemeral container.
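Once you have the kubelet journal on your terminal, redirect it to a file and triage it with standard tools. A minimal sketch over an illustrative saved journal (the log lines and node name below are sample data, not real kubelet output):

```shell
# Saved kubelet journal, e.g. from redirecting the node log query to a file
# (illustrative sample data):
cat <<'EOF' > kubelet-journal.txt
May 01 10:00:01 node-b kubelet[811]: Eviction manager: attempting to reclaim memory
May 01 10:00:02 node-b kubelet[811]: Eviction manager: must evict pod(s) to reclaim memory
May 01 10:00:05 node-b kubelet[811]: Failed to pull image "registry.local/app:latest"
EOF

# Count eviction-related lines to confirm the kubelet was shedding load:
grep -ci 'evict' kubelet-journal.txt
```

A non-zero count here corroborates the `Evicted` events you saw at the cluster level in step 3.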
5. Digging into the Workload
If the infrastructure, nodes, and kubelets are pristine, the application itself is failing (the classic CrashLoopBackOff).
- `kubectl logs <pod-name> -c <container-name> --previous`: The current logs of a crash-looping pod are often useless because the freshly restarted process hasn't done anything yet. Using `--previous` pulls the exact log buffer from the final moments of the dead container, showing you the stack trace that killed it.
- `kubectl debug <pod-name> -it --image=busybox --target=<container-name>`: Modern "distroless" images are incredibly secure but difficult to debug because they lack tools like `sh` or `curl`. This command resolves that: it injects a fresh `busybox` container directly into the failing pod's running process namespace. You get a fully interactive shell where you can run `ps` or `netstat` directly against the application without modifying its locked-down image.
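In practice you capture the `--previous` output to a file and jump straight to the fatal line rather than scrolling. A minimal sketch against an illustrative saved capture (log content is sample data):

```shell
# Saved output of `kubectl logs <pod-name> --previous` (illustrative sample data):
cat <<'EOF' > previous.log
INFO  starting server on :8080
INFO  connecting to database at db:5432
FATAL dial tcp: connect: connection refused
EOF

# Jump straight to the line that killed the container:
grep '^FATAL' previous.log
```

Here the container died because a dependency was unreachable, which shifts the investigation from the pod itself to the database and its Service.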