What happens when a Pod or Node fails, and how does Kubernetes detect and recover from failures?
Let's break down the failure-handling mechanisms in Kubernetes.
Kubernetes operates on a declarative model: you describe a desired state, and the system continuously compares it against the current state of the cluster, using control loops to reconcile any differences. When a failure occurs, whether at the container, Pod, or Node level, it is treated as a deviation from the desired state and triggers specific self-healing workflows.
Here is the technical breakdown of how Kubernetes detects and recovers from failures at different layers of the stack.
1. Container and Pod Failures
At the workload level, the primary agents of recovery are the kubelet (running on each node) for local issues and Controllers (running in the Control Plane) for total Pod loss.
Detection Mechanisms
- Process Exit Codes: The kubelet monitors the lifecycle of containers. If a container's main process exits with a non-zero status code, it is considered a failure.
- Liveness Probes: Process checks are insufficient for deadlocks (where the app is running but stuck). Liveness probes allow the kubelet to periodically check application health via HTTP, TCP, or gRPC. If the probe fails, the kubelet kills the container.
- Startup Probes: For slow-starting legacy applications, a startup probe holds off the other probes until the application has fully initialized. If it fails, the container is killed and restarted. (Both probe types are shown in the sketch below.)
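Here is a minimal sketch of both probe types on a single Pod. The Pod name, image, port, and `/healthz` path are placeholders for whatever your application actually exposes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                                # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/legacy-app:1.0    # placeholder image
    ports:
    - containerPort: 8080
    # Startup probe: gives a slow-starting app up to 30 x 10s = 300s to come up.
    # The liveness probe is held off until this succeeds.
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    # Liveness probe: checked every 10s; after 3 consecutive failures the
    # kubelet kills the container, which is then handled by the restartPolicy.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```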
Recovery Workflow
- Restart Policy: When a container fails, the kubelet checks the Pod's `restartPolicy` (default: `Always`).
  - If set to `Always` or `OnFailure`, the kubelet restarts the container on the same node.
  - If set to `Never`, the container is not restarted and the Pod eventually ends up in the `Failed` phase.
- Exponential Backoff: To prevent a failing container from overwhelming the node, the kubelet delays restarts with an exponential backoff (10s, 20s, 40s, ...) capped at 5 minutes (300 seconds). This state is visible as `CrashLoopBackOff`.
- Controller Replacement: If the Pod cannot recover or is deleted, the higher-level Workload Controller (e.g., Deployment, ReplicaSet) observes that the number of running replicas is lower than the desired `replicas` count and creates a completely new Pod to replace it (see the Deployment sketch below).
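A minimal Deployment sketch tying these pieces together (name and image are placeholders). The Pod template's `restartPolicy` defaults to `Always`, so crashed containers are restarted in place by the kubelet, while the ReplicaSet recreates any Pod that disappears entirely:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # hypothetical name
spec:
  replicas: 3                  # desired state; the ReplicaSet keeps 3 Pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # restartPolicy defaults to Always (the only value a Deployment allows),
      # so the kubelet restarts crashed containers on the same node, with backoff.
      containers:
      - name: web
        image: registry.example.com/web:1.0   # placeholder image
```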
2. Node Failures
Node failures are handled by the Control Plane, specifically the Node Controller and the Pod Garbage Collector.
Detection Mechanisms
- Heartbeats (Leases): Nodes send periodic heartbeats to the API server via `Lease` objects in the `kube-node-lease` namespace. This proves the node is still available.
- Controller Monitoring: The Node Controller checks these heartbeats. If a node stops sending updates, the controller sets the node's `Ready` condition in `.status` to `Unknown` (an illustrative excerpt follows this list).
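For illustration, this is roughly what the `Ready` condition looks like in a Node's status once heartbeats stop; the timestamps and wording are representative, not taken from a real cluster:

```yaml
# Excerpt from `kubectl get node <node-name> -o yaml` (illustrative values)
status:
  conditions:
  - type: Ready
    status: "Unknown"                  # set by the Node Controller when heartbeats stop
    reason: NodeStatusUnknown
    message: Kubelet stopped posting node status.
    lastHeartbeatTime: "2024-05-01T10:00:00Z"
    lastTransitionTime: "2024-05-01T10:05:00Z"
```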
Recovery Workflow
- Tainting: When a node becomes unhealthy, the Node Controller applies taints to it, such as `node.kubernetes.io/unreachable` or `node.kubernetes.io/not-ready`.
- Toleration Window: Pods typically carry a default toleration for these taints of 300 seconds (5 minutes). This prevents massive rescheduling storms during minor network blips (the default tolerations are sketched after this list).
- Eviction: If the node remains unreachable after the toleration expires, the `NoExecute` taint causes the control plane to evict (delete) the Pods on that node.
- Rescheduling: The Workload Controller (e.g., Deployment) sees the Pods are terminating/gone. It creates new replacement Pods. The kube-scheduler then places these new Pods onto healthy nodes.
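These tolerations are injected into Pods by the DefaultTolerationSeconds admission plugin; you can set `tolerationSeconds` in your own spec to shorten or lengthen the window. A sketch of how they appear on a Pod:

```yaml
# Default tolerations as they appear on a Pod (override tolerationSeconds to tune the window)
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300      # Pod stays bound for 5 minutes after the taint appears
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```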
Important Distinction for StatefulSets: StatefulSets require "at most one" semantics to prevent data corruption (split-brain). If a node fails, Kubernetes will not automatically reschedule StatefulSet Pods, because it cannot confirm the old Pod is truly dead (the node might just be partitioned). You must force-delete the Pod or apply the `node.kubernetes.io/out-of-service` taint to the node to trigger recovery (sketched below).
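A sketch of the out-of-service taint as it appears on the Node object; the value shown follows upstream examples, and the taint can also be applied with `kubectl taint`:

```yaml
# Applied with, for example:
#   kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown        # value used in upstream examples
    effect: NoExecute
```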
3. Resource Starvation (Node-Pressure Eviction)
Sometimes a node is "healthy" (heartbeating) but has run out of resources (Memory, Disk, or PIDs).
Detection and Recovery
- Eviction Signals: The kubelet monitors signals such as `memory.available` and `nodefs.available`. If these drop below configured soft or hard eviction thresholds, the node enters a pressure state.
- Ranking: The kubelet proactively terminates Pods to reclaim resources. It ranks Pods for eviction based on their Quality of Service (QoS) class (example `resources` blocks for each class follow this list):
- BestEffort (No requests/limits) are evicted first.
- Burstable (Usage exceeds requests) are evicted next.
- Guaranteed (Usage within limits) are evicted last.
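For reference, sketches of the container `resources` blocks that place a Pod in each QoS class (the sizes are arbitrary placeholders):

```yaml
# Guaranteed: every container sets limits equal to requests (evicted last)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi
---
# Burstable: requests set lower than limits (or only some values set)
resources:
  requests:
    memory: 128Mi
  limits:
    memory: 256Mi
---
# BestEffort: no requests or limits at all; first in line for eviction
resources: {}
```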
Summary Table
| Failure Type | Detected By | Primary Recovery Action |
|---|---|---|
| Container Crash | Kubelet (Exit Code) | Kubelet restarts container (Subject to Backoff). |
| App Deadlock | Kubelet (Liveness Probe) | Kubelet kills and restarts container. |
| Node Offline | Node Controller (Heartbeat) | Controller taints node; Pods evicted after 5 min; Replicas recreated on other nodes. |
| Resource Exhaustion | Kubelet (Thresholds) | Kubelet evicts lower QoS Pods to reclaim resources. |