Taints and Tolerations
How do Taints and Tolerations work at the API level, and what triggers a NoExecute eviction?
In Kubernetes, while node affinity acts as a property that attracts Pods to specific nodes, taints and tolerations work in the exact opposite manner: they allow a node to proactively repel a set of Pods.
To deeply understand this architecture, you must look at how the control plane evaluates these rules at the API level, the strict semantic differences between the three taint effects, and the internal controllers responsible for executing evictions.
The API Mechanics of Taints and Tolerations
At the API level, a taint is part of the Node object's specification (`.spec.taints`), while a toleration is an attribute defined within a PodSpec (`.spec.tolerations`).
When the Kubernetes scheduler evaluates where to place a Pod, it processes taints and tolerations like a filter. The system looks at all the taints applied to a candidate node and then ignores any taints for which the incoming Pod has a matching toleration. If any un-ignored taints remain, the scheduler applies the designated effect of those remaining taints to the Pod.
For a toleration to successfully "match" a taint, it must align across several API fields:
- Key and Effect: The `key` and `effect` of the toleration must match the taint. An empty `effect` in the toleration acts as a wildcard, matching all effects for the specified key.
- Operator: The `operator` dictates how the `value` is evaluated. The default is `Equal`, meaning the values must be strictly identical. If the operator is set to `Exists`, the `value` field must be left empty, and the toleration will match any value for that key.
- Numeric Comparisons: Note that `Gt` (greater than) and `Lt` (less than) are node affinity operators, not toleration operators; a toleration's `operator` field accepts only `Equal` and `Exists`. Threshold-based placement, such as SLA tiers encoded as integer labels, is expressed with node affinity `matchExpressions` instead, where both compared values must be valid integers.
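The filter described above can be sketched in Python, using plain dicts in place of the real API types (the actual implementation lives in Kubernetes' Go component helpers, so this is an illustrative simplification, not the scheduler's source):

```python
# A minimal sketch of the scheduler's taint/toleration filter,
# using dicts with "key", "value", "operator", and "effect" fields.

def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect in the toleration is a wildcard over all effects.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Default operator is Equal: key and value must both match exactly.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))

def unignored_taints(node_taints, pod_tolerations):
    """Keep only the taints that no toleration on the Pod matches.

    The effects of whatever remains (NoSchedule, PreferNoSchedule,
    NoExecute) are then applied to the Pod.
    """
    return [taint for taint in node_taints
            if not any(tolerates(tol, taint) for tol in pod_tolerations)]
```

If `unignored_taints` returns an empty list, the node is feasible for the Pod as far as taints are concerned.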
The Three Taint Effects
The effect field of a taint explicitly dictates how the control plane will treat a Pod that does not tolerate the taint. These effects are divided between scheduling-time restrictions and execution-time (runtime) restrictions.
1. NoSchedule
This is a strict scheduling-time constraint enforced entirely by the kube-scheduler. If a node has at least one un-ignored NoSchedule taint, the scheduler will outright refuse to place new Pods onto that node.
Crucially, NoSchedule has absolutely no impact on Pods that are already running on the node. If you apply a NoSchedule taint to a node, existing Pods that do not tolerate the taint will continue running undisturbed.
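A quick way to see this behavior, assuming a hypothetical node named node1 and an arbitrary key/value pair:

```shell
# Block new Pods that lack a matching toleration; existing Pods on
# node1 keep running because NoSchedule is scheduling-time only.
kubectl taint nodes node1 workload=batch:NoSchedule

# Remove the taint later by appending "-" to the same specification.
kubectl taint nodes node1 workload=batch:NoSchedule-
```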
2. PreferNoSchedule
This is a "soft" or "preference" version of NoSchedule, also enforced by the scheduler. If a node has an un-ignored PreferNoSchedule taint, the control plane will try to avoid placing the Pod on that node, but it is not a guaranteed restriction. If the cluster is highly utilized and no other feasible, untainted nodes exist, the scheduler will still place the Pod on the tainted node.
3. NoExecute
Unlike the previous two effects, NoExecute is an execution-time constraint enforced by the control plane rather than the scheduler: taint-based evictions were historically handled by the node lifecycle controller and, in recent releases, by a dedicated taint-eviction-controller. It aggressively affects Pods that are already running on the node.
If a NoExecute taint is applied to a node, the following execution rules apply:
- Pods that do not tolerate the taint are evicted immediately.
- Pods that tolerate the taint and do not specify a `tolerationSeconds` field remain bound to the node indefinitely.
- Pods that tolerate the taint but include a `tolerationSeconds` duration (e.g., `3600`) stay bound to the node only for that specified amount of time. Once that timer expires, the node lifecycle controller evicts the Pod.
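The timed case looks like this in a Pod spec, assuming a hypothetical maintenance taint (key and value are illustrative):

```yaml
# Tolerate a NoExecute maintenance taint for one hour; if the taint
# is still present after 3600 seconds, the Pod is evicted.
apiVersion: v1
kind: Pod
metadata:
  name: grace-period-demo
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 3600
```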
What Triggers an Eviction?
Eviction via taints is strictly triggered by the presence of a NoExecute taint on a node. This can happen in two primary ways: manual administration and automated taint-based evictions.
1. Manual Administration
Cluster operators can intentionally trigger evictions by applying a NoExecute taint using the `kubectl taint nodes` command. This is frequently done to safely drain a node for maintenance, or to enforce "Dedicated Nodes".
For example, if an administrator wants to reserve a set of expensive GPU nodes exclusively for a specific data science team, they can taint those nodes with `dedicated=data-science:NoExecute`. Any existing unauthorized Pods will be immediately evicted, and only Pods with the corresponding toleration can remain or be scheduled there.
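Concretely, for a hypothetical node gpu-node-1, the dedicated-nodes pattern is two pieces: the taint on the node and the matching toleration on the team's Pods.

```shell
# Reserve the node: Pods without the toleration are evicted
# immediately and can no longer be scheduled here.
kubectl taint nodes gpu-node-1 dedicated=data-science:NoExecute
```

```yaml
# Fragment of the team's Pod spec: the toleration that permits
# running on the reserved node (no tolerationSeconds, so the Pod
# stays bound indefinitely).
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "data-science"
  effect: "NoExecute"
```

Note that the toleration only allows placement; to actively attract these Pods to the GPU nodes, you would pair it with node affinity.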
2. Automated Taint-Based Evictions
The Kubernetes node controller automatically monitors the health of nodes and translates specific hardware or network conditions into taints, triggering automated evictions.
- Node Unreachable: If the control plane loses contact with the `kubelet`, the node's `Ready` condition changes to `Unknown`. The node controller automatically applies the `node.kubernetes.io/unreachable:NoExecute` taint.
- Node Not Ready: If the node is reporting unhealthy statuses, the condition changes to `False`, and the controller applies the `node.kubernetes.io/not-ready:NoExecute` taint.
- Resource Pressure: Conditions like `MemoryPressure`, `DiskPressure`, and `PIDPressure` typically trigger `NoSchedule` taints to prevent new workloads from overwhelming the node, but do not directly trigger `NoExecute` taints. (Actual runtime eviction for resource starvation is handled separately by the `kubelet`'s node-pressure eviction mechanisms, which rely on local cgroup metrics rather than API taints.)
Real-World Protection Mechanisms: To prevent catastrophic cascading failures during brief network partitions, Kubernetes automatically injects default tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable into every new Pod, with a tolerationSeconds value of 300. This means that if a switch goes down and a node becomes unreachable, the automated NoExecute taint is applied, but the Pods will wait for 5 minutes before actually being evicted, giving the network time to recover.
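The automatically injected defaults are equivalent to adding the following fragment to a Pod that does not set these tolerations itself (values shown for the default 300-second grace window):

```yaml
# Default tolerations injected into new Pods: tolerate node failure
# taints for 5 minutes before taint-based eviction kicks in.
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```

Setting your own toleration for either key overrides the injected default, which is how workloads opt into faster or slower failover.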
Furthermore, foundational system workloads like DaemonSets are automatically granted NoExecute tolerations without a tolerationSeconds limit for these unreachability taints, ensuring that critical daemons like network plugins are never evicted during node faults.