Taints and Tolerations
How do Taints and Tolerations work at the API level, and what triggers a NoExecute eviction?
In Kubernetes, while node affinity acts as a property that attracts Pods to specific nodes, taints and tolerations work in the exact opposite manner: they allow a node to proactively repel a set of Pods.
To deeply understand this architecture, you must look at how the control plane evaluates these rules at the API level, the strict semantic differences between the three taint effects, and the internal controllers responsible for executing evictions.
The API Mechanics of Taints and Tolerations
At the API level, a taint is part of the Node object's specification (`.spec.taints`), while a toleration is an attribute defined within a PodSpec (`.spec.tolerations`).
When the Kubernetes scheduler evaluates where to place a Pod, it processes taints and tolerations like a filter. The system looks at all the taints applied to a candidate node and then ignores any taints for which the incoming Pod has a matching toleration. If any un-ignored taints remain, the scheduler applies the designated effect of those remaining taints to the Pod.
For a toleration to successfully "match" a taint, it must align across several API fields:
- Key and Effect: The `key` and `effect` of the toleration must match the taint. An empty `effect` in the toleration acts as a wildcard, matching all effects for the specified key.
- Operator: The `operator` dictates how the `value` is evaluated. The default is `Equal`, meaning the values must be strictly identical. If the operator is set to `Exists`, the `value` field must be left empty, and the toleration will match any value for that key.
- Numeric Comparisons: Note that `Gt` (greater than) and `Lt` (less than) are node affinity operators, not toleration operators; a toleration's `operator` field accepts only `Equal` and `Exists`. Threshold-based placement, such as SLA tiers encoded as integer labels, is expressed with node affinity `matchExpressions` instead, where both compared values must be valid integers.
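The filter described above can be sketched in Python, using plain dicts in place of the real API types (the actual implementation lives in Kubernetes' Go component helpers, so this is an illustrative simplification, not the scheduler's source):

```python
# A minimal sketch of the scheduler's taint/toleration filter,
# using dicts with "key", "value", "operator", and "effect" fields.

def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect in the toleration is a wildcard over all effects.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Default operator is Equal: key and value must both match exactly.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))

def unignored_taints(node_taints, pod_tolerations):
    """Keep only the taints that no toleration on the Pod matches.

    The effects of whatever remains (NoSchedule, PreferNoSchedule,
    NoExecute) are then applied to the Pod.
    """
    return [taint for taint in node_taints
            if not any(tolerates(tol, taint) for tol in pod_tolerations)]
```

If `unignored_taints` returns an empty list, the node is feasible for the Pod as far as taints are concerned.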
The Three Taint Effects
The effect field of a taint explicitly dictates how the control plane will treat a Pod that does not tolerate the taint. These effects are divided between scheduling-time restrictions and execution-time (runtime) restrictions.
1. NoSchedule
This is a strict scheduling-time constraint enforced entirely by the kube-scheduler. If a node has at least one un-ignored NoSchedule taint, the scheduler will outright refuse to place new Pods onto that node.
Crucially, NoSchedule has absolutely no impact on Pods that are already running on the node. If you apply a NoSchedule taint to a node, existing Pods that do not tolerate the taint will continue running undisturbed.
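A quick way to see this behavior, assuming a hypothetical node named node1 and an arbitrary key/value pair:

```shell
# Block new Pods that lack a matching toleration; existing Pods on
# node1 keep running because NoSchedule is scheduling-time only.
kubectl taint nodes node1 workload=batch:NoSchedule

# Remove the taint later by appending "-" to the same specification.
kubectl taint nodes node1 workload=batch:NoSchedule-
```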
2. PreferNoSchedule
This is a "soft" or "preference" version of NoSchedule, also enforced by the scheduler. If a node has an un-ignored PreferNoSchedule taint, the control plane will try to avoid placing the Pod on that node, but it is not a guaranteed restriction. If the cluster is highly utilized and no other feasible, untainted nodes exist, the scheduler will still place the Pod on the tainted node.
3. NoExecute
Unlike the previous two effects, NoExecute is an execution-time constraint enforced by the control plane rather than the scheduler: taint-based evictions were historically handled by the node lifecycle controller and, in recent releases, by a dedicated taint-eviction-controller. It aggressively affects Pods that are already running on the node.
If a NoExecute taint is applied to a node, the following execution rules apply:
- Pods that do not tolerate the taint are evicted immediately.
- Pods that tolerate the taint and do not specify a `tolerationSeconds` field remain bound to the node indefinitely.
- Pods that tolerate the taint but include a `tolerationSeconds` duration (e.g., `3600`) stay bound to the node only for that specified amount of time. Once that timer expires, the node lifecycle controller evicts the Pod.
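The timed case looks like this in a Pod spec, assuming a hypothetical maintenance taint (key and value are illustrative):

```yaml
# Tolerate a NoExecute maintenance taint for one hour; if the taint
# is still present after 3600 seconds, the Pod is evicted.
apiVersion: v1
kind: Pod
metadata:
  name: grace-period-demo
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 3600
```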
What Triggers an Eviction?
Eviction via taints is strictly triggered by the presence of a NoExecute taint on a node. This can happen in two primary ways: manual administration and automated taint-based evictions.
1. Manual Administration
Cluster operators can intentionally trigger evictions by applying a NoExecute taint using the `kubectl taint nodes` command. This is frequently done to safely drain a node for maintenance, or to enforce "Dedicated Nodes".
For example, if an administrator wants to reserve a set of expensive GPU nodes exclusively for a specific data science team, they can taint those nodes with `dedicated=data-science:NoExecute`. Any existing unauthorized Pods will be immediately evicted, and only Pods with the corresponding toleration can remain or be scheduled there.
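Concretely, for a hypothetical node gpu-node-1, the dedicated-nodes pattern is two pieces: the taint on the node and the matching toleration on the team's Pods.

```shell
# Reserve the node: Pods without the toleration are evicted
# immediately and can no longer be scheduled here.
kubectl taint nodes gpu-node-1 dedicated=data-science:NoExecute
```

```yaml
# Fragment of the team's Pod spec: the toleration that permits
# running on the reserved node (no tolerationSeconds, so the Pod
# stays bound indefinitely).
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "data-science"
  effect: "NoExecute"
```

Note that the toleration only allows placement; to actively attract these Pods to the GPU nodes, you would pair it with node affinity.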
2. Automated Taint-Based Evictions
The Kubernetes node controller automatically monitors the health of nodes and translates specific hardware or network conditions into taints, triggering automated evictions.
- Node Unreachable: If the control plane loses contact with the `kubelet`, the node's `Ready` condition changes to `Unknown`. The node controller automatically applies the `node.kubernetes.io/unreachable:NoExecute` taint.
- Node Not Ready: If the node is reporting unhealthy statuses, the condition changes to `False`, and the controller applies the `node.kubernetes.io/not-ready:NoExecute` taint.
- Resource Pressure: Conditions like `MemoryPressure`, `DiskPressure`, and `PIDPressure` typically trigger `NoSchedule` taints to prevent new workloads from overwhelming the node, but do not directly trigger `NoExecute` taints. (Actual runtime eviction for resource starvation is handled separately by the `kubelet`'s node-pressure eviction mechanisms, which rely on local cgroup metrics rather than API taints.)
Real-World Protection Mechanisms: To prevent catastrophic cascading failures during brief network partitions, Kubernetes automatically injects default tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable into every new Pod, with a tolerationSeconds value of 300. This means that if a switch goes down and a node becomes unreachable, the automated NoExecute taint is applied, but the Pods will wait for 5 minutes before actually being evicted, giving the network time to recover.
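The automatically injected defaults are equivalent to adding the following fragment to a Pod that does not set these tolerations itself (values shown for the default 300-second grace window):

```yaml
# Default tolerations injected into new Pods: tolerate node failure
# taints for 5 minutes before taint-based eviction kicks in.
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```

Setting your own toleration for either key overrides the injected default, which is how workloads opt into faster or slower failover.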
Furthermore, foundational system workloads like DaemonSets are automatically granted NoExecute tolerations without a tolerationSeconds limit for these unreachability taints, ensuring that critical daemons like network plugins are never evicted during node faults.