
Advanced Scheduling

How does kube-scheduler score nodes, and how do NodeAffinity, PodAffinity, and PodAntiAffinity differ in the Filter and Score phases?

The Kubernetes kube-scheduler is responsible for finding the optimal placement for Pods across a cluster's Nodes. To achieve this, it employs a highly extensible, two-step architecture for every Pod it evaluates: Filtering (identifying feasible Nodes) and Scoring (ranking the feasible Nodes).

Understanding how the scheduler evaluates constraints and mathematically ranks nodes is critical for architects designing high-performance, resilient systems.
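The two-phase cycle can be modeled as a toy loop. This is a minimal sketch of the control flow only, not the real scheduling framework; the filter and score callables below are hypothetical placeholders:

```python
# Toy model of one scheduling cycle: Filter, then Score.
# The real kube-scheduler runs configurable plugin chains; these
# placeholder functions only illustrate the two-phase control flow.

def schedule_one(pod, nodes, filter_plugins, score_plugins):
    # Filter phase: keep only Nodes that pass every hard constraint.
    feasible = [n for n in nodes if all(f(pod, n) for f in filter_plugins)]
    if not feasible:
        return None  # no feasible Node: the Pod stays Pending

    # Score phase: rank feasible Nodes and pick the highest total score.
    return max(feasible, key=lambda n: sum(s(pod, n) for s in score_plugins))

# Hypothetical example: one filter (enough free CPU) and one scorer
# (prefer the Node with the most free CPU).
nodes = [{"name": "n1", "free_cpu": 2}, {"name": "n2", "free_cpu": 6}]
pod = {"cpu": 4}
filters = [lambda p, n: n["free_cpu"] >= p["cpu"]]
scorers = [lambda p, n: n["free_cpu"]]
schedule_one(pod, nodes, filters, scorers)  # selects n2
```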

The Algorithmic Differences: NodeAffinity vs. PodAffinity vs. PodAntiAffinity

Affinity and anti-affinity rules allow you to express both "hard" guarantees (evaluated during the Filter phase) and "soft" preferences (evaluated during the Score phase). However, the underlying algorithms and their computational complexities vary drastically.

1. NodeAffinity

NodeAffinity restricts or prefers scheduling based entirely on the labels applied to the Nodes themselves.

  • Filter Phase (requiredDuringSchedulingIgnoredDuringExecution): The algorithm performs a straightforward set-matching operation, checking the incoming Pod's node selector terms against the labels of each candidate Node. Match expressions within a single nodeSelectorTerm are logically ANDed together, while multiple nodeSelectorTerms are ORed. If no term matches, the Node is dropped from the feasible list.
  • Score Phase (preferredDuringSchedulingIgnoredDuringExecution): For each preferred rule that a feasible Node satisfies, the scheduler adds the explicitly defined weight (a value between 1 and 100) to that Node's running score. Nodes with the highest accumulated weights are prioritized.
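In manifest form, the hard and soft modes sit side by side under nodeAffinity. The label keys, values, and weight below are illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # Filter phase (hard)
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
      preferredDuringSchedulingIgnoredDuringExecution:  # Score phase (soft)
      - weight: 80          # added to the Node's score if the rule matches
        preference:
          matchExpressions:
          - key: node-type  # hypothetical label
            operator: In
            values:
            - high-memory
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```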

2. PodAffinity and PodAntiAffinity

While NodeAffinity evaluates static Node labels, Inter-Pod Affinity and Anti-Affinity evaluate the dynamic state of the cluster. They constrain scheduling based on the labels of other Pods already running in a specific topology domain (like a rack, zone, or hostname).

  • Filter Phase (Hard Constraints): When evaluating a new Pod, the scheduler must identify the topology domains of all existing Pods that match the requested label selector. For PodAffinity, the scheduler filters out any Node that does not belong to a topology domain containing at least one matching Pod. For PodAntiAffinity, the scheduler filters out Nodes belonging to a topology domain that already contains a matching Pod.
  • Score Phase (Soft Constraints): The scheduler scores nodes based on how well they meet the preferred inter-pod affinity and anti-affinity rules. For affinity, it adds the configured weight to the Node's score; for anti-affinity, it subtracts the weight from the Node's score.
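A typical pattern combines both modes: a hard anti-affinity rule to forbid co-location on the same host, plus a soft rule to spread replicas across zones. The app label and weight here are placeholders:

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # Filter: never co-locate
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname             # topology domain = node
      preferredDuringSchedulingIgnoredDuringExecution:  # Score: prefer spreading
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: topology.kubernetes.io/zone      # topology domain = zone
```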

Architectural Warning (Computational Complexity): Because PodAffinity and PodAntiAffinity require the scheduler to cross-reference the labels of the incoming Pod against the labels of all existing Pods across all mapped topology domains, the algorithm is computationally expensive (often scaling at $O(N^2)$ relative to the number of Pods). The official Kubernetes project warns that these operations require substantial amounts of processing and can significantly slow down scheduling in clusters larger than a few hundred nodes.


A Mathematical Deep Dive into Node Scoring

If multiple Nodes survive the Filter phase, the scheduler moves to the Score phase to rank them. The scheduler runs a series of configured scoring plugins, each returning a score for every Node; after normalization, each plugin's score falls within a well-defined integer range (0 to 100). The scheduler then multiplies each plugin's normalized score by that plugin's configured weight and sums the results to produce the Node's final score.
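That aggregation can be sketched in a few lines. The plugin names, scores, and weights below are hypothetical inputs, not values the scheduler would actually produce:

```python
# Hypothetical normalized plugin scores (0-100) and configured plugin weights.
plugin_weights = {"NodeResourcesFit": 1, "TaintToleration": 3}
node_scores = {
    "node-a": {"NodeResourcesFit": 70, "TaintToleration": 100},
    "node-b": {"NodeResourcesFit": 90, "TaintToleration": 0},
}

def total_score(scores, weights):
    # Final Node score = sum over plugins of (normalized score x plugin weight).
    return sum(scores[p] * weights[p] for p in scores)

{name: total_score(s, plugin_weights) for name, s in node_scores.items()}
# node-a: 70*1 + 100*3 = 370; node-b: 90*1 + 0*3 = 90
```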

To understand how this math works in practice, let's examine the Resource Bin Packing algorithm, specifically the RequestedToCapacityRatio scoring strategy.

The RequestedToCapacityRatio Algorithm

This strategy scores nodes based on a configured mathematical function of their allocated resources. It allows administrators to bin-pack resources (like CPU, memory, and extended resources like GPUs) to improve utilization of scarce hardware.

The algorithm uses a shape array to map a Node's resource utilization percentage to a specific score. For example, to favor bin-packing (packing Pods tightly onto fewer Nodes), an administrator might configure a shape mapping where 0% utilization = a score of 0, and 100% utilization = a score of 10.
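Between configured shape points, the score is interpolated linearly. Here is a minimal sketch of that mapping; the name mirrors the rawScoringFunction referenced in the example below, but this implementation is a simplification, not the plugin's actual code:

```python
def raw_scoring_function(utilization, shape):
    """Map a utilization percentage to a score via piecewise-linear
    interpolation over the configured shape points, floored to an int.

    shape: list of (utilization_percent, score) pairs, sorted by utilization.
    """
    if utilization <= shape[0][0]:
        return shape[0][1]
    for (x0, y0), (x1, y1) in zip(shape, shape[1:]):
        if utilization <= x1:
            return int(y0 + (y1 - y0) * (utilization - x0) / (x1 - x0))
    return shape[-1][1]  # clamp above the last point

# Bin-packing shape from the text: 0% -> 0, 100% -> 10.
shape = [(0, 0), (100, 10)]
raw_scoring_function(75, shape)  # 7 (7.5 floored)
```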

Step-by-Step Scoring Example

Assume we have an incoming Pod requesting the following resources, and the cluster administrator has assigned specific weights to each resource to dictate their importance:

  • Incoming Pod Requests: intel.com/foo: 2, memory: 256MB, cpu: 2.
  • Configured Resource Weights: intel.com/foo: 5, memory: 1, cpu: 3.
  • Scoring Function Shape: {0% utilization = 0 score}, {100% utilization = 10 score}.
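These parameters correspond to a scheduler configuration along the following lines, using the NodeResourcesFit plugin's RequestedToCapacityRatio scoring strategy (a sketch; verify exact field names against your scheduler version):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: RequestedToCapacityRatio
        requestedToCapacityRatio:
          shape:                 # 0% utilization -> 0, 100% -> 10 (bin-packing)
          - utilization: 0
            score: 0
          - utilization: 100
            score: 10
        resources:               # per-resource weights from the example
        - name: intel.com/foo
          weight: 5
        - name: memory
          weight: 1
        - name: cpu
          weight: 3
```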

Let's evaluate a feasible Node (Node 1) with the following specifications:

  • Node 1 Capacity (Available): intel.com/foo: 4, memory: 1024MB, cpu: 8.
  • Node 1 Currently Used (Before new Pod): intel.com/foo: 1, memory: 256MB, cpu: 1.

The scheduler calculates the score for each individual resource by finding the projected utilization percentage if the Pod were scheduled there, and mapping it to the shape function.

1. Scoring intel.com/foo:

  • Projected Usage = Used (1) + Requested (2) = 3.
  • Utilization Percentage = (3 / 4) * 100 = 75%.
  • Mapped Score = rawScoringFunction(75%). Since 100% = 10, 75% yields a raw score of 7 (using integer flooring).

2. Scoring memory:

  • Projected Usage = Used (256MB) + Requested (256MB) = 512MB.
  • Utilization Percentage = (512 / 1024) * 100 = 50%.
  • Mapped Score = rawScoringFunction(50%) = 5.

3. Scoring cpu:

  • Projected Usage = Used (1) + Requested (2) = 3.
  • Utilization Percentage = (3 / 8) * 100 = 37.5%.
  • Mapped Score = rawScoringFunction(37.5%) = 3.

4. Calculating the Final Node Score: Once the individual resource scores are calculated, the scheduler aggregates them using the configured resource weights to generate the final Node score. The formula is:

$$NodeScore = \frac{\sum (ResourceScore \times Weight)}{\sum Weights}$$

Plugging in our calculations for Node 1: $$NodeScore = \frac{(7 \times 5) + (5 \times 1) + (3 \times 3)}{5 + 1 + 3}$$ $$NodeScore = \frac{35 + 5 + 9}{9}$$ $$NodeScore = \frac{49}{9} = 5.44$$

Using integer math, the final score for Node 1 is 5.
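The whole calculation can be checked end to end with a few lines. The values are copied from the example above; the integer-flooring behavior is an assumption chosen to match the raw scores shown:

```python
# Values from the worked example (memory in MB).
requested = {"intel.com/foo": 2, "memory": 256,  "cpu": 2}
weights   = {"intel.com/foo": 5, "memory": 1,    "cpu": 3}
used      = {"intel.com/foo": 1, "memory": 256,  "cpu": 1}
capacity  = {"intel.com/foo": 4, "memory": 1024, "cpu": 8}

def resource_score(res):
    # Projected utilization if the Pod lands on this Node, mapped through
    # the shape {0% -> 0, 100% -> 10} and floored to an integer.
    utilization = (used[res] + requested[res]) / capacity[res] * 100
    return int(utilization * 10 / 100)

# Weighted sum of per-resource scores, divided by the total weight.
node_score = (sum(resource_score(r) * weights[r] for r in requested)
              // sum(weights.values()))
print(node_score)  # 5, matching the result above
```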

The scheduler repeats this exact algorithmic process for every feasible Node in the cluster (e.g., yielding a score of 7 for Node 2). Finally, the kube-scheduler assigns the Pod to the Node with the highest final ranking. If there is a tie, the scheduler selects one of the top-scoring Nodes at random.

Based on Kubernetes v1.35 (Timbernetes).