
How does Kubernetes support autoscaling (HPA, VPA, Cluster Autoscaler), and what signals drive scaling decisions?

Hello. As a Senior Kubernetes Platform Engineer, I will break down the autoscaling ecosystem. Effective scaling requires understanding that Kubernetes operates on two distinct layers: the Workload Layer (Pods) and the Infrastructure Layer (Nodes).

To build a resilient platform, you must coordinate three primary mechanisms: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Node Autoscaling (Cluster Autoscaler or Karpenter).

1. Horizontal Pod Autoscaler (HPA)

Goal: Scale the number of replicas to match demand (scaling out).

HPA is a control loop that scales a workload resource (such as a Deployment or StatefulSet) based on observed metrics. It is the standard mechanism for handling traffic spikes.

How it works: The HPA controller, running in the kube-controller-manager, executes a control loop (every 15 seconds by default). It calculates the desired number of replicas using the following ratio: $$ \text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil $$
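
For example, if 4 replicas are running at 80% average CPU utilization against a 50% target: $$ \left\lceil 4 \times \frac{80}{50} \right\rceil = \lceil 6.4 \rceil = 7 $$ so the controller scales the workload to 7 replicas.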

Signals driving HPA:

  • Resource Metrics (Standard): The most common signals are CPU and Memory usage. HPA fetches these from the Metrics API (usually provided by the metrics-server add-on). The controller calculates utilization as a percentage of the Pod's resource requests (see the manifest sketch after this list).
  • Container Resource Metrics: HPA can scale based on the usage of individual containers rather than the aggregate Pod usage, which is useful for Pods with sidecars (e.g., a logging sidecar shouldn't trigger scaling for the main app).
  • Custom and External Metrics: Using the custom.metrics.k8s.io or external.metrics.k8s.io APIs, HPA can scale on signals unrelated to Kubernetes objects, such as:
    • Network traffic: packets per second or Ingress requests per second.
    • Queue depth: Messages pending in an external cloud service (like SQS or Pub/Sub).
    • Event-Driven: Tools like KEDA (Kubernetes Event Driven Autoscaler) can feed these external event signals into the HPA.
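
As a concrete illustration, here is a minimal autoscaling/v2 HPA manifest combining a standard CPU target with an external queue-depth metric. The names (web, worker_queue_depth) are hypothetical, and the external metric assumes an adapter such as KEDA is serving external.metrics.k8s.io:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource               # standard signal via metrics-server
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # % of the Pod's CPU *requests*
  - type: External               # signal via external.metrics.k8s.io
    external:
      metric:
        name: worker_queue_depth # hypothetical external metric
      target:
        type: AverageValue
        averageValue: "30"       # ~30 pending messages per replica
```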

Stability Mechanisms: To prevent "flapping" (rapidly scaling up and down), HPA uses a stabilization window (default 5 minutes for scale-down) to smooth out noisy metrics before removing Pods.
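
These windows are tunable per-HPA via the behavior field in autoscaling/v2. A minimal sketch of a deliberately conservative scale-down policy:

```yaml
# Inside the HPA spec: slow, conservative scale-down
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # the default 5-minute window
    policies:
    - type: Pods
      value: 1                       # remove at most 1 Pod...
      periodSeconds: 60              # ...per minute
  scaleUp:
    stabilizationWindowSeconds: 0    # react to spikes immediately
```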

2. Vertical Pod Autoscaler (VPA)

Goal: Adjust the resource requests and limits of containers to match actual usage (scaling up).

VPA is designed for "rightsizing" workloads. It frees you from manually tuning CPU/Memory requests, which is critical because incorrect requests can lead to either waste (over-provisioning) or instability/throttling (under-provisioning).

How it works: VPA consists of three components:

  1. Recommender: Analyzes historical resource usage to calculate optimal values.
  2. Updater: Decides if a Pod needs to be updated to apply new resources.
  3. Admission Controller: Overrides resource requests on new Pods as they are created.

Signals driving VPA: VPA relies primarily on historical usage data (CPU and Memory) and real-time events like OOM (Out of Memory) kills. It looks at trends over time to determine the lower bound (minimum viable) and upper bound (maximum reasonable) for resources.
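
The Recommender publishes these bounds in the VPA object's status. The shape below follows the autoscaling.k8s.io/v1 API, with illustrative values:

```yaml
status:
  recommendation:
    containerRecommendations:
    - containerName: app    # hypothetical container
      lowerBound:           # minimum viable requests
        cpu: 100m
        memory: 256Mi
      target:               # recommended requests
        cpu: 250m
        memory: 512Mi
      upperBound:           # maximum reasonable requests
        cpu: "1"
        memory: 1Gi
```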

Update Modes:

  • Off: Calculates recommendations but does not apply them (dry-run).
  • Initial: Applies resources only when a Pod is created. It does not change running Pods.
  • Recreate: Evicts running Pods if their requests deviate significantly from the recommendation. The workload controller then recreates the Pod with new values.
  • InPlaceOrRecreate: Attempts to resize the Pod in-place (without restarting) if the underlying node has capacity. If in-place resize isn't possible, it falls back to eviction.
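
A minimal VerticalPodAutoscaler manifest, shown here in dry-run mode ("Off") so you can inspect recommendations before enforcing them; the target names are hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical Deployment
  updatePolicy:
    updateMode: "Off"      # recommend only; switch modes once validated
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:          # clamp recommendations to sane bounds
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
```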

3. Node Autoscaling (Cluster Autoscaler / Karpenter)

Goal: Adjust the number of Nodes in the cluster to accommodate Pods.

Node autoscaling bridges the gap between your abstract workloads and the physical/virtual infrastructure.

How it works: Unlike HPA/VPA, Node Autoscalers do not typically look at CPU/Memory usage percentages of the nodes. Instead, they focus on schedulability and allocation.

Signals driving Node Scaling:

  1. Provisioning (Scale Up): The primary signal is the presence of unschedulable Pods (Pods stuck in a Pending state). If a Pod cannot be scheduled because no Node has enough free requested capacity (or it fails affinity/taint rules), the autoscaler provisions a new Node (see the Pod status sketch after this list).
  2. Consolidation (Scale Down): The autoscaler watches for nodes that are underutilized. A node is a candidate for removal if:
    • The sum of Pod requests on it is low relative to the Node's allocatable capacity (the autoscaler monitors requested allocation, not live CPU usage).
    • All Pods running on it can be rescheduled onto other existing nodes.
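
This is what the scale-up signal looks like on the Pod itself: a sketch of the status an unschedulable Pod carries (the exact message wording varies by scheduler version and cluster size):

```yaml
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable   # the trigger node autoscalers watch for
    message: '0/3 nodes are available: 3 Insufficient cpu.'
```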

Implementations:

  • Cluster Autoscaler: The traditional standard. It modifies cloud provider Node Groups (e.g., ASGs).
  • Karpenter: A newer, more direct approach that provisions nodes directly (bypassing node groups) to exactly fit the pending Pods' constraints.
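
For flavor, a trimmed Karpenter NodePool sketch, assuming Karpenter v1 on AWS (field names vary between Karpenter versions), showing how provisioning constraints and the consolidation signal are configured declaratively:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:            # cloud-specific node template
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # hypothetical EC2NodeClass
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
  limits:
    cpu: "100"                 # cap total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m       # wait before consolidating a node
```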

Summary of Autoscaling Interaction

The interaction between these layers forms a complete elasticity loop:

  1. Workload expands: Traffic increases → HPA detects metric spike → HPA increases replica count.
  2. Infrastructure reacts: New Pods are created but stay Pending because nodes are full → Cluster Autoscaler detects pending Pods → Provisions new Nodes.
  3. Workload shrinks: Traffic drops → HPA reduces replica count.
  4. Infrastructure consolidates: Nodes become underutilized (low request allocation) → Cluster Autoscaler drains and removes empty nodes.

Engineering Warning: Do not use HPA and VPA on the same metric (e.g., CPU) for the same workload. They will fight each other: HPA will try to add pods to lower average usage, while VPA will try to lower requests because per-pod usage dropped, leading to instability. Use HPA for demand (requests per second) and VPA for rightsizing (CPU/Mem), or separate them entirely.
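
If you must run both on one workload, one common mitigation, sketched under the assumption that HPA targets CPU: constrain the VPA's resourcePolicy with controlledResources so it manages memory only, and the two controllers never act on the same signal:

```yaml
# VPA resourcePolicy: rightsize memory only,
# leaving CPU to the HPA's utilization target
resourcePolicy:
  containerPolicies:
  - containerName: '*'
    controlledResources: ["memory"]
```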