Kubernetes Autoscaling: HPA, VPA & Cluster Autoscaler Guide
How does Kubernetes support autoscaling (HPA, VPA, Cluster Autoscaler), and what signals drive scaling decisions?
Let's break down the autoscaling ecosystem. Effective scaling requires understanding that Kubernetes operates on two distinct layers: the Workload Layer (Pods) and the Infrastructure Layer (Nodes).
To build a resilient platform, you must coordinate three primary mechanisms: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Node Autoscaling (Cluster Autoscaler or Karpenter).
1. Horizontal Pod Autoscaler (HPA)
Goal: Scale the number of replicas to match demand (scaling out).
HPA is a control loop that scales a workload resource (like a Deployment or StatefulSet) based on specific metrics. It is the standard mechanism for handling traffic spikes.
How it works: The HPA controller, running in the kube-controller-manager, executes a control loop (default every 15 seconds). It calculates the desired number of replicas using the following ratio: $$ \text{desiredReplicas} = \lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \rceil $$
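The ratio above can be sketched in a few lines. This is an illustrative calculation only (the real controller also applies tolerances, min/max bounds, and readiness checks):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Core HPA formula: ceil(currentReplicas * current / target).

    Illustrative sketch; the real controller additionally applies a
    tolerance band, minReplicas/maxReplicas clamping, and ignores
    not-yet-ready Pods.
    """
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 replicas averaging 90% CPU against a 60% target:
print(desired_replicas(4, 90, 60))  # -> 6
```

Note how the ceiling function biases toward scaling up: even a small overshoot of the target adds a replica rather than leaving the workload under-provisioned.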
Signals driving HPA:
- Resource Metrics (Standard): The most common signals are CPU and Memory usage. HPA fetches these from the Metrics API (usually provided by the metrics-server add-on). The controller calculates utilization as a percentage of the Pod's resource requests.
- Container Resource Metrics: HPA can scale based on the usage of individual containers rather than the aggregate Pod usage, which is useful for Pods with sidecars (e.g., a logging sidecar shouldn't trigger scaling for the main app).
- Custom and External Metrics: Using the custom.metrics.k8s.io or external.metrics.k8s.io APIs, HPA can scale on signals beyond built-in resource usage, such as:
- Network traffic: Packets per second or Ingress hits.
- Queue depth: Messages pending in an external cloud service (like SQS or Pub/Sub).
- Event-Driven: Tools like KEDA (Kubernetes Event Driven Autoscaler) can feed these external event signals into the HPA.
Stability Mechanisms: To prevent "flapping" (rapidly scaling up and down), HPA uses a stabilization window (default 5 minutes for scale-down) to smooth out noisy metrics before removing Pods.
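The resource-metric signal and the scale-down stabilization window both map onto fields of the autoscaling/v2 API. A minimal sketch (the Deployment name web and the thresholds are placeholder values, not from the source):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:           # the workload resource HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization   # percentage of the Pod's CPU request
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # the 5-minute default, made explicit
```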
2. Vertical Pod Autoscaler (VPA)
Goal: Adjust the resource requests and limits of containers to match actual usage (scaling up).
VPA is designed for "rightsizing" workloads. It frees you from manually tuning CPU/Memory requests, which is critical because incorrect requests can lead to either waste (over-provisioning) or instability/throttling (under-provisioning).
How it works: VPA consists of three components:
- Recommender: Analyzes historical resource usage to calculate optimal values.
- Updater: Decides if a Pod needs to be updated to apply new resources.
- Admission Controller: Overrides resource requests on new Pods as they are created.
Signals driving VPA: VPA relies primarily on historical usage data (CPU and Memory) and real-time events like OOM (Out of Memory) kills. It looks at trends over time to determine the lower bound (minimum viable) and upper bound (maximum reasonable) for resources.
Update Modes:
- Off: Calculates recommendations but does not apply them (dry-run).
- Initial: Applies resources only when a Pod is created. It does not change running Pods.
- Recreate: Evicts running Pods if their requests deviate significantly from the recommendation. The workload controller then recreates the Pod with new values.
- InPlaceOrRecreate: Attempts to resize the Pod in-place (without restarting) if the underlying node has capacity. If in-place resize isn't possible, it falls back to eviction.
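The update modes above are selected via updatePolicy.updateMode in the VerticalPodAutoscaler object. A minimal sketch using the dry-run mode, which is a common starting point before trusting VPA with evictions (the target name web is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:                # the workload VPA observes and rightsizes
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"       # compute recommendations only; do not evict Pods
```

Inspecting the object afterwards (e.g. kubectl describe vpa web-vpa) shows the Recommender's lower-bound, target, and upper-bound values, which you can compare against your hand-tuned requests before switching to an active mode.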
3. Node Autoscaling (Cluster Autoscaler / Karpenter)
Goal: Adjust the number of Nodes in the cluster to accommodate Pods.
Node autoscaling bridges the gap between your abstract workloads and the physical/virtual infrastructure.
How it works: Unlike HPA/VPA, Node Autoscalers do not typically look at CPU/Memory usage percentages of the nodes. Instead, they focus on schedulability and allocation.
Signals driving Node Scaling:
- Provisioning (Scale Up): The primary signal is the presence of unschedulable Pods (Pods in a Pending state). If a Pod cannot be scheduled because no Node has enough free requested capacity (or fails affinity/taint rules), the autoscaler provisions a new Node.
- Consolidation (Scale Down): The autoscaler watches for nodes that are underutilized. A node is a candidate for removal if:
- All Pods running on it can be rescheduled onto other existing nodes.
- The sum of Pod requests on the node is low relative to its capacity; the autoscaler evaluates requested resources, not live CPU usage.
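The consolidation logic above can be sketched as a request-based check. This is a deliberately simplified illustration (function names and the 50% threshold are assumptions; real autoscalers also honor PodDisruptionBudgets, affinity rules, and local storage):

```python
def can_remove_node(pod_requests_mcpu: list[int],
                    allocatable_mcpu: int,
                    free_elsewhere_mcpu: list[int],
                    threshold: float = 0.5) -> bool:
    """Simplified drain-candidate check based on *requests*, not live usage.

    A node qualifies if (a) its summed Pod requests are under a threshold
    fraction of its allocatable capacity, and (b) every Pod fits into the
    request-based headroom of some other node (greedy first-fit).
    """
    if sum(pod_requests_mcpu) > threshold * allocatable_mcpu:
        return False
    free = sorted(free_elsewhere_mcpu, reverse=True)
    for req in sorted(pod_requests_mcpu, reverse=True):
        for i, headroom in enumerate(free):
            if req <= headroom:
                free[i] -= req   # place the Pod on that node
                break
        else:
            return False         # some Pod has nowhere to go
    return True

# Two small Pods, lots of headroom elsewhere: node can be drained.
print(can_remove_node([200, 300], 2000, [1000, 500]))  # -> True
```

Note that a node running Pods at 5% live CPU but with 80% of its capacity *requested* is not a consolidation candidate; the signal is allocation, not usage.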
Implementations:
- Cluster Autoscaler: The traditional standard. It modifies cloud provider Node Groups (e.g., ASGs).
- Karpenter: A newer, more direct approach that provisions nodes directly (bypassing node groups) to exactly fit the pending Pods' constraints.
Summary of Autoscaling Interaction
The interaction between these layers forms a complete elasticity loop:
- Workload expands: Traffic increases → HPA detects metric spike → HPA increases replica count.
- Infrastructure reacts: New Pods are created but stay Pending because nodes are full → Cluster Autoscaler detects pending Pods → provisions new Nodes.
- Workload shrinks: Traffic drops → HPA reduces replica count.
- Infrastructure consolidates: Nodes become underutilized (low request allocation) → Cluster Autoscaler drains and removes empty nodes.
Engineering Warning: Do not use HPA and VPA on the same metric (e.g., CPU) for the same workload. They will fight each other: HPA will try to add pods to lower average usage, while VPA will try to lower requests because per-pod usage dropped, leading to instability. Use HPA for demand (requests per second) and VPA for rightsizing (CPU/Mem), or separate them entirely.