
How do Kubernetes resource limits translate to Linux kernel primitives?

Based on the provided documentation, here is a detailed explanation of performance and resource management in Kubernetes, focusing on how abstract API concepts translate to Linux kernel primitives.

1. Understanding CPU Scheduling

In Kubernetes, CPU is considered a "compressible" resource. If a container tries to use more CPU than it is allowed, it is not terminated; instead, it is throttled (slowed down).

CPU Shares vs. Quotas

The Kubelet and container runtime configure Linux control groups (cgroups) to enforce the requests and limits defined in your Pod specification.

  • CPU Requests (Shares/Weights):
    • Concept: A request defines a guaranteed minimum amount of CPU time and a relative weighting for scheduling.
    • Mechanism: This maps to cpu.shares (cgroup v1) or cpu.weight (cgroup v2).
    • Behavior: If the CPU is idle, a container can consume as much CPU as it needs (up to its limit). If the CPU is contended, the Linux kernel scheduler (CFS) allocates CPU time proportionally based on these shares. A container with a 1000m request gets roughly double the CPU time of a container with a 500m request during contention.
  • CPU Limits (Quotas):
    • Concept: A limit defines a hard ceiling on CPU usage.
    • Mechanism: This maps to the CFS Quota mechanism (e.g., cpu.cfs_quota_us and cpu.cfs_period_us).
    • Behavior: The kernel tracks CPU usage over a specific period (usually 100ms). If a container exhausts its quota within that period, the kernel strictly stops the process from running until the next period begins.
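The two mappings above can be sketched as arithmetic. This is a hedged sketch of the widely documented cgroup v1 conversions the kubelet performs; treat the exact constants (the 1024 share base, the 100ms period, the minimum of 2 shares) as assumptions rather than a stable API.

```python
# Sketch: deriving cgroup v1 CPU values from Kubernetes millicores.
# Constants follow the commonly documented kubelet behavior (assumed).

CFS_PERIOD_US = 100_000   # default CFS enforcement period (100ms)
MIN_SHARES = 2            # kernel-imposed floor for cpu.shares

def millicores_to_shares(request_m: int) -> int:
    """CPU request (millicores) -> cpu.shares (relative weight)."""
    return max(MIN_SHARES, request_m * 1024 // 1000)

def millicores_to_quota(limit_m: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU limit (millicores) -> cpu.cfs_quota_us (ceiling per period)."""
    return limit_m * period_us // 1000

# A 500m request carries half the weight of a 1000m request:
print(millicores_to_shares(500))   # 512
print(millicores_to_shares(1000))  # 1024

# A 500m limit allows at most 50ms of CPU time per 100ms period:
print(millicores_to_quota(500))    # 50000
```

Note that shares only matter under contention, while the quota is enforced unconditionally every period.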

Throttling Behavior

Throttling occurs when a container attempts to use more CPU than its configured limit.

  • The Symptom: The application does not crash, but latency increases significantly because the process is paused by the kernel scheduler.
  • Best Practice: For latency-sensitive workloads, careful tuning of limits is required. If strict latency guarantees are needed, you may use the CPU Manager with the static policy. This allows containers in Guaranteed QoS pods (where request = limit and is an integer) to be assigned exclusive CPUs, bypassing the CFS quota overhead entirely.
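One way to detect this symptom is to inspect the cgroup's cpu.stat file, which counts how many enforcement periods ended in throttling. The sketch below parses the cgroup v1 field names (cgroup v2's cpu.stat uses throttled_usec instead of throttled_time); the file path and field availability are assumptions about your environment.

```python
# Sketch: estimate how often a container hit its CFS quota by
# parsing cpu.stat (cgroup v1 field names, assumed).

def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 12000000"
print(throttle_ratio(sample))  # 0.25 -> throttled in 25% of periods
```

A persistently high ratio on a latency-sensitive workload is a strong signal that the limit is too low or that the workload belongs on exclusive CPUs.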

2. Memory Management

Unlike CPU, memory is an "incompressible" resource. You cannot throttle memory; if you run out, the kernel must terminate a process to recover stability.

Memory Limits and OOM Behavior

  • Enforcement: limits.memory sets a hard cap on memory usage via cgroups.
  • OOM Kill: If a container attempts to allocate memory beyond its limit, the Linux kernel invokes the Out of Memory (OOM) Killer.
    • Behavior: The kernel typically kills the process inside the container that attempted the allocation. If PID 1 is killed, the container crashes and Kubernetes restarts it (depending on restartPolicy).
  • OOM Scores: The Kubelet configures an oom_score_adj for container processes to influence which ones are killed first during node-level memory pressure. The value is derived from the Pod's Quality of Service (QoS) class.
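The QoS-based scoring can be sketched as follows. The constants mirror the kubelet's widely documented heuristic (Guaranteed pods near the minimum, BestEffort at the maximum, Burstable scaled by how much of the node the pod requests); treat them as assumptions, not a stable contract.

```python
# Sketch of the kubelet's oom_score_adj heuristic per QoS class.
# Constants follow commonly documented kubelet behavior (assumed).

def oom_score_adj(qos: str, mem_request: int = 0, node_capacity: int = 1) -> int:
    if qos == "Guaranteed":
        return -997          # killed last under node memory pressure
    if qos == "BestEffort":
        return 1000          # killed first
    # Burstable: the larger the memory request relative to the node,
    # the lower (safer) the score, clamped to [2, 999].
    adj = 1000 - (1000 * mem_request) // node_capacity
    return min(max(adj, 2), 999)

print(oom_score_adj("Guaranteed"))                    # -997
print(oom_score_adj("BestEffort"))                    # 1000
print(oom_score_adj("Burstable", 4 << 30, 16 << 30))  # 750
```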

Page Cache vs. RSS

Kubernetes memory metrics (and enforcement) rely on the concept of the Working Set.

  • Composition: The working set typically includes anonymous memory (RSS: heap, stack, and other allocations made by the application) plus the portion of cached memory (file-backed pages used for I/O) that the kernel cannot easily reclaim.
  • Implication: Applications performing heavy I/O may appear to use more memory than expected because the kernel caches files in the container's memory cgroup. If the container hits its limit due to cache usage, the kernel attempts to reclaim pages. If it cannot reclaim enough, an OOM kill occurs.
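The working-set arithmetic used by cAdvisor-style metrics can be sketched as total usage minus inactive file-backed pages, i.e. the cache the kernel can reclaim cheaply. The field names below are the cgroup v1 ones (memory.stat); they are an assumption about your node's cgroup version.

```python
# Sketch: working set = usage - inactive file cache (cgroup v1
# memory.stat field names, assumed).

def working_set_bytes(usage_in_bytes: int, memory_stat_text: str) -> int:
    inactive_file = 0
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "total_inactive_file":
            inactive_file = int(value)
    return max(0, usage_in_bytes - inactive_file)

# 400 MiB of usage, of which 200 MiB is reclaimable file cache:
stat = ("total_cache 314572800\n"
        "total_rss 104857600\n"
        "total_inactive_file 209715200")
print(working_set_bytes(419430400, stat))  # 209715200 (200 MiB)
```

This is why an I/O-heavy container can show high memory usage without actually being close to an OOM kill: the inactive file pages inflate usage but not the working set.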

3. I/O Performance

Kubernetes manages storage capacity natively, but performance (I/O throughput and IOPS) is largely managed via abstractions and vendor-specific configurations.

Block I/O Limits

  • Ephemeral Storage: The Kubelet can limit local ephemeral storage (logs, emptyDir, writable layers). If a Pod uses more disk space than its limit, the Kubelet evicts the Pod.
  • Block Level Enforcement: While cgroups provide a block I/O controller (blkio in cgroup v1, io in cgroup v2) for throttling disk I/O, Kubernetes resource specifications generally focus on capacity (bytes) rather than throughput (IOPS).
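For context, this is what a block-level throttle looks like at the kernel layer that Kubernetes does not natively drive. The sketch formats a cgroup v2 io.max line; the device numbers and limit values are illustrative assumptions.

```python
# Sketch: formatting a cgroup v2 io.max throttle line. Kubernetes
# does not set these natively; values here are illustrative.

def format_io_max(major: int, minor: int, *, wbps: int = 0, riops: int = 0) -> str:
    """Build an io.max line; zero means 'no limit for that key'."""
    parts = [f"{major}:{minor}"]
    if wbps:
        parts.append(f"wbps={wbps}")
    if riops:
        parts.append(f"riops={riops}")
    return " ".join(parts)

# Cap writes to 10 MiB/s and reads to 1000 IOPS on device 8:0:
print(format_io_max(8, 0, wbps=10 * 1024 * 1024, riops=1000))
# 8:0 wbps=10485760 riops=1000
```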

Monitoring Pressure Stall Information (PSI)

To better understand resource contention (including CPU, Memory, and I/O), you can enable PSI metrics in the Kubelet.

  • Concept: PSI provides metrics on how long processes are stalled waiting for resources.
  • I/O Stalls: container_pressure_io_stalled_seconds_total tells you exactly how much time a container spent waiting for block I/O, which is critical for diagnosing storage performance issues that don't result in crashes but do cause slowness.
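The raw PSI data behind such metrics has a simple text format (exposed at /proc/pressure/io node-wide, or io.pressure per cgroup on cgroup v2). The sketch below parses it; the sample values are made up. The "some" line counts time when at least one task was stalled on I/O, "full" when all tasks were.

```python
# Sketch: parse the kernel's PSI text format ("some"/"full" lines
# with rolling averages and a cumulative total in microseconds).

def parse_psi(psi_text: str) -> dict:
    out = {}
    for line in psi_text.splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

sample = ("some avg10=1.50 avg60=0.80 avg300=0.20 total=123456\n"
          "full avg10=0.40 avg60=0.10 avg300=0.00 total=45678")
psi = parse_psi(sample)
print(psi["full"]["avg10"])   # 0.4  -> % of the last 10s fully stalled
print(psi["some"]["total"])   # 123456.0 (cumulative microseconds)
```

Sustained non-zero "full" averages are the clearest sign of the slow-but-not-crashing behavior described above.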

Based on Kubernetes v1.35 (Timbernetes).