Deep Dive: Cgroups Architecture & v2 Evolution
Control groups (cgroups) are the Linux kernel feature that allows you to constrain resources—such as CPU and memory—allocated to processes. In the context of Kubernetes, they serve as the fundamental mechanism for resource management and isolation.
Origins and Evolution
While the specific kernel development history of cgroups is complex, their operational lineage is directly tied to Google's history with container management systems.
- The Google Heritage: Kubernetes was born from over 15 years of Google's experience running production workloads at scale, specifically influenced by internal systems named Borg and Omega.
- The Need for Isolation: In the "Traditional deployment era," applications ran on physical servers with no way to define resource boundaries, so one application could starve the others. The "Container deployment era" solved this with containers, which share the Operating System (OS) kernel but still provide isolation, albeit with relaxed properties compared to virtual machines.
- The Mechanism: To achieve this isolation on Linux, the kubelet and container runtimes interface with cgroups to enforce limits and requests for resources like CPU and memory.
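This interface is visible directly on a Linux node. With the systemd cgroup driver, the kubelet typically parents pod cgroups under a `kubepods.slice` tree; the exact layout varies by configuration, so treat the path below as an assumption. A minimal read-only sketch:

```shell
#!/bin/sh
# Sketch: list the pod-level cgroups the kubelet manages on a v2 node.
# kubepods.slice is the common default parent with the systemd driver,
# but the exact layout varies by configuration (an assumption here).
base=/sys/fs/cgroup/kubepods.slice
if [ -d "$base" ]; then
  # Each pod gets its own slice; each container a scope beneath it.
  out=$(find "$base" -maxdepth 2 -type d | head -n 10)
else
  out="no kubepods.slice here (not a Kubernetes node, or a different driver)"
fi
echo "$out"
```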
Architecture
On Linux nodes, cgroups act as a "pod boundary" for resource control. Containers are created within that boundary to ensure network, process, and filesystem isolation.
1. Enforcement Mechanisms
The architecture relies on the Linux kernel to enforce the limits defined in your Pod specifications:
- CPU Limits: These are enforced via throttling. When a container exhausts its CPU quota for the current scheduling period, the kernel restricts its access to CPU cycles until the next period begins. It is a "hard limit": the process is slowed, not killed.
- Memory Limits: These are enforced reactively. If a container exceeds its memory limit, the kernel may terminate the process using the Out of Memory (OOM) killer.
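Both behaviors map to v2 interface files: `cpu.max` holds the quota/period pair behind throttling, and `memory.max` the threshold behind OOM kills. The sketch below decodes a sample `cpu.max` value; it parses a literal string for clarity rather than reading a pod's live file:

```shell
#!/bin/sh
# Sketch: decode a cgroup v2 cpu.max entry ("<quota> <period>", microseconds).
# A Pod CPU limit of 0.5 cores is typically written as "50000 100000";
# "max" in the quota field means unlimited. Parsed from a literal here.
cpu_max="50000 100000"
quota=${cpu_max%% *}
period=${cpu_max##* }
if [ "$quota" = "max" ]; then
  cores="unlimited"
else
  cores=$(awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.1f", q / p }')
fi
echo "CPU limit: $cores cores"
```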
2. Cgroup Drivers (cgroupfs vs. systemd)
To interface with cgroups, Kubernetes components (kubelet and container runtime) use a cgroup driver. It is critical that both components use the same driver.
- cgroupfs: This driver interfaces directly with the cgroup filesystem. It is not recommended when systemd is the init system, because it results in two different cgroup managers (systemd and cgroupfs) holding different views of the node's resources, leading to instability.
- systemd: This driver is recommended for systemd-based Linux distributions. Since systemd already generates and consumes a root control group, using this driver ensures a single, tightly integrated manager for resource allocation.
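On the kubelet side, the driver is selected in the kubelet's configuration file. This KubeletConfiguration fragment shows the systemd setting; the matching setting must also be applied in the container runtime:

```yaml
# KubeletConfiguration fragment: select the systemd cgroup driver.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
```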
Difference between Cgroup v1 and v2
Control groups (cgroups) are the Linux kernel primitive used to constrain resources (CPU, memory, I/O) allocated to processes. While both versions serve this fundamental purpose, cgroup v2 represents a complete rewrite of the API to address the architectural limitations of v1.
1. Architectural Design: Unified vs. Split Hierarchy
The most significant structural difference lies in how the hierarchies are organized.
- Cgroup v1: Relies on a separate hierarchy for each controller. This meant that a process could belong to one group for CPU limits and a completely different group for Memory limits. This complexity made resource coordination difficult.
- Cgroup v2: Utilizes a single unified hierarchy design in the API. All resource controllers are mounted into a single tree. This unification allows for better coordination between controllers and simpler management of resource constraints across the system.
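The structural difference is directly observable on a node: v2 exposes every controller through one tree (a `cgroup.controllers` file at each level), whereas v1 mounted each controller separately. A sketch:

```shell
#!/bin/sh
# Sketch: distinguish the unified v2 tree from v1's per-controller mounts.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  # v2: one tree; this file lists every controller available at this level.
  out="v2 unified hierarchy: $(cat /sys/fs/cgroup/cgroup.controllers)"
else
  # v1: a separate mount (and hierarchy) per controller, e.g. cpu/, memory/.
  out="v1 split hierarchy: $(ls /sys/fs/cgroup)"
fi
echo "$out"
```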
2. Resource Management Improvements
Cgroup v2 introduces sophisticated accounting and isolation mechanisms that were not feasible in v1:
- Unified Accounting: v2 provides unified accounting for different types of memory allocations, specifically including network memory and kernel memory, rather than just user-space memory.
- Page Cache Writeback: v2 accounts for non-immediate resource changes, such as page cache writeback, preventing I/O spikes from one container from affecting the stability of others.
- Safer Delegation: v2 offers safer sub-tree delegation to containers, allowing a container orchestrator to securely delegate resource management rights to a container payload.
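The unified accounting surfaces in v2's `memory.stat` file, which breaks memory down into user pages (`anon`, `file`), kernel memory (`slab`), and network buffers (`sock`) in a single place. A read-only sketch:

```shell
#!/bin/sh
# Sketch: show v2's unified memory accounting fields. memory.stat exists on
# non-root v2 cgroups; on v1 (or the v2 root cgroup) it may be absent.
statfile=/sys/fs/cgroup/memory.stat
if [ -f "$statfile" ]; then
  out=$(grep -E '^(anon|file|slab|sock) ' "$statfile" || echo "fields not present")
else
  out="memory.stat not found (cgroup v1, or the root cgroup)"
fi
echo "$out"
```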
3. Kubernetes Feature Availability
From an engineering perspective, several advanced Kubernetes features exclusively require cgroup v2 because they rely on primitives that simply do not exist in v1.
- Memory QoS: Kubernetes uses the v2 primitives memory.min and memory.high to implement Memory QoS. This allows workloads approaching their limit to be throttled and memory availability to be guaranteed, rather than relying solely on the OOM killer.
- Pressure Stall Information (PSI): This feature provides metrics on CPU, memory, and I/O pressure (stalls). The kubelet can expose these metrics for monitoring, but this requires cgroup v2 and Linux kernel 4.20 or later.
- Rootless Mode: Running Kubernetes Node components (like the kubelet) in a user namespace without root privileges requires cgroup v2.
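PSI data, for instance, is read straight from per-cgroup pressure files, each reporting "some" and "full" stall averages over 10s/60s/300s windows. A sketch:

```shell
#!/bin/sh
# Sketch: read PSI (Pressure Stall Information) files on a cgroup v2 node.
# Each reports "some"/"full" stall averages over 10s/60s/300s windows.
# Requires cgroup v2 and Linux kernel 4.20+ with PSI enabled.
found=0
for res in cpu memory io; do
  f="/sys/fs/cgroup/$res.pressure"
  if [ -f "$f" ]; then
    found=1
    echo "== $res =="
    cat "$f"
  fi
done
[ "$found" -eq 1 ] || echo "no pressure files (cgroup v1, or PSI disabled)"
```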
4. Operational Status and Deprecation
As of Kubernetes v1.35, the project has formally shifted support away from cgroup v1.
- Cgroup v1 Status: Deprecated. Removal will follow the standard deprecation policy. Crucially, the Kubelet will no longer start on a cgroup v1 node by default unless explicitly configured to tolerate it.
- Cgroup v2 Status: Stable (since Kubernetes v1.25).
Identifying the Version
You can determine which version a Linux node is using by querying the filesystem type of the cgroup mount point.
```bash
stat -fc %T /sys/fs/cgroup/
```
- Output `cgroup2fs`: The node is using cgroup v2.
- Output `tmpfs`: The node is using cgroup v1.
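Wrapped in a small script, the same check can branch on the result (a sketch; it assumes the node mounts /sys/fs/cgroup at the standard path):

```shell
#!/bin/sh
# Sketch: branch on the filesystem type of the cgroup mount point.
# cgroup2fs => cgroup v2; tmpfs => cgroup v1.
fstype=$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null || echo "unknown")
case "$fstype" in
  cgroup2fs) version="v2" ;;
  tmpfs)     version="v1" ;;
  *)         version="unknown" ;;
esac
echo "cgroup version: $version"
```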
Requirements for v2
To successfully adopt cgroup v2, your infrastructure must meet specific requirements:
- OS: The Linux distribution must enable cgroup v2.
- Kernel: Linux Kernel 5.8 or later is recommended.
- Runtime: The container runtime must support it (e.g., containerd v1.4+, CRI-O v1.20+).
- Driver: The kubelet and runtime should be configured to use the systemd cgroup driver.
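On the runtime side, containerd carries the matching switch in its runc runtime options. This fragment is a sketch for containerd 1.x's CRI plugin; section names may differ in other containerd versions:

```toml
# /etc/containerd/config.toml fragment: make runc use the systemd cgroup driver.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```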
Windows Architecture
It is worth noting that cgroups are a Linux-specific feature. On Windows nodes, resource control is handled differently:
- Job Objects: Windows uses a job object per container with a system namespace filter to contain processes and provide logical isolation.
- Limits: While Windows can limit CPU usage, it cannot guarantee a minimum amount of CPU time in the same way Linux cgroups can.