Union Filesystems

How do Union Filesystems (like overlayfs and aufs) power container image layers and Copy-on-Write (CoW)?

Container images are designed to be strictly stateless and immutable. An image represents a bundle of binary data that encapsulates an application and all its software dependencies, making well-defined assumptions about its runtime environment.

Instead of being monolithic blocks of data, images are constructed from multiple independent layers. This layered architecture provides massive efficiencies in networking and storage. When the kubelet instructs the container runtime to pull an image, the runtime can recognize which specific image layers already exist on the node's local cache. It will only download the missing layers, rather than pulling the entire multi-gigabyte application stack every time.
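The layer-reuse logic above amounts to a set difference between the layers listed in the image manifest and the layers already in the node's cache. A minimal sketch (the digests and cache contents here are invented for illustration):

```python
# Hypothetical sketch of layer reuse during an image pull: only layers
# absent from the node's local cache need to be downloaded.

def layers_to_pull(image_layers, local_cache):
    """Return only the layer digests not already present on the node."""
    return [digest for digest in image_layers if digest not in local_cache]

# The manifest lists layers bottom-to-top; the node already has the base ones.
manifest = ["sha256:base", "sha256:runtime", "sha256:app"]
cache = {"sha256:base", "sha256:runtime"}

print(layers_to_pull(manifest, cache))  # only the missing top layer is fetched
```

Because layers are addressed by content digest, a base layer shared by many images on the node is stored and downloaded exactly once.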

Kubernetes delegates the management of these layers to the underlying container runtime (such as containerd or CRI-O). To present these multiple distinct layers as a single cohesive root filesystem to the application, container runtimes utilize a technology known as a unioning file system.

Union Filesystems (aufs, overlayfs)

A unioning file system combines the namespaces of two or more distinct file systems together to produce a single merged namespace. Historically, implementations for Linux have included unionfs and aufs (introduced around 2006). Modern container runtimes predominantly use overlayfs (or variations like fuse-overlayfs) natively integrated into the Linux kernel.

Union filesystems rely on the concept of "branches," which are the various file systems that are unioned together. In a containerized architecture, the branches are stacked: the underlying container image layers form the branches "on the bottom" and are strictly read-only, while a thin, temporary branch is placed "on top" to act as the writable overlay for the running container.

How Image Layers Work

When a Pod is launched, a process inside the container sees a root filesystem view composed of the initial contents of its container image, merged by the union filesystem. If a file exists in multiple branches, the union filesystem rules dictate visibility, ensuring the upper layers supersede the lower layers.
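The visibility rule can be sketched as a lookup over a stack of branches, upper layer first; the branch contents below are invented for illustration:

```python
# Minimal sketch of merged-namespace lookup: branches are ordered from the
# writable upper layer down to the lowest image layer, and the first branch
# containing the path wins.

def lookup(path, branches):
    """Resolve `path` against a stack of branches, upper layer first."""
    for branch in branches:          # branches[0] is the writable upper layer
        if path in branch:
            return branch[path]
    raise FileNotFoundError(path)

upper = {"/etc/app.conf": "tuned settings"}          # writable container layer
lower = {"/etc/app.conf": "image defaults", "/bin/app": "binary"}
stack = [upper, lower]

print(lookup("/etc/app.conf", stack))  # upper layer supersedes the image copy
print(lookup("/bin/app", stack))       # falls through to the read-only layer
```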

Because the underlying image layers are immutable, deleting files requires special mechanics. If a process running in the container deletes a file that originated from a lower read-only branch, the union filesystem must ensure that the directory entry never appears again in the merged namespace. This is achieved using two primary mechanisms:

  1. Whiteouts: A whiteout is a special directory entry created in the upper writable branch (aufs uses a reserved name of the form .wh.<filename>, while overlayfs uses a character device with device number 0/0). This marker covers up and hides all entries of that name in the lower read-only branches.
  2. Opaque Directories: An opaque directory operates similarly, but at the directory level (overlayfs marks it with the trusted.overlay.opaque extended attribute). It acts as a shield that prevents any part of the lower branches' namespace from showing through from that directory downwards.
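Both deletion mechanisms can be sketched with simplified sentinel markers standing in for the real on-disk encodings described above (the paths and layer contents are hypothetical):

```python
# Sketch of how whiteouts and opaque directories hide lower-layer entries.
# WHITEOUT and OPAQUE are simplified stand-ins for the real encodings
# (aufs `.wh.` names, overlayfs 0/0 char devices and the opaque xattr).

WHITEOUT = object()   # hides one name from all lower branches
OPAQUE = object()     # hides everything beneath a directory prefix

def visible(path, upper, lower):
    if upper.get(path) is WHITEOUT:
        return None                       # deleted: never fall through
    if path in upper:
        return upper[path]
    for prefix, marker in upper.items():
        if marker is OPAQUE and path.startswith(prefix + "/"):
            return None                   # shielded by an opaque directory
    return lower.get(path)

upper = {"/etc/secret": WHITEOUT, "/var/cache": OPAQUE}
lower = {"/etc/secret": "old", "/var/cache/stale": "bytes", "/bin/app": "binary"}

print(visible("/etc/secret", upper, lower))       # None: whiteout hides it
print(visible("/var/cache/stale", upper, lower))  # None: opaque dir shields it
print(visible("/bin/app", upper, lower))          # still visible from below
```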

The Copy-on-Write (CoW) Mechanism

The interaction between the read-only image layers and the upper writable layer is governed by a Copy-on-Write (CoW) strategy.

Because the underlying image branches cannot be changed, any modifications must be intercepted. When the container alters a file that lives on a read-only branch, the file must first be "copied up" to the writable branch. The kernel triggers this copy-up operation either when the file is opened for writing, or the moment its data or metadata is first modified.

Once the file is copied into the upper writable layer, the container modifies this newly copied version. All subsequent reads and writes of that path are served from the upper-layer copy, while the original file in the lower image layer remains completely untouched.
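The copy-up flow above can be sketched with layer dictionaries standing in for real filesystem branches (the paths and contents are invented for illustration):

```python
# Copy-on-Write sketch: opening a lower-layer file for writing first copies
# it up into the writable layer; every later write touches only that copy.

def open_for_write(path, upper, lower):
    if path not in upper:                 # first modification: trigger copy-up
        upper[path] = lower[path]         # copy the file into the upper branch
    return upper                          # writes now land in the upper layer

lower = {"/etc/app.conf": "image defaults"}   # read-only image layer
upper = {}                                    # writable container layer

layer = open_for_write("/etc/app.conf", upper, lower)
layer["/etc/app.conf"] = "patched at runtime"

print(upper["/etc/app.conf"])   # the container sees the modified copy
print(lower["/etc/app.conf"])   # the image layer is untouched
```

Note that the copy-up covers the whole file: even a one-byte change to a large file duplicates the entire file into the writable layer, which is the performance cost discussed below.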

Operational Workflows and Disk Pressure

Understanding this filesystem architecture directly impacts how you configure local ephemeral storage and manage node health in a Kubernetes cluster:

  • Data Loss on Crash: On-disk files in a container's writable layer are entirely ephemeral. If a container crashes or is stopped, its state is not saved. All files created or modified during the lifetime of the container are permanently lost when the kubelet restarts the container with a clean state.
  • Disk Pressure and Eviction: The kubelet actively tracks the amount of local ephemeral storage consumed by writeable container layers and the underlying container images. Kubernetes nodes can be configured to split these filesystems: a nodefs for the main filesystem, an optional imagefs dedicated strictly to container images (the read-only layers), and an optional containerfs dedicated to the writeable layers.
  • Performance Implications: If an application heavily utilizes the Copy-on-Write mechanism (for example, constantly modifying large files that originated in the read-only image, or writing excessive logs to its root filesystem), it will rapidly consume the node's disk space. If a container's writable layer usage exceeds its configured storage limit, the kubelet will mark the Pod for eviction to protect the node from disk pressure.
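The per-container limit check described in the last bullet reduces to a simple threshold comparison; a sketch with invented byte counts (the real kubelet accounting also includes emptyDir volumes and logs):

```python
# Hypothetical sketch of the ephemeral-storage limit check: a writable layer
# that grows past its configured limit makes the Pod an eviction candidate.

def exceeds_limit(writable_layer_bytes, ephemeral_storage_limit_bytes):
    """True when usage is over the limit, i.e. the Pod should be evicted."""
    return writable_layer_bytes > ephemeral_storage_limit_bytes

limit = 100 * 1024 * 1024                       # e.g. a 100Mi limit
print(exceeds_limit(50 * 1024 * 1024, limit))   # under the limit: keep running
print(exceeds_limit(150 * 1024 * 1024, limit))  # over the limit: evict
```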

Because of the performance overhead of copying files between layers and the risk of triggering node evictions, the standard architectural best practice is to avoid writing application data to the container's local filesystem. Instead, workloads should use Kubernetes Volumes (such as emptyDir or PersistentVolumeClaims), which bypass the container runtime's union filesystem entirely, providing native disk performance and surviving container restarts.

Based on Kubernetes v1.35 (Timbernetes). Changelog.