Filesystem & Storage Architecture

How do Kubernetes filesystems and volume mounts work under the hood?

Containers are not monolithic blocks of data. Their filesystems are assembled at start time from stacked image layers and host-mounted directories. Understanding how the underlying Linux OS manages this is critical for debugging "Permission Denied" errors and diagnosing local disk exhaustion.


1. Container Image Layers (The Union Filesystem)

Container runtimes build a container's root filesystem using a Union Filesystem, typically the Linux kernel's OverlayFS (exposed as the overlay2 storage driver in Docker).

  • Read-Only Image Layers: A container image is composed of multiple read-only layers stacked on top of each other. These layers contain the application code, dependencies, and base operating system files. These layers are completely immutable; once built, they never change.
  • The Writable Container Layer: When a container starts, the runtime adds a thin, volatile "writable layer" (often called the containerfs) directly on top of the read-only image layers. Any changes made by the running application (creating new logs, modifying config files, or deleting data) happen exclusively in this writable layer, and they are discarded when the container is removed.
  • Copy-on-Write (CoW): If a container attempts to modify a file that exists in the read-only image layer below it, the Union Filesystem intercepts the request. It first copies the original file up into the writable layer, and then applies the modification to the copy. The original file in the read-only layer remains untouched.
    • Advantage: This efficiency allows fifty different containers to share the exact same underlying read-only image data on disk while maintaining their own independent, writable state.
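
The split between immutable image layers and the writable layer can be made explicit in a manifest. The following is a minimal sketch, not a recommended configuration: the Pod name, image, and mount path are hypothetical placeholders. It uses the container-level readOnlyRootFilesystem setting so that the root filesystem rejects writes, and routes scratch data to an emptyDir volume instead.

```yaml
# Hypothetical Pod: the name, image, and paths are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: readonly-rootfs-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:
      # Mount the container's root filesystem read-only, so the
      # application cannot write into the writable layer.
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: scratch
      mountPath: /tmp                      # writes go to the volume instead
  volumes:
  - name: scratch
    emptyDir: {}
```

With this setup, a write anywhere outside /tmp should fail with a read-only filesystem error, which makes accidental growth of the writable layer easy to spot.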

2. The Kubelet's Working Directory

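The /var/lib/kubelet directory is the default state directory for the Kubelet agent, which runs on every node. It is where the Kubelet stores per-Pod data, such as volumes, for the Pods scheduled on that specific machine. Container images and writable layers live separately, in the container runtime's own storage directory.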

  • The Pods Subdirectory: The Kubelet manages the lifecycle of Pods under /var/lib/kubelet/pods/<Pod-UID>/. Within a Pod's specific directory, you will find its mounted volumes, volume plugin data, and internal files such as the generated etc-hosts.
  • Local Volumes: If you configure an emptyDir volume, it is physically provisioned within this directory structure (under the Pod's volumes/kubernetes.io~empty-dir/ subdirectory), on whichever host filesystem backs /var/lib/kubelet.

    Disk Exhaustion Risk

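    If a poorly configured Pod writes massive log files to an emptyDir and free space on the host filesystem holding /var/lib/kubelet drops below the Kubelet's eviction threshold (by default on Linux, less than 10% available), the Kubelet starts node-pressure eviction. It terminates Pods on that node, ranked by their local storage usage and priority, until enough disk space is reclaimed. Hard eviction thresholds have no grace period, so Pods are killed immediately.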
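
One way to contain this risk is to cap how much local disk a single Pod may consume. The sketch below assumes a hypothetical Pod (the name, image, mount path, and sizes are placeholders) and uses two standard fields: sizeLimit on the emptyDir volume and ephemeral-storage under the container's resources.

```yaml
# Hypothetical Pod: the name, image, paths, and sizes are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: bounded-scratch-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    resources:
      requests:
        ephemeral-storage: "1Gi"           # used by the scheduler for placement
      limits:
        ephemeral-storage: "2Gi"           # covers writable layer, logs, and disk-backed emptyDir
    volumeMounts:
    - name: scratch
      mountPath: /var/scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 500Mi                     # cap for this volume alone
```

If usage exceeds either limit, the Kubelet evicts only the offending Pod, instead of waiting until the whole node comes under disk pressure.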


3. Mapping Kubernetes Volumes to Linux

Kubernetes volumes exist to allow data to survive container restarts, or to be shared between the containers of a Pod (such as sidecars). The Kubelet maps these high-level APIs directly to standard Linux mount concepts.

  • Bind Mounts (hostPath): A hostPath volume results in a literal Linux bind mount. It maps a specific file or directory from the node's filesystem (e.g., /etc/foo) directly into the container's mount namespace.
  • RAM Disks (medium: "Memory"): If you create an emptyDir volume and explicitly set medium: "Memory", the Kubelet instructs the OS to mount a tmpfs (RAM-backed filesystem) into the container. This is exceptionally fast, but volatile, and files written to it count against the container's memory limit.
    • Note: Secrets are injected by the Kubelet using these same tmpfs mounts, so sensitive values are never written to the node's persistent disk. ConfigMap volumes, by contrast, are stored on disk under the Pod's directory in /var/lib/kubelet.
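
The volume types above can be sketched side by side in a single manifest. The Pod name, image, mount paths, size, and Secret name below are hypothetical placeholders; only the /etc/foo host path comes from the example above.

```yaml
# Hypothetical Pod: names, image, paths, and sizes are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: volume-types-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    volumeMounts:
    - name: host-config
      mountPath: /etc/foo
      readOnly: true
    - name: ram-cache
      mountPath: /cache
    - name: credentials
      mountPath: /etc/credentials
      readOnly: true
  volumes:
  - name: host-config
    hostPath:                  # becomes a bind mount of the node directory
      path: /etc/foo
      type: Directory          # require that the directory already exists on the node
  - name: ram-cache
    emptyDir:
      medium: Memory           # backed by tmpfs; usage counts as container memory
      sizeLimit: 64Mi
  - name: credentials
    secret:                    # Secret volumes are tmpfs-backed
      secretName: app-credentials
```

Running mount inside such a container would be expected to show a tmpfs entry for /cache and /etc/credentials, while /etc/foo appears as a mount of the node's own filesystem.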

4. Navigating Filesystem Permissions

File access inside a Pod is governed by standard Linux ownership and permission checks. The securityContext, set at the Pod or container level, determines which user and group IDs the container processes run as.

By default, many legacy containers attempt to run as root (UID 0). Modern security practices call for running processes as unprivileged users (e.g., UID 1000). This frequently causes file ownership conflicts when dealing with mounted volumes.

The "Permission Denied" Scenario: A common failure occurs when a volume, typically a persistent volume, is mounted into a container running as a non-root user (UID 1000), but the volume's root directory is owned by root and is not writable by other users. When the container process attempts to write to the mount path, the Linux kernel denies access.

The Solution (fsGroup): To resolve this, Kubernetes provides the fsGroup setting within the Pod-level securityContext.

```yaml
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 2000  # <--- This fixes the permission conflict
  volumes:
  - name: my-data
    emptyDir: {}
```

When fsGroup is specified, the Kubelet steps in during volume setup and, for volume types that support ownership management, performs two explicit actions:

  1. Recursively changes the Group ID (GID) ownership of the volume's contents to match the specified fsGroup (e.g., GID 2000) and makes them group-readable and group-writable.
  2. Sets the setgid bit on directories so any new files created inside that volume will automatically inherit that same GID.

Because every container process in the Pod also receives GID 2000 as a supplementary group at startup, it can write to the volume through its group permissions without requiring root privileges.
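
The recursive ownership change can be slow on volumes that hold many files. The sketch below shows a complete Pod that opts into the standard fsGroupChangePolicy field with the value OnRootMismatch, which skips the recursive pass when the volume's root directory already has the expected ownership and permissions. The Pod name, image, mount path, runAsGroup value, and claim name are hypothetical placeholders, and a PersistentVolumeClaim is used because this policy has no effect on ephemeral volume types such as emptyDir.

```yaml
# Hypothetical Pod: the name, image, paths, and claim name are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000                       # primary GID of the container processes
    fsGroup: 2000                          # group that owns the volume; added as a supplementary group
    fsGroupChangePolicy: OnRootMismatch    # skip the recursive chown if the root directory already matches
  containers:
  - name: app
    image: registry.example.com/app:1.0    # placeholder image
    volumeMounts:
    - name: my-data
      mountPath: /data
  volumes:
  - name: my-data
    persistentVolumeClaim:
      claimName: my-data-claim             # assumes this claim already exists
```

To check the result, run id and ls -ld /data inside the container: GID 2000 should appear among the process's groups and as the group owner of the mount path.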

Based on Kubernetes v1.35 (Timbernetes).