What are the security primitives used to harden containerized workloads?
Kubernetes security relies on Defense in Depth. While namespaces and cgroups provide the basic structure of a container, the primitives below—Capabilities, Seccomp, AppArmor/SELinux, and User Namespaces—secure the boundary between the container and the host kernel.
1. Linux Capabilities
Concept: Breaking "Root" into Pieces. Traditionally, the Unix root user (UID 0) has unlimited power. Linux Capabilities break this power into distinct units. This allows you to grant a process the specific privileges it needs (like binding to a low-numbered network port) without giving it full superuser access to the entire system.
- What they are: A granular set of permissions (e.g., CAP_CHOWN, CAP_KILL).
- Common Capabilities:
  - NET_ADMIN: Allows interface configuration, firewall rule manipulation, and routing table modification.
  - SYS_ADMIN: The "catch-all" capability. It is very powerful and allows mounting filesystems, changing the hostname, managing swap, and much more. Avoid granting this if possible.
  - NET_BIND_SERVICE: Allows binding to ports below 1024 (like port 80).
- Dropping Capabilities: A best practice is to drop all capabilities and only add back the ones strictly necessary. This minimizes the "blast radius" if an attacker compromises the container.
Example: Dropping ALL and adding specific networking permission
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  containers:
  - name: sec-ctx-demo
    image: nginx
    securityContext:
      capabilities:
        drop:
        - ALL        # Best Practice: Start with zero privileges
        add:
        - NET_ADMIN  # Only grant specific network admin rights
```
2. Seccomp (Secure Computing Mode)
Concept: A Firewall for System Calls. While capabilities control permission, Seccomp controls action. It filters the system calls (syscalls) a process can make to the Linux kernel. To understand what these calls look like under the hood, learn how to trace them using strace.
- System Call Filtering: Every time a program opens a file, accepts a connection, or exits, it makes a syscall. Seccomp restricts which of these calls are allowed. If a containerized application tries to use a restricted syscall (like reboot), the kernel blocks it, typically by returning an error to the caller or by killing the process, depending on the profile.
- Profiles:
  - Unconfined: No filtering (dangerous).
  - RuntimeDefault: A sane default profile provided by the container runtime. This blocks highly dangerous calls and is the recommended baseline.
  - Localhost: A custom profile defined in a JSON file on the node.
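
The Localhost option can be sketched in the same way as the other profile types. This is a minimal illustration, not a recommended profile: the Pod name and the file name profiles/audit.json are hypothetical, and the example assumes the JSON file already exists under the kubelet's seccomp directory on the node (by default /var/lib/kubelet/seccomp).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: localhost-seccomp          # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Path is relative to the kubelet's seccomp directory on the node;
      # profiles/audit.json is a hypothetical file that must already exist there.
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: nginx
```

If the referenced file is missing on the node where the Pod is scheduled, the container is expected to fail to start, so custom profiles usually need to be distributed to every node first.
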
Example: Enforcing the RuntimeDefault profile
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: default-seccomp
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # Uses the container runtime's safe default
  containers:
  - name: test-container
    image: nginx
```
3. AppArmor / SELinux (Mandatory Access Control)
Concept: Policy-Driven Access Control. These are kernel modules that act as a second layer of enforcement. Even if a user has "permission" (via standard Linux permissions) to touch a file, MAC (Mandatory Access Control) can block it based on a security profile.
AppArmor (Application Armor):
- Uses Path-based profiles. It restricts what files an executable can read, write, or execute.
- K8s Usage: You load profiles onto the node (e.g., k8s-nginx) and reference them in the Pod spec.
- Modes: Profiles can be in Enforce (block actions) or Complain (log actions) mode.
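
Referencing a node-loaded profile from the Pod spec might look like the sketch below. It assumes a cluster recent enough to support the appArmorProfile field in securityContext (Kubernetes v1.30 or later) and that a profile named k8s-nginx is already loaded on the node; the Pod name is hypothetical. Older clusters use the annotation container.apparmor.security.beta.kubernetes.io/<container-name> with the value localhost/k8s-nginx instead.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo               # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      appArmorProfile:
        type: Localhost
        localhostProfile: k8s-nginx  # must already be loaded on the node
```
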
SELinux (Security-Enhanced Linux):
- Uses Label-based controls. Every file and process has a label (e.g., container_t). Policies dictate how labels interact.
- K8s Usage: Kubernetes can automatically mount volumes with the correct SELinux context using the seLinuxOptions field or mount options.
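
As a rough illustration of the seLinuxOptions field, the sketch below assigns an SELinux level to every container in a Pod. The Pod name and the level value are illustrative assumptions; real values depend on the SELinux policy installed on the node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo             # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"      # illustrative MCS level, not a recommendation
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    emptyDir: {}                 # volume that receives the Pod's SELinux context
```
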
4. User Namespaces and UID Mapping
Concept: Identity Isolation (The Ultimate Illusion). User Namespaces decouple the user ID (UID) inside the container from the UID on the host.
- The Mechanism:
  - Inside the container: The process looks like root (UID 0). It can install packages and modify files owned by root inside the container image.
  - Outside (On the Host): The kernel maps UID 0 inside to an unprivileged UID on the host (e.g., a high-numbered UID from a range allocated to the pod).
- Security Implications: This massively reduces risk. If an attacker manages to "break out" of the container, they find themselves as an unprivileged user on the host node, unable to read or modify root-owned system files or to signal processes belonging to other users.
- Configuration: Enabled via the hostUsers field in the Pod spec, set to false (requires Kubernetes v1.25+ and specific feature gates).
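
A minimal sketch of a Pod that opts into a user namespace is shown below. It assumes the cluster, container runtime, and node kernel all support the feature; the Pod name, image, and command are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo              # hypothetical name
spec:
  hostUsers: false               # run the Pod in its own user namespace
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
```

Inside such a container, the file /proc/self/uid_map shows how container UIDs translate to host UIDs, which is a quick way to confirm that the mapping is in effect.
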
Summary Comparison
| Primitive | Controls... | Best Practice |
|---|---|---|
| Capabilities | Privilege Power. (Can I modify network settings?) | Drop ALL, add only needed. |
| Seccomp | Kernel Actions. (Can I call reboot or swapon?) | Use RuntimeDefault. |
| AppArmor/SELinux | Resource Access. (Can I write to /etc?) | Use standard profiles; avoid Unconfined. |
| User Namespaces | User Identity. (Am I really root?) | Enable hostUsers: false where supported. |
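
The best practices in the table can be combined in a single Pod spec. The following is a sketch under stated assumptions rather than a drop-in template: the Pod name and image are hypothetical, the appArmorProfile field assumes Kubernetes v1.30 or later on AppArmor-enabled nodes, and hostUsers: false assumes user namespace support in the cluster. Whether a given application still starts with all capabilities dropped has to be tested per workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo                    # hypothetical name
spec:
  hostUsers: false                       # User Namespaces: container root is not host root
  securityContext:
    seccompProfile:
      type: RuntimeDefault               # Seccomp: runtime's default syscall filter
  containers:
  - name: app
    image: registry.example.com/app:1.0  # hypothetical image
    securityContext:
      appArmorProfile:
        type: RuntimeDefault             # AppArmor: runtime's default profile
      capabilities:
        drop:
        - ALL                            # Capabilities: start from zero
        add:
        - NET_BIND_SERVICE               # add back only what the app needs
```
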