What are the security primitives used to harden containerized workloads?

Kubernetes security relies on Defense in Depth. While namespaces and cgroups provide the basic structure of a container, the primitives below—Capabilities, Seccomp, AppArmor/SELinux, and User Namespaces—secure the boundary between the container and the host kernel.

1. Linux Capabilities

Concept: Breaking "Root" into Pieces. Traditionally, the Unix root user (UID 0) has unlimited power. Linux Capabilities break this power into distinct units. This allows you to grant a process the specific privileges it needs (like binding to a low-numbered network port) without giving it full superuser access to the entire system.

What they are: A granular set of permissions (e.g., CAP_CHOWN, CAP_KILL). In a Kubernetes securityContext, the names are written without the CAP_ prefix.
Common Capabilities:
- NET_ADMIN: Allows interface configuration, firewall rule manipulation, and routing table modification.
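- SYS_ADMIN: The "catch-all" capability. It is very powerful and allows mounting filesystems plus a long list of other administrative operations, which makes it close to full root in practice. (ptrace is governed by the separate SYS_PTRACE capability.) Avoid granting this if possible.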
- NET_BIND_SERVICE: Allows binding to ports below 1024 (like port 80).
Dropping Capabilities: A best practice is to drop all capabilities and only add back the ones strictly necessary. This minimizes the "blast radius" if an attacker compromises the container.

Example: Dropping ALL and adding specific networking permission

yaml

apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  containers:
  - name: sec-ctx-demo
    image: nginx
    securityContext:
      capabilities:
        drop:
        - ALL          # Best Practice: Start with zero privileges
        add:
        - NET_BIND_SERVICE    # Only grant the right to bind ports below 1024

2. Seccomp (Secure Computing Mode)

Concept: A Firewall for System Calls. Capabilities decide which privileged operations a process may perform; Seccomp decides which system calls (syscalls) it may issue to the Linux kernel at all. To see what these calls look like under the hood, trace a program with strace.

System Call Filtering: Every time a program opens a file, accepts a connection, or exits, it makes a syscall. Seccomp restricts which of these calls are allowed. If a containerized application tries to use a restricted syscall (like reboot), the kernel rejects it, either by returning an error to the caller or by killing the process, depending on the profile.
Profiles:
- Unconfined: No filtering (Dangerous).
- RuntimeDefault: A sane default profile provided by the container runtime. This blocks highly dangerous calls and is the recommended baseline.
- Localhost: A custom profile defined in a JSON file on the node.

Example: Enforcing the RuntimeDefault profile

yaml

apiVersion: v1
kind: Pod
metadata:
  name: default-seccomp
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # Uses the container runtime's safe default
  containers:
  - name: test-container
    image: nginx
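
The Localhost profile type listed above can be referenced in a similar way. The following is a minimal sketch, assuming a custom profile has already been copied to the node; the file name profiles/audit.json and the Pod name are hypothetical. The path is resolved relative to the kubelet's seccomp directory (by default /var/lib/kubelet/seccomp).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: localhost-seccomp        # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: Localhost            # Use a custom profile stored on the node
      # Hypothetical file; relative to the kubelet's seccomp directory
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: nginx
```

If the file is missing on the node where the Pod is scheduled, the container will fail to start, so the profile has to be distributed to every node that may run the workload.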

3. AppArmor / SELinux (Mandatory Access Control)

Concept: Policy-Driven Access Control. AppArmor and SELinux are Linux Security Modules (LSMs) that act as a second layer of enforcement. Even if a process has "permission" (via standard Linux file permissions) to touch a file, MAC (Mandatory Access Control) can still block the access based on a security profile. A node typically runs one or the other, depending on its distribution.

AppArmor (Application Armor):
- Uses Path-based profiles. It restricts what files an executable can read, write, or execute.
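- K8s Usage: You load profiles onto each node (e.g., k8s-nginx) and reference them in the Pod spec, either through the appArmorProfile field of securityContext (Kubernetes v1.30+) or through a per-container annotation on older versions.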
- Modes: Profiles can be in Enforce (block actions) or Complain (log actions) mode.
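
As a rough illustration of referencing a node-local AppArmor profile, the sketch below assumes a cluster running Kubernetes v1.30 or later and that a profile named k8s-nginx is already loaded on the node. The Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo            # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      appArmorProfile:
        type: Localhost          # Profile must already be loaded on the node
        localhostProfile: k8s-nginx
```

On older clusters the same intent is expressed with the annotation container.apparmor.security.beta.kubernetes.io/<container-name> set to localhost/k8s-nginx.
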
SELinux (Security-Enhanced Linux):
- Uses Label-based controls. Every file and process has a label (e.g., container_t). Policies dictate how labels interact.
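- K8s Usage: The seLinuxOptions field in securityContext assigns a label to the container's processes. The container runtime can then relabel volumes, or mount them with a matching context, so that those processes are allowed to access them.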
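
A minimal sketch of setting an SELinux label through seLinuxOptions is shown below. It assumes the node runs SELinux in enforcing mode; the Pod name and the category pair in the level value are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo             # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      # Illustrative MCS level; processes and relabeled volumes share it
      level: "s0:c123,c456"
  containers:
  - name: test-container
    image: nginx
```

On nodes without SELinux the field has no effect, so this setting is only meaningful on SELinux-enabled distributions.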

4. User Namespaces and UID Mapping

Concept: Identity Isolation (The Ultimate Illusion). User Namespaces decouple the user ID (UID) inside the container from the UID on the host.

The Mechanism:
- Inside the container: The process looks like root (UID 0). It can install packages and modify files owned by root inside the container image.
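- Outside (On the Host): The kernel maps UID 0 inside to an unprivileged UID on the host, usually taken from a high range that no host account uses (e.g., UID 100000).
Security Implications: This substantially reduces risk. If an attacker manages to "break out" of the container, they land on the host node as an unprivileged user: capabilities held inside the container are valid only within its user namespace, so the attacker cannot read root-owned system files or signal processes belonging to other users.
Configuration: Enabled by setting hostUsers: false in the Pod spec. The feature was introduced as alpha in Kubernetes v1.25 and, depending on the version, may require a feature gate. It also needs support from the node's kernel and container runtime.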
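
A minimal sketch of opting a Pod into its own user namespace follows. It assumes the cluster and nodes meet the requirements above; the Pod name, image, and command are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo              # hypothetical name
spec:
  hostUsers: false               # Run the Pod in a separate user namespace
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
```

Inside such a Pod, reading /proc/self/uid_map shows how container UIDs are translated to host UIDs.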

Summary Comparison

| Primitive | Controls... | Best Practice |
| --- | --- | --- |
| Capabilities | Privilege Power. (Can I modify network settings?) | Drop `ALL`, add only needed. |
| Seccomp | Kernel Actions. (Can I call `reboot` or `swapon`?) | Use `RuntimeDefault`. |
| AppArmor/SELinux | Resource Access. (Can I write to `/etc`?) | Use standard profiles; avoid `Unconfined`. |
| User Namespaces | User Identity. (Am I really root?) | Enable `hostUsers: false` where supported. |
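
The best practices in the table can be combined in a single manifest. The sketch below is one possible combination, not a definitive baseline. It assumes a cluster that supports user namespaces and the appArmorProfile field (v1.30+) on AppArmor-enabled nodes; the Pod name is hypothetical, and a real image may need a few more capabilities than shown in order to start.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo            # hypothetical name
spec:
  hostUsers: false               # User namespaces: root inside is unprivileged on the host
  securityContext:
    seccompProfile:
      type: RuntimeDefault       # Seccomp: the runtime's default syscall filter
  containers:
  - name: app
    image: nginx
    securityContext:
      capabilities:
        drop:
        - ALL                    # Capabilities: start from zero
        add:
        - NET_BIND_SERVICE       # Add back only what the workload needs
      appArmorProfile:
        type: RuntimeDefault     # MAC: the runtime's default AppArmor profile
```

Each setting is independent, so any that the cluster does not support can be removed without affecting the others.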
