
Cilium eBPF

How does Cilium use eBPF to replace kube-proxy entirely and route a packet without touching iptables?

Cilium is a networking, observability, and security solution built on an eBPF-based data plane, providing a Layer 3 network across Kubernetes clusters.

Cilium runs an agent Pod on every node (as a DaemonSet) and programs the Linux kernel directly. In this mode it fully replaces kube-proxy.

eBPF lets the kernel load sandboxed programs at runtime and attach them to specific hooks in the network stack. Instead of iptables' linear rule evaluation, Cilium performs all service lookups against eBPF maps, kernel hash tables with O(1) lookup.

Here is the architectural journey of a packet from a Pod to a Service IP under Cilium's kube-proxy replacement.

1. The Socket Hook (Socket-Level Load Balancing)

The most efficient way to route a packet is to intercept the connection before the packet is even constructed by the operating system. Cilium achieves this by attaching eBPF programs to Linux cgroups at the socket layer (specifically, BPF_PROG_TYPE_CGROUP_SOCK_ADDR programs attached to cgroup hooks such as connect4, connect6, sendmsg4, and sendmsg6).

When an application inside a source Pod initiates a connection (e.g., calling the connect() system call) to a Kubernetes Service Virtual IP (VIP), the following sequence unfolds:

  1. Syscall Interception: The kernel runs the eBPF program attached to the socket hook before the TCP/IP stack constructs any packet at all.
  2. Service Map Lookup: The eBPF program reads the requested destination IP (the Service VIP). It queries a pinned eBPF Service Map: a local hash table, maintained by the Cilium daemon, that maps every Kubernetes Service to its healthy backend Pod IPs.
  3. Endpoint Selection: If the destination matches a Service VIP, the eBPF program instantly selects a specific backend Pod IP from the map (effectively acting as an invisible load balancer).
  4. Direct Translation: The eBPF program rewrites the destination address in the socket context from the Service VIP to the chosen backend Pod IP.

The Result: Because this Destination NAT (DNAT) occurs inside the socket structure, the Linux TCP/IP stack generates the outgoing packet with the backend Pod IP already set as the destination. The kernel's lower routing layers never see the Service VIP the application originally requested. The packet bypasses iptables and the netfilter connection-tracking (conntrack) machinery entirely.
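The sequence above can be sketched in user-space C. This is a minimal, illustrative model, not Cilium code: in the real data path this logic runs as a BPF_PROG_TYPE_CGROUP_SOCK_ADDR program on the connect4 hook, and a plain open-addressing table stands in here for the pinned eBPF service map. All names are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the pinned eBPF service map: VIP:port -> backend. */
#define MAP_SLOTS 64

struct svc_key   { uint32_t vip;        uint16_t port; };
struct svc_value { uint32_t backend_ip; uint16_t backend_port; };

static struct { struct svc_key k; struct svc_value v; int used; } svc_map[MAP_SLOTS];

static unsigned slot(struct svc_key k) { return (k.vip ^ k.port) % MAP_SLOTS; }

static void svc_map_update(struct svc_key k, struct svc_value v) {
    unsigned i = slot(k);
    while (svc_map[i].used &&
           (svc_map[i].k.vip != k.vip || svc_map[i].k.port != k.port))
        i = (i + 1) % MAP_SLOTS;           /* linear probing on collision */
    svc_map[i].k = k; svc_map[i].v = v; svc_map[i].used = 1;
}

static struct svc_value *svc_map_lookup(struct svc_key k) {
    for (unsigned i = slot(k); svc_map[i].used; i = (i + 1) % MAP_SLOTS)
        if (svc_map[i].k.vip == k.vip && svc_map[i].k.port == k.port)
            return &svc_map[i].v;
    return 0;                              /* not a Service VIP */
}

/* Mirrors the rewrite the connect4 program performs on the socket
 * context (struct bpf_sock_addr) before any packet is built. */
struct sock_addr { uint32_t user_ip4; uint16_t user_port; };

static void connect4_hook(struct sock_addr *ctx) {
    struct svc_value *be =
        svc_map_lookup((struct svc_key){ ctx->user_ip4, ctx->user_port });
    if (be) {                              /* destination is a Service VIP */
        ctx->user_ip4  = be->backend_ip;   /* DNAT at the socket layer:   */
        ctx->user_port = be->backend_port; /* no packet is ever rewritten */
    }
}
```

After `connect4_hook` runs, the TCP/IP stack only ever sees the backend Pod address; connections to non-Service destinations pass through untouched.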

2. The TC Ingress Hook (Traffic Control)

While socket-level load balancing handles outbound connections originating from Pods, Cilium must also handle packets that originate outside the worker node (e.g., external clients hitting a NodePort or LoadBalancer Service), as well as traffic, such as some UDP flows, that does not follow a TCP-style connection lifecycle.

For this traffic, Cilium attaches additional eBPF programs to the Traffic Control (TC) subsystem of the Linux network interfaces: specifically, the veth (virtual ethernet) interfaces of the Pods and the physical interface (e.g., eth0) of the host machine.

When a packet arrives at an interface:

  1. Early Interception: The eBPF program hooked at TC ingress intercepts the raw packet as soon as it is pulled from the network driver's ring buffer, before it reaches the kernel's routing code or any legacy iptables chains.

  2. Policy and Routing Maps: The eBPF program parses the packet headers and immediately queries the eBPF Policy Maps and Endpoint Maps.

  3. Identity Resolution: Unlike firewalls that filter by volatile IP addresses, Cilium uses an identity-based security model decoupled from network addressing. The eBPF program resolves the security identity of both source and destination (derived from labels, not IP addresses) and checks whether a NetworkPolicy permits the connection.

  4. Packet Forwarding: If the policy allows the traffic, and if DNAT is required (e.g., for external traffic hitting a Service VIP), the eBPF program rewrites the packet's IP headers directly in kernel memory. It then issues a redirect (bpf_redirect or bpf_redirect_peer) to deliver the packet straight into the destination Pod's veth interface, bypassing the host routing table.
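The decision logic above can be condensed into a short user-space sketch. The identity numbers, endpoint table, and policy table are all hypothetical; in Cilium these live in pinned eBPF maps, and the final redirect is a helper call such as bpf_redirect_peer() into the Pod's veth.

```c
#include <stdint.h>

enum verdict { VERDICT_DROP, VERDICT_REDIRECT };

struct packet { uint32_t src_ip, dst_ip; uint16_t dport; };

/* Endpoint map: IP -> security identity (derived from Pod labels). */
static struct { uint32_t ip; uint32_t identity; } endpoints[] = {
    { 0x0A000001, 100 },   /* e.g., a "frontend" Pod */
    { 0x0A000002, 200 },   /* e.g., a "backend" Pod  */
};

/* Policy map: allowed (src identity, dst identity, dst port) tuples. */
static struct { uint32_t src_id, dst_id; uint16_t dport; } policy[] = {
    { 100, 200, 8080 },    /* frontend may reach backend on 8080 */
};

static uint32_t identity_of(uint32_t ip) {
    for (unsigned i = 0; i < sizeof endpoints / sizeof *endpoints; i++)
        if (endpoints[i].ip == ip) return endpoints[i].identity;
    return 0;              /* unknown identity */
}

/* Models the verdict of the eBPF program at TC ingress. */
static enum verdict tc_ingress(const struct packet *p) {
    uint32_t src = identity_of(p->src_ip);
    uint32_t dst = identity_of(p->dst_ip);
    for (unsigned i = 0; i < sizeof policy / sizeof *policy; i++)
        if (policy[i].src_id == src && policy[i].dst_id == dst &&
            policy[i].dport == p->dport)
            return VERDICT_REDIRECT;    /* ~ bpf_redirect_peer() */
    return VERDICT_DROP;
}
```

Note that the verdict depends only on identities and ports: if a backend Pod is rescheduled and changes IP, only the endpoint table entry changes, while the policy itself is untouched.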

3. The Architectural Benefits of Bypassing iptables

By entirely replacing kube-proxy and eliminating the iptables framework from the core data path, Cilium fundamentally alters the performance characteristics and scalability limits of the Kubernetes cluster network:

1. Eliminating Algorithmic Complexity

Standard kube-proxy in iptables mode relies on sequential firewall rule evaluation. In a cluster with 10,000 Services, the kernel must traverse an O(N) rule list linearly to find the correct Service translation for the first packet of each connection, driving up CPU usage and connection latency as the cluster grows.

eBPF maps, however, are implemented as hash tables. Service endpoint lookups occur in O(1) constant time, regardless of whether the cluster possesses 10 Services or 100,000 Services.
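The difference can be made concrete by counting lookup probes. This is an illustrative comparison, not Cilium or iptables code: a linear rule scan stands in for iptables, and a direct hash-slot access (collision handling omitted for brevity) stands in for an eBPF hash map.

```c
#include <stdint.h>

#define N_SERVICES 10000

static uint32_t rules[N_SERVICES];   /* iptables-style sequential rule list */
static uint32_t table[N_SERVICES];   /* eBPF-hash-map-style slot array      */

/* O(N): walk every rule until the VIP matches. */
static int linear_probes(uint32_t vip) {
    int probes = 0;
    for (int i = 0; i < N_SERVICES; i++) {
        probes++;
        if (rules[i] == vip) break;
    }
    return probes;
}

/* O(1): jump straight to the hashed slot. */
static int hash_probes(uint32_t vip) {
    (void)table[vip % N_SERVICES];
    return 1;
}
```

A VIP matched by the last rule costs 10,000 probes in the linear model but a single probe in the hash model, and the gap widens linearly with every Service added.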

2. Reduced Context Switching

The socket-level load balancing hook translates the address before a packet traverses the upper networking stack. This shortens the code path each request takes through kernel subsystems, lowering network latency for microservice requests.

3. Bypassing Netfilter Exhaustion

The standard Linux connection tracking (conntrack) subsystem that iptables relies on is prone to race conditions and exhaustion under heavy, concurrent microservice load (e.g., bursts of UDP DNS queries rapidly filling the finite conntrack table).

By tracking connection state inside eBPF maps, Cilium avoids the lock contention and table exhaustion that plague conntrack under high load.

Based on Kubernetes v1.35 (Timbernetes).