
How do I troubleshoot etcd in a Kubernetes cluster?

When etcd degrades, your API server typically follows suit. You will see the API server fail to start or become unresponsive — meaning you can no longer deploy new Pods or change cluster state. However, already-running Pods continue to function normally, provided they don't depend on the API server to operate.

Because etcd is a leader-based distributed system, it requires a stable leader that periodically sends heartbeats to all of its followers. When diagnosing an etcd problem, you are almost always trying to determine if the cluster has lost its leader due to one of three root causes: quorum collapse, disk I/O starvation, or network latency.


The Diagnostic Tools

Before identifying the exact bottleneck, you need the right utilities. There are two primary tools for interacting with etcd:

  • etcdctl — network-based interactions: checking cluster health, managing keys, administering cluster membership
  • etcdutl — direct operations on local etcd data files: restoring snapshots, defragmenting the database, validating data consistency

Know which one to reach for. etcdctl talks to a running cluster over the network. etcdutl operates directly on the data files on disk — useful when the cluster itself is down.
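Every etcdctl invocation in this guide repeats the same four TLS flags. A small wrapper function (the name `etcdctl_k8s` is just a local convenience, not an official tool; the cert paths are the kubeadm defaults used throughout this guide) keeps interactive sessions readable:

```bash
# Convenience wrapper: prepend the kubeadm-default endpoint and TLS
# flags to any etcdctl subcommand. Adjust the paths for your cluster.
etcdctl_k8s() {
  etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Usage (on a control-plane node):
#   etcdctl_k8s member list
#   etcdctl_k8s endpoint health
```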


1. Diagnosing Quorum Issues

etcd relies on a consensus protocol — a strict majority (quorum) of nodes must be healthy and communicating to elect a leader and commit writes. This is why production clusters always run an odd number of etcd members, typically three or five.

Signs it is a quorum issue: The majority of your etcd members have failed or become isolated. The cluster cannot make any state changes. Kubernetes cannot schedule new Pods.

```bash
# List members — confirm the cluster recognizes all expected nodes
etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check health of each endpoint
etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

If more than half of your nodes show as unhealthy or unreachable, you have lost quorum. Manual intervention is required — remove the broken members and add new ones to restore majority:

```bash
# Remove a permanently failed member
# (pass the same --endpoints/--cacert/--cert/--key flags as above)
etcdctl member remove <MEMBER_ID>

# Add a new replacement member
etcdctl member add <name> --peer-urls=https://<new-node-ip>:2380
```

Quorum math: A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2. A 2-node cluster tolerates 0, which makes it effectively worse than a single node: quorum is 2, so either member failing halts all writes, and with two machines you have twice the chance of a failure.
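The arithmetic generalizes: quorum for an n-member cluster is floor(n/2) + 1, and the cluster survives n minus that many failures. A quick shell sketch makes it plain that even member counts buy no extra tolerance:

```bash
# Quorum for an n-member etcd cluster is floor(n/2) + 1; the cluster
# survives n - quorum member failures. Print the table for small n.
quorum_table() {
  for n in 1 2 3 4 5; do
    q=$(( n / 2 + 1 ))
    echo "members=$n quorum=$q tolerated_failures=$(( n - q ))"
  done
}
quorum_table
```

Note that 4 members tolerate no more failures than 3, which is why production clusters run odd sizes.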


2. Diagnosing Disk I/O Issues

etcd is notoriously sensitive to disk I/O performance. Every state mutation must be persisted to a durable Write-Ahead Log (WAL) before etcd acknowledges the write. Any disk latency directly impacts the entire cluster.

Signs it is a disk issue: etcd experiences heartbeat timeouts, leading to cluster instability and constant leader elections. Crucially, the symptoms look exactly like a network issue — dropped heartbeats, failed elections — but the root cause is the disk being too slow to sync data.

How to isolate disk vs network:

```bash
# Check if the database has exceeded its storage quota
etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table
```

Look at the DB SIZE column. The default quota is 2 GB; once it is exceeded, etcd raises a NOSPACE alarm, goes read-only, and rejects all further writes.
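For scripting, the JSON form of `endpoint status` exposes the same information as `dbSize` and `header.revision` fields, which you can extract without jq. The `$json` below is a canned, abridged sample standing in for real command output; the field names match etcdctl's JSON, but the values are invented:

```bash
# Abridged sample of: etcdctl endpoint status --write-out=json ...
json='[{"Endpoint":"https://127.0.0.1:2379","Status":{"header":{"revision":41234},"dbSize":1572864000}}]'

# Pull out the current revision and DB size with grep/cut
rev=$(echo "$json" | grep -o '"revision":[0-9]*' | cut -d: -f2)
db_bytes=$(echo "$json" | grep -o '"dbSize":[0-9]*' | cut -d: -f2)
echo "revision=$rev db_mb=$(( db_bytes / 1024 / 1024 ))"
```

The revision value is what you pass to `etcdctl compact` when reclaiming space.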

```bash
# Defragmentation only returns space already released by compaction,
# so compact old revisions first
# (<rev> = current revision, from `endpoint status --write-out=json`;
#  pass the same TLS flags as below)
etcdctl compact <rev>

# Then defragment to return the freed space to the filesystem
etcdctl defrag \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Clear the space alarm after defrag
etcdctl alarm disarm \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

To confirm I/O starvation specifically, check fsync latency in your monitoring stack: high fsync duration on the etcd WAL directory is the clearest indicator that the disk cannot keep up. etcd should run on SSDs or NVMe; spinning disks and network-attached storage are common culprits here.
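etcd exports WAL fsync latency as the Prometheus histogram etcd_disk_wal_fsync_duration_seconds; the usual rule of thumb is that the 99th percentile should stay under 10ms. A rough sketch of reading the histogram buckets from a scrape, with invented counts (in practice you would scrape etcd's metrics endpoint, http://127.0.0.1:2381/metrics on a kubeadm cluster, or query Prometheus):

```bash
# Canned scrape fragment; the bucket counts are made up for illustration.
# le="0.01" counts fsyncs that completed within 10ms.
metrics='etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 9000
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 10000'

under_10ms=$(echo "$metrics" | grep 'le="0.01"' | awk '{print $2}')
total=$(echo "$metrics" | grep '+Inf' | awk '{print $2}')
pct=$(( under_10ms * 100 / total ))
echo "fsyncs under 10ms: ${pct}%"
[ "$pct" -lt 99 ] && echo "disk likely too slow for etcd"
```

Here only 90% of fsyncs finish within 10ms, well short of the 99% target, so the disk is the bottleneck.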


3. Diagnosing Network Issues

Because etcd operates across multiple nodes, the network must reliably carry the leader's heartbeats to followers.

Signs it is a network issue: Heartbeat timeouts identical to disk issues, but your disk metrics are clean. Followers assume the leader is dead and trigger new elections repeatedly, leading to severe instability where no leader successfully holds.

```bash
# Check response times across all endpoints
etcdctl endpoint health \
  --endpoints=https://<node1>:2379,https://<node2>:2379,https://<node3>:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table
```

High response times or timeouts combined with a healthy disk point to a network bottleneck. Check:

  • Port 2379 — client communication between the API server and etcd
  • Port 2380 — server-to-server peer communication between etcd members
  • Firewall rules and security groups between etcd nodes
  • Network interface errors: ip -s link on each etcd node
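bash can test TCP reachability on its own via the /dev/tcp pseudo-device, which is handy on minimal control-plane nodes where nc isn't installed. A small probe (the peer hostnames in the usage sketch are placeholders):

```bash
# Probe a TCP port using bash's built-in /dev/tcp redirection;
# prints "open" if a connection succeeds within 2 seconds.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Run from each etcd node against every peer, e.g.:
#   for peer in <node1> <node2> <node3>; do
#     echo "$peer 2380: $(check_port "$peer" 2380)"
#   done
```

Port 2380 showing closed between any pair of members while 2379 is open is a classic sign of a firewall rule that only allowed client traffic.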

Quick Diagnosis Flowchart

etcd unhealthy?

├── kubectl get pods -n kube-system | grep etcd  → crashlooping?
│   └── journalctl -u kubelet | grep etcd         → check logs

├── etcdctl endpoint health                        → quorum check
│   └── > 50% unhealthy → quorum lost, remove failed members, add replacements

├── etcdctl endpoint status --write-out=table      → DB size check
│   └── near 2GB limit → defrag + disarm alarm

└── disk healthy + quorum healthy → check network
    └── ports 2379 and 2380 open between all members?

Based on Kubernetes v1.35 (Timbernetes).