Etcd Operations & Disaster Recovery

Etcd is the brain of the cluster: it stores all cluster state. If it dies or loses quorum, the API server can no longer persist changes (or fails entirely). You must know how to back it up and restore it.

1. The Environment Variables

Etcd is secured by mutual TLS (mTLS), so a bare etcdctl member list is rejected. You need to pass the CA certificate, the client certificate, and the key on every call.

The Alias (Save this to ~/.bashrc)

bash
export ETCDCTL_API=3
alias e='etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key'

Now you can just type e, for example e member list.
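Since this section is titled after environment variables, here is a minimal sketch of the same setup without an alias. It assumes the kubeadm default certificate paths used in the alias and a local endpoint on port 2379; etcdctl reads each flag from a matching ETCDCTL_-prefixed variable.

```bash
# Assumed kubeadm default paths, identical to the ones in the alias.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379   # assumption: etcd listens locally
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# With these exported, plain etcdctl picks up the TLS settings:
#   etcdctl member list
```

Pick one approach per shell session. etcdctl is known to reject a setting that is given both as a flag and as an environment variable, so the alias and these exports should not be combined.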

2. Backup (Snapshot Save)

A snapshot is a consistent point-in-time copy of the keyspace, taken from the single member that the command talks to.

bash
e snapshot save /tmp/etcd-backup.db

Verify the backup:

bash
e snapshot status /tmp/etcd-backup.db --write-out=table
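The save and verify steps can be combined into one helper, which is handy for cron jobs. This is a sketch, not a definitive backup tool: the function names etcd_cmd and backup_etcd and the default directory /var/backups/etcd are hypothetical. A function is used instead of the alias because aliases are not expanded inside scripts.

```bash
# Hypothetical wrapper function: unlike an alias, it also works inside scripts.
etcd_cmd() {
  etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key "$@"
}

# Hypothetical helper: save a timestamped snapshot, verify it, print its path.
# The default directory is an assumption; pass another one as the first argument.
backup_etcd() {
  backup_dir="${1:-/var/backups/etcd}"
  backup_file="${backup_dir}/etcd-$(date +%Y%m%d-%H%M%S).db"
  mkdir -p "$backup_dir" || return 1
  # Tool output goes to stderr so that stdout carries only the final path.
  etcd_cmd snapshot save "$backup_file" >&2 || return 1
  etcd_cmd snapshot status "$backup_file" --write-out=table >&2 || return 1
  echo "$backup_file"
}
```

Calling backup_etcd with no argument writes to the assumed default directory; the printed path can be handed to whatever copies the file off the node.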

3. Restore (The Dangerous Part)

To restore, you must:

  1. Stop all API servers (move the static pod manifests out of /etc/kubernetes/manifests).
  2. Restore the data to a new data directory.
  3. Update the etcd manifest to point to the new data directory.
  4. Move the API server manifests back so the kubelet restarts them.

Command:

bash
e snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd-restored
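The whole restore sequence can be sketched as one function. Treat it as an illustration under stated assumptions, not a tested procedure: it assumes a single control plane node set up by kubeadm, an etcd manifest whose hostPath volume reads "path: /var/lib/etcd", and a hypothetical holding directory /etc/kubernetes/manifests-stopped. The function name restore_etcd and all variable names are made up for this example.

```bash
# All four values are assumptions; override them before calling restore_etcd.
MANIFESTS="${MANIFESTS:-/etc/kubernetes/manifests}"
PARKED="${PARKED:-/etc/kubernetes/manifests-stopped}"   # hypothetical holding dir
SNAPSHOT="${SNAPSHOT:-/tmp/etcd-backup.db}"
NEW_DATA_DIR="${NEW_DATA_DIR:-/var/lib/etcd-restored}"

restore_etcd() {
  # 1. Stop the API server: the kubelet removes the pod once its manifest is gone.
  mkdir -p "$PARKED" || return 1
  mv "$MANIFESTS/kube-apiserver.yaml" "$PARKED/" || return 1

  # 2. Restore into a new data directory. The restore reads the local snapshot
  #    file, so no TLS flags are passed here.
  etcdctl snapshot restore "$SNAPSHOT" --data-dir="$NEW_DATA_DIR" || return 1

  # 3. Repoint the hostPath volume at the restored directory. The edit is staged
  #    outside the manifests directory so the kubelet never sees a partial file.
  edited=$(mktemp) || return 1
  sed "s|path: /var/lib/etcd\$|path: $NEW_DATA_DIR|" \
    "$MANIFESTS/etcd.yaml" > "$edited" || return 1
  cat "$edited" > "$MANIFESTS/etcd.yaml" || return 1
  rm -f "$edited"

  # Finally, put the API server manifest back so the kubelet starts it again.
  mv "$PARKED/kube-apiserver.yaml" "$MANIFESTS/" || return 1
}
```

Only the hostPath is changed, so the container keeps seeing the data at the path its --data-dir flag already names. On a multi-node control plane each API server would have to be stopped first, which this sketch does not attempt.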

4. Debugging Etcd

Check Cluster Health

bash
e endpoint health
e endpoint status --write-out=table

If the IS LEARNER column says true, that member is a non-voting learner that is still catching up. If HEALTH is false, check connectivity to port 2379.
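The "check port 2379" advice can be sketched as a small triage helper. The function name check_etcd is hypothetical, and the sketch assumes that nc (netcat) is installed and that the certificates sit at the kubeadm default paths.

```bash
# Hypothetical helper: ask etcd for its health, then fall back to a TCP probe.
# Usage: check_etcd [host]   (host defaults to 127.0.0.1)
check_etcd() {
  host="${1:-127.0.0.1}"
  # etcdctl output is sent to stderr so stdout carries only the verdict.
  if etcdctl --endpoints="https://${host}:2379" \
       --cacert=/etc/kubernetes/pki/etcd/ca.crt \
       --cert=/etc/kubernetes/pki/etcd/server.crt \
       --key=/etc/kubernetes/pki/etcd/server.key \
       endpoint health >&2; then
    echo "healthy"
  elif nc -z -w 2 "$host" 2379; then
    # The port answers, so the problem is likely etcd itself or the TLS setup.
    echo "port open, etcd unhealthy"
  else
    # Nothing listens or a firewall blocks the connection.
    echo "port 2379 unreachable"
  fi
}
```

Distinguishing the last two cases tells you whether to look at the etcd logs and certificates or at the network path first.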