Kubernetes Cluster Discovery & Storage Investigation Playbook
A practical field guide for when you land on an unfamiliar Kubernetes cluster and need to understand what's running, where data lives, and how to recover it.
Phase 1 — Cluster Scope
First thing — understand what you're dealing with at the cluster level.
```bash
kubectl cluster-info
kubectl version
kubectl get nodes -o wide
kubectl top nodes
kubectl describe nodes | grep -E "Taints|Roles|Capacity|Allocatable|CPU|Memory"
```
Phase 2 — Namespace Inventory
Get a sense of scale before diving into individual resources.
```bash
kubectl get namespaces
kubectl get all -A | wc -l
kubectl get pods -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn
```
Phase 3 — Workload Overview
Map out everything that's running.
```bash
kubectl get pods -A -o wide
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get deployments -A
kubectl get statefulsets -A
kubectl get daemonsets -A
kubectl get jobs -A
kubectl get cronjobs -A
```
Phase 4 — Unhealthy & Problem Pods
Find what's broken before anything else.
```bash
kubectl get pods -A | grep -vE "Running|Completed"
kubectl get pods -A | grep -E "CrashLoop|Error|Pending|Evicted|OOMKilled"
kubectl get pods -A -o json | jq '.items[] | select(.status.containerStatuses[]?.restartCount > 5) | .metadata.name'
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20
```
Phase 5 — Networking
```bash
kubectl get svc -A
kubectl get ingress -A
kubectl get networkpolicies -A
kubectl get endpoints -A | grep "<none>"
kubectl get svc -A | grep -v ClusterIP
```
Phase 6 — Storage
```bash
# PVs and StorageClasses are cluster-scoped, so no -A needed
kubectl get pv
kubectl get pvc -A
kubectl get pvc -A | grep -v Bound
kubectl get storageclass
```
Phase 7 — Config & Secrets
```bash
kubectl get configmaps -A
kubectl get secrets -A
kubectl get secrets -A | grep -v "kubernetes.io/service-account"
```
Phase 8 — RBAC & Security
```bash
kubectl get roles -A
kubectl get rolebindings -A
kubectl get clusterroles | grep -v "system:"
kubectl get clusterrolebindings | grep -v "system:"
kubectl auth can-i --list
kubectl get serviceaccounts -A
```
Phase 9 — Events (the most underused)
Events tell you what the cluster has been doing. Always check these.
```bash
kubectl get events -A --sort-by='.lastTimestamp'
kubectl get events -A --field-selector=type=Warning
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
```
Phase 10 — Resource Pressure
Find pods with no resource limits — these are ticking time bombs.
```bash
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].resources.limits == null) | .metadata.name'
kubectl get limitrange -A
kubectl get resourcequota -A
```
Phase 11 — Add-ons & Platform
```bash
kubectl get pods -n kube-system
kubectl get pods -n monitoring 2>/dev/null
kubectl get pods -n ingress-nginx 2>/dev/null
kubectl get pods -n cert-manager 2>/dev/null
helm list -A 2>/dev/null
```
Phase 12 — One-shot Triage Command
When you need a quick answer on what's broken:
```bash
kubectl get pods -A -o wide | grep -vE "Running|Completed" | sort -k4
```
Storage Deep Dive
1. Which Pod is Using What Storage
```bash
kubectl get pods -A -o json | jq '.items[] | {pod: .metadata.name, ns: .metadata.namespace, volumes: [.spec.volumes[]? | select(.persistentVolumeClaim) | .persistentVolumeClaim.claimName]}'
kubectl get pvc -A
kubectl get pvc -n <namespace> -o wide
# Map pod -> PVC -> PV in one view
kubectl get pods -n <namespace> -o json | jq '.items[] | {pod: .metadata.name, pvcs: [.spec.volumes[]?.persistentVolumeClaim?.claimName // empty]}'
```
2. Where is Data Actually Stored + Protocol
```bash
# PVC -> PV binding
kubectl get pvc -n <namespace> <pvc-name> -o yaml | grep -E "volumeName|storageClass|accessModes"
# PV details — where it actually lives
kubectl get pv <pv-name> -o yaml
```
The answer is in `.spec` — look for one of these blocks:
| Field | Meaning |
|---|---|
| `hostPath` | Local path on a specific node |
| `nfs` | NFS server + exported path |
| `csi` | Cloud/vendor CSI driver |
| `emptyDir` | RAM or node temp disk — not persistent |
| `awsElasticBlockStore` / `gcePersistentDisk` / `azureDisk` | Legacy in-tree cloud block storage |
```bash
# Check what type a PV is
kubectl get pv <pv-name> -o json | jq '.spec | keys'
# Extract the key details
kubectl get pv <pv-name> -o json | jq '{driver: .spec.csi?.driver, path: .spec.hostPath?.path, nfs: .spec.nfs}'
# StorageClass tells you the provisioner and protocol
kubectl describe storageclass <name>
kubectl get storageclass -o yaml | grep provisioner
```
3. Will the Data Persist?
```bash
# Reclaim policy: Retain = data survives PVC deletion, Delete = data gone
kubectl get pv <pv-name> -o json | jq '{reclaimPolicy: .spec.persistentVolumeReclaimPolicy, volumeMode: .spec.volumeMode}'
# emptyDir = dies with the pod, never persists
kubectl get pod <pod-name> -n <namespace> -o json | jq '.spec.volumes[] | select(.emptyDir) | .name'
# Check if PVC is owned by a parent resource (gets deleted with it)
kubectl get pvc <pvc-name> -n <namespace> -o json | jq '.metadata.ownerReferences'
# Full persistence overview for all PVs
kubectl get pv -o json | jq '.items[] | {name: .metadata.name, reclaim: .spec.persistentVolumeReclaimPolicy, status: .status.phase}'
```
Quick rule of thumb:
- `emptyDir` → data gone when pod dies
- `hostPath` → data survives pod restarts but tied to one node
- PVC with `reclaimPolicy: Delete` → data gone when PVC deleted
- PVC with `reclaimPolicy: Retain` → data survives PVC deletion, PV must be manually reclaimed
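The rule of thumb can be applied mechanically with jq. A minimal sketch, assuming `jq` is installed; the here-doc is a hypothetical sample pod spec standing in for `kubectl get pod <pod> -n <namespace> -o json` on a live cluster:

```shell
# Classify each volume of a pod against the rule of thumb above.
# The jq filter inspects which key each volume entry carries.
classify='
  .spec.volumes[]? |
  if .emptyDir then "\(.name): emptyDir -> gone when pod dies"
  elif .hostPath then "\(.name): hostPath -> survives pod restarts, tied to one node"
  elif .persistentVolumeClaim then "\(.name): PVC \(.persistentVolumeClaim.claimName) -> check PV reclaimPolicy"
  else "\(.name): other (configMap/secret/projected)"
  end'

# Offline demo input; replace the here-doc with live kubectl output:
#   kubectl get pod <pod> -n <namespace> -o json | jq -r "$classify"
jq -r "$classify" <<'JSON'
{"spec":{"volumes":[
  {"name":"scratch","emptyDir":{}},
  {"name":"data","persistentVolumeClaim":{"claimName":"mysql-pvc"}}
]}}
JSON
```

For the PVC case the verdict still depends on the bound PV's `reclaimPolicy`, which the next section extracts.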
4. Full Persistence Picture — One Command
```bash
kubectl get pv -o json | jq '.items[] | {
  name: .metadata.name,
  reclaim: .spec.persistentVolumeReclaimPolicy,
  type: (.spec | keys | map(select(. != "accessModes" and . != "capacity" and . != "claimRef" and . != "volumeMode" and . != "persistentVolumeReclaimPolicy" and . != "storageClassName" and . != "nodeAffinity")))[0],
  claim: .spec.claimRef.name,
  namespace: .spec.claimRef.namespace
}'
```
Data Recovery — Copy Data & Restore to a New Pod
Step 1 — Find the Data Path and Node
```bash
kubectl get pv <pv-name> -o json | jq '.spec.hostPath.path // .spec.nfs // .spec.csi'
kubectl get pod <pod-name> -n <namespace> -o json | jq '.spec.nodeName'
```
Step 2 — Copy Data Out
```bash
# Option A: kubectl cp (easiest, works for running pods)
kubectl cp <namespace>/<pod-name>:/var/lib/mysql ./mysql-backup
# Option B: exec + tar stream (target dir must exist first)
mkdir -p ./mysql-backup
kubectl exec -n <namespace> <pod-name> -- tar czf - /var/lib/mysql | tar xzf - -C ./mysql-backup
# Option C: proper database dumps (always preferred for DBs)
# MySQL
kubectl exec -n <namespace> <pod-name> -- mysqldump -u root -p<pass> --all-databases > dump.sql
# PostgreSQL
kubectl exec -n <namespace> <pod-name> -- pg_dumpall -U postgres > dump.sql
# MongoDB
kubectl exec -n <namespace> <pod-name> -- mongodump --out /tmp/mongodump
kubectl cp <namespace>/<pod-name>:/tmp/mongodump ./mongodump
```
Step 3 — Create PV/PVC Pointing at Backup Data
```yaml
# pv-restore.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restore-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /path/to/your/mysql-backup
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
  namespace: default
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: ""  # skip dynamic provisioning so the static bind works
  resources:
    requests:
      storage: 10Gi
  volumeName: restore-pv  # bind directly to the PV above
```
Step 4 — Spin Up Pod Using That PVC
```yaml
# restore-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-restore
  namespace: default
spec:
  containers:
  - name: mysql
    image: mysql:8.0
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "yourpassword"
    volumeMounts:
    - name: data
      mountPath: /var/lib/mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: restore-pvc
```
```bash
kubectl apply -f pv-restore.yaml
kubectl apply -f restore-pod.yaml
# Wait for pod to be ready
kubectl wait --for=condition=Ready pod/mysql-restore --timeout=60s
# Verify data is there
kubectl exec -it mysql-restore -- mysql -u root -p -e "show databases;"
```
Step 5 — If Restoring from SQL Dump
```bash
# Push dump into the running pod
kubectl exec -i mysql-restore -- mysql -u root -pyourpassword < dump.sql
# Verify
kubectl exec -it mysql-restore -- mysql -u root -p -e "show databases;"
```
Quick Reference — Storage Decision Matrix
| Volume Type | Survives Pod Restart | Survives Node Reboot | Survives PVC Delete | Notes |
|---|---|---|---|---|
| `emptyDir` | ✗ | ✗ | ✗ | Temp scratch space only |
| `hostPath` | ✓ | ✓ | ✓ | Tied to one node |
| PVC + `Delete` policy | ✓ | ✓ | ✗ | Default for most cloud provisioners |
| PVC + `Retain` policy | ✓ | ✓ | ✓ | Data survives, PV must be reclaimed manually |
| `emptyDir` with `medium: Memory` | ✗ | ✗ | ✗ | RAM disk — fastest, most volatile |
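The matrix also makes a useful audit: every bound PV with `reclaimPolicy: Delete` is one `kubectl delete pvc` away from data loss. A minimal sketch, assuming `jq` is installed; the here-doc holds hypothetical sample PVs standing in for `kubectl get pv -o json` on a live cluster:

```shell
# Flag bound PVs whose data is destroyed if the matching PVC is deleted.
audit='.items[]
  | select(.spec.persistentVolumeReclaimPolicy == "Delete" and .status.phase == "Bound")
  | "\(.metadata.name)  \(.spec.claimRef.namespace)/\(.spec.claimRef.name)"'

# Offline demo input; on a real cluster run:
#   kubectl get pv -o json | jq -r "$audit"
jq -r "$audit" <<'JSON'
{"items":[
  {"metadata":{"name":"pv-a"},
   "spec":{"persistentVolumeReclaimPolicy":"Delete","claimRef":{"namespace":"prod","name":"db-data"}},
   "status":{"phase":"Bound"}},
  {"metadata":{"name":"pv-b"},
   "spec":{"persistentVolumeReclaimPolicy":"Retain","claimRef":{"namespace":"prod","name":"logs"}},
   "status":{"phase":"Bound"}}
]}
JSON
```

Only `pv-a` is printed here; the `Retain` volume passes the audit. Anything this flags on a real cluster is a candidate for patching to `Retain` before you touch its PVC.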