How do I discover the state of an unfamiliar cluster and its storage?

When handed a Kubernetes cluster with hundreds of pods, you need a systematic approach to discovery. This field guide uses kubectl and jq to map out the environment, find broken workloads, and extract data from Persistent Volumes.


Phase 1: The Cluster Triage Workflow

Instead of randomly clicking through a dashboard, use these phases to build a mental map of the cluster.

1. Scope & Scale

Understand the physical capacity and boundaries of the cluster.

bash
# Get the Kubernetes version and API server address
kubectl cluster-info
kubectl version

# Check the nodes, their capacity, and any taints restricting placement
kubectl get nodes -o wide
kubectl top nodes
kubectl describe nodes | grep -A 5 -E "Taints|Capacity|Allocatable"
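The same inventory can be pulled as structured data instead of grepping human-readable output. A sketch using jq over the standard v1 Node fields (allocatable capacity and taint keys):

```shell
# One row per node: name, allocatable CPU, allocatable memory, taint keys (or "none")
kubectl get nodes -o json | jq -r '
  .items[]
  | [.metadata.name,
     .status.allocatable.cpu,
     .status.allocatable.memory,
     ([.spec.taints[]?.key] | join(",") | if . == "" then "none" else . end)]
  | @tsv'
```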

2. Workload Overview

Identify what is running and, more importantly, where it is running.

bash
# Count namespaces and common API objects (note: "get all" covers only a subset of resource types)
kubectl get namespaces
kubectl get all -A | wc -l

# See a rolled-up count of pods per namespace
kubectl get pods -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn

# Inventory the higher-level controllers managing the pods
kubectl get deployments,statefulsets,daemonsets,jobs -A
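While inventorying controllers, it is worth flagging Pods that have no controller at all, since nothing will recreate them after a node failure. A sketch using the standard ownerReferences metadata:

```shell
# "Naked" pods: no ownerReferences means no Deployment/StatefulSet/etc. manages them
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(.metadata.ownerReferences == null)
  | "\(.metadata.namespace)/\(.metadata.name)"'
```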

3. Finding the Fire (Unhealthy Pods)

The most important one-shot triage command filters out all healthy Pods so you can focus strictly on what is broken.

bash
kubectl get pods -A -o wide | grep -vE "Running|Completed" | sort -k4

# Find pods specifically stuck in CrashLoopBackOff or OOMKilled states
kubectl get pods -A | grep -E "CrashLoop|Error|Pending|Evicted|OOMKilled"

# Find highly unstable pods (restarted more than 5 times)
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.restartCount > 5) | "\(.metadata.namespace)/\(.metadata.name)"'
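The commands above tell you which Pods are broken; the container status fields tell you why. A sketch that pulls the current waiting reason (e.g. CrashLoopBackOff) and the last termination reason (e.g. OOMKilled) for each container:

```shell
# namespace/pod <TAB> waiting reason <TAB> last termination reason ("-" if absent)
kubectl get pods -A -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns | .metadata.name as $pod
  | .status.containerStatuses[]?
  | select(.state.waiting or .lastState.terminated)
  | "\($ns)/\($pod)\t\(.state.waiting.reason // "-")\t\(.lastState.terminated.reason // "-")"'
```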

4. Networking & Routing Map

Determine how traffic enters the cluster and reaches the Pods.

bash
# Find all Ingress routes and Gateway API configurations
kubectl get ingress,gateway,httproute -A

# Find Services that are exposed externally (LoadBalancer or NodePort)
kubectl get svc -A | grep -v ClusterIP

# Crucial: Find "Broken Links" (Services that have no healthy backing Pods)
kubectl get endpoints -A | grep "<none>"
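The grep above matches the human-readable column, which can produce false hits. A structural check over the Endpoints objects avoids that (a sketch; it treats any Endpoints with zero addresses as a broken link):

```shell
# Services whose Endpoints carry no addresses at all (no healthy backing Pods)
kubectl get endpoints -A -o json | jq -r '
  .items[]
  | select([.subsets[]?.addresses[]?] | length == 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'
```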

5. Security Posture

bash
# See what ServiceAccounts are being used
kubectl get serviceaccounts -A

# See who can do what (RBAC)
kubectl get clusterrolebindings | grep -v "system:"
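For a quicker read on blast radius, list exactly who holds the built-in cluster-admin role. A sketch over the ClusterRoleBinding objects:

```shell
# Subjects bound to cluster-admin (the highest-privilege built-in ClusterRole)
kubectl get clusterrolebindings -o json | jq -r '
  .items[]
  | select(.roleRef.name == "cluster-admin")
  | .subjects[]?
  | "\(.kind)\t\(.name)"'
```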

Phase 2: Storage Discovery & Extraction

When a database Pod crashes in an unfamiliar cluster, you must find where the actual data lives on disk, figure out the StorageClass protocol, and safely extract it.

1. Map the Pod to the Data Path

This command connects a specific Pod directly to its underlying Persistent Volume Claim (PVC).

bash
kubectl get pods -A -o json | jq '.items[] | {pod: .metadata.name, ns: .metadata.namespace, pvcs: [.spec.volumes[]?.persistentVolumeClaim?.claimName // empty]}'
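The output above includes every Pod, most with an empty pvcs list. A variant (same field paths) that shows only Pods that actually mount a claim:

```shell
# namespace/pod <TAB> comma-separated PVC names, skipping pods with no PVCs
kubectl get pods -A -o json | jq -r '
  .items[]
  | [.spec.volumes[]?.persistentVolumeClaim?.claimName // empty] as $pvcs
  | select($pvcs | length > 0)
  | "\(.metadata.namespace)/\(.metadata.name)\t\($pvcs | join(","))"'
```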

2. Determine the Protocol and Location

Once you have the PVC name, you need to find the actual PersistentVolume (PV) it binds to, which reveals where the data resides on the infrastructure.

bash
# Find the StorageClass and VolumeName for a PVC
kubectl get pvc -n <namespace> <pvc-name> -o yaml | grep -E "volumeName|storageClass"

# Inspect the PV to find the actual backend data path (.spec block)
kubectl get pv <pv-name> -o json | jq '{driver: .spec.csi?.driver, path: .spec.hostPath?.path, nfs: .spec.nfs}'
  • hostPath: Data is on the local node's filesystem.
  • nfs: Points to an external NFS server IP and export path.
  • csi: Points to a cloud block storage volume (e.g., AWS EBS) managed by a CSI Driver.
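To size up the storage estate in one pass, the per-PV inspection above can be rolled up into a count per backend. A sketch that classifies each PV by its CSI driver, or by nfs/hostPath when no CSI driver is set:

```shell
# Count PersistentVolumes per backend type across the cluster
kubectl get pv -o json | jq -r '
  [.items[]
   | .spec.csi.driver
     // (if .spec.nfs then "nfs" elif .spec.hostPath then "hostPath" else "other" end)]
  | group_by(.) | map("\(.[0]): \(length)") | .[]'
```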

3. Will the Data Persist if the Pod Dies?

Check the Reclaim Policy of the volume. If it is set to Delete, the data will be destroyed when the PVC is deleted.

bash
# Identify volumes that will survive deletion (Retain policy)
kubectl get pv -o json | jq '.items[] | {name: .metadata.name, reclaim: .spec.persistentVolumeReclaimPolicy, status: .status.phase}'

# Warning: Check if the Pod is using an emptyDir (which ALWAYS dies with the Pod)
kubectl get pod <pod-name> -n <namespace> -o json | jq '.spec.volumes[] | select(.emptyDir) | .name'
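If a volume you care about carries the Delete policy, flip it to Retain before touching the claim. This uses the standard kubectl patch mechanism; <pv-name> is a placeholder:

```shell
# Protect the data first: switch the PV's reclaim policy from Delete to Retain
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```

With Retain set, deleting the PVC leaves the PV (and its data) in a Released state instead of destroying the backend volume.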

4. Extracting Database Dumps

If the Pod is still running but unhealthy, the safest way to extract data without dealing with raw filesystem mounts is to stream a live database dump directly to your local machine.

bash
# MySQL (note: the password after -p is visible in the pod's process list; prefer an env var or prompt when possible)
kubectl exec -n <namespace> <pod-name> -- mysqldump -u root -p<pass> --all-databases > local-dump.sql

# PostgreSQL
kubectl exec -n <namespace> <pod-name> -- pg_dumpall -U postgres > local-dump.sql

# Generic File Copy (If the DB is stopped but the Pod is sleeping)
kubectl cp <namespace>/<pod-name>:/var/lib/mysql ./local-mysql-backup

Based on Kubernetes v1.35 (Timbernetes).