How should Kubernetes clusters be upgraded and operated in production to minimize downtime and risk?

Hello. As a Senior Kubernetes Platform Engineer, I will outline the architectural patterns and operational workflows required to maintain a production cluster. Minimizing downtime requires a "shared responsibility" model: the platform team must maintain High Availability (HA) infrastructure, while application teams must configure workloads to tolerate voluntary disruptions.

Here is the engineering guide to upgrading and operating Kubernetes with minimal risk.

1. High Availability (HA) Architecture

To perform upgrades or maintenance without outages, you must eliminate single points of failure in the Control Plane. A production cluster should generally run at least three control plane nodes.

There are two primary topologies for HA clusters:

  • Stacked etcd: The etcd distributed data store runs on the same nodes as the control plane components (API server, Scheduler, Controller Manager). This is simpler to set up but couples the failure domains; if a node dies, you lose both a control plane replica and an etcd member.
  • External etcd: The etcd cluster runs on separate hosts. This decouples the control plane from the data store, offering better redundancy (losing a control plane node doesn't impact data consensus), but requires twice the number of hosts.

Engineering Best Practice: Regardless of topology, spread these nodes across multiple failure zones (availability zones) to ensure the cluster survives a zonal outage.
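
For illustration, a stacked-etcd HA control plane built with kubeadm is typically initialized against a load-balanced endpoint rather than a single node's address, so that any one API server can be taken out of rotation during maintenance. This is a minimal sketch under that assumption; the version, hostname, and file name are placeholders, not values from this guide.

    # kubeadm-config.yaml (sketch)
    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    kubernetesVersion: v1.34.0                              # placeholder version
    controlPlaneEndpoint: "k8s-api.example.internal:6443"   # DNS name of the load balancer, not a node IP

    # Initialize the first control plane node:
    #   kubeadm init --config kubeadm-config.yaml --upload-certs
    # Join the remaining control plane nodes with the "kubeadm join ... --control-plane"
    # command printed by init.

Keeping controlPlaneEndpoint on a stable load-balancer address is what allows control plane nodes to be upgraded or replaced one at a time without clients losing the API endpoint.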

2. The Upgrade Strategy: Order of Operations

Upgrading a cluster is a staged process. You must strictly adhere to the supported version skew policies to maintain compatibility between components during the transition. You should never skip minor versions (e.g., do not jump from 1.33 to 1.35 directly).

The mandatory upgrade sequence is:

  1. Upgrade the Control Plane: Upgrade the primary control plane node first, followed by the remaining control plane nodes. This involves upgrading the API server, Controller Manager, Scheduler, and etcd.
  2. Upgrade CNI Plugins: After the control plane is stable, manually upgrade your Container Network Interface (CNI) provider if required by the new Kubernetes version.
  3. Upgrade Worker Nodes: Finally, upgrade the kubelet and kubectl on worker nodes, usually in batches or one by one.

Why this order? The kubelet is designed to communicate with an API server that is the same version or newer. An older API server may not understand data sent by a newer kubelet. Therefore, the Control Plane (API Server) must always be upgraded before the Kubelets (Nodes).
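
For a kubeadm-managed cluster, this sequence maps roughly to the commands below. The target version and package-manager details are placeholders and vary by distribution; treat this as a sketch of the flow, not a copy-paste runbook.

    # --- On the first control plane node ---
    kubeadm upgrade plan              # report available versions and run preflight checks
    kubeadm upgrade apply v1.34.1     # placeholder target version

    # --- On each remaining control plane node ---
    kubeadm upgrade node

    # --- On each worker node, one at a time (drain it first; see section 3) ---
    kubeadm upgrade node              # refreshes the local kubelet configuration
    # then upgrade the kubelet (and kubectl) packages via the OS package manager
    # and restart the kubelet service

Upgrading workers one at a time (or in small batches) limits the blast radius: the cluster only needs enough spare capacity to reschedule the Pods from the nodes currently being drained.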

3. Safely Upgrading Nodes (The Drain Workflow)

To upgrade a node (e.g., to update the kubelet or patch the OS kernel), you must remove it from service without abruptly killing the applications running on it. This is done with kubectl drain, which uses the Eviction API; a command-level sketch follows the workflow below.

The Workflow:

  1. Cordon: Mark the node as unschedulable. This prevents new Pods from landing on the node while you are working on it.
  2. Drain: The kubectl drain command evicts existing Pods. It safely terminates containers by sending a SIGTERM, waiting for the terminationGracePeriodSeconds, and respecting PodDisruptionBudgets (PDBs).
  3. Maintenance/Upgrade: Perform the kubelet upgrade or OS patch.
  4. Uncordon: Mark the node as schedulable again to allow workloads to return.
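
In command form, the drain workflow is roughly the following; the node name is a placeholder, and the exact drain flags depend on what runs on the node.

    # 1. Cordon: stop new Pods from being scheduled onto the node
    kubectl cordon worker-3

    # 2. Drain: evict existing Pods, respecting PDBs and grace periods.
    #    DaemonSet Pods cannot be evicted and must be ignored explicitly;
    #    --delete-emptydir-data is required if Pods use emptyDir volumes.
    kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data

    # 3. Maintenance: upgrade the kubelet, patch the OS, reboot if needed.

    # 4. Uncordon: allow the scheduler to place Pods on the node again
    kubectl uncordon worker-3

Note that kubectl drain cordons the node itself, so the explicit cordon step mainly matters when you want to stop scheduling ahead of the actual eviction window.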

4. Workload Configuration for Reliability

The platform cannot guarantee zero downtime if the applications are not configured to be resilient. You must enforce the following standards for production workloads:

  • Replicas: Run at least two replicas of stateless applications behind a Service. Single-instance pods will experience downtime during a node drain.
  • PodDisruptionBudgets (PDB): A PDB declares how many voluntary disruptions your application can tolerate at once. For example, with 5 replicas you can set minAvailable: 4; if draining a node would drop the number of available replicas below 4, the eviction is blocked and kubectl drain waits until the other replicas are healthy again (see the manifest sketch after this list).
    • Warning: Do not set maxUnavailable to 0 or minAvailable to 100%, as this will block all node maintenance indefinitely.
  • Probes: Define Readiness Probes. During a rolling update, a new Pod should not receive traffic until it is fully initialized. Without readiness probes, the Service load balancer might send traffic to a Pod that is crash-looping or still starting up, causing errors.
  • Rolling Updates: Use the RollingUpdate strategy for Deployments. This ensures that old Pods are only killed after new Pods are started and Ready.
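
A minimal sketch that combines these standards in one pair of manifests; the name, image, port, and probe path are illustrative assumptions, not values from this guide.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3                     # at least two, so a drain never takes the app to zero
      selector:
        matchLabels:
          app: web
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1                 # start a new Pod before an old one is removed
          maxUnavailable: 0
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: web
            image: registry.example.com/web:1.2.3    # placeholder image
            ports:
            - containerPort: 8080
            readinessProbe:           # no traffic until the Pod reports ready
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 5
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
    spec:
      minAvailable: 2                 # voluntary disruptions may never drop ready Pods below 2
      selector:
        matchLabels:
          app: web

With three replicas and minAvailable: 2, a node drain can evict at most one of these Pods at a time, which is exactly the behaviour the drain workflow in section 3 relies on.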

5. Risk Mitigation Checklist

Before touching a production cluster, verify the following:

  • etcd Backups: etcd is the source of truth for the cluster. You must have a recent snapshot before upgrading; if the control plane fails catastrophically, you can restore the cluster state from it (a snapshot command sketch follows this list).
  • Deprecation Check: APIs are removed in newer versions. Before upgrading, check for the use of deprecated APIs (e.g., v1beta1 versions of resources) and migrate manifests to the stable v1 API. Tools like kubectl convert can assist with this.
  • Quota Management: Ensure namespaces have ResourceQuotas to prevent a single tenant from consuming all resources during an upgrade-induced rescheduling event.
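
A snapshot sketch for a kubeadm-style stacked etcd, assuming etcd listens on 127.0.0.1:2379 and its certificates live under /etc/kubernetes/pki/etcd; adjust endpoints, paths, and the output location to your environment.

    # Take a snapshot of etcd before the upgrade
    ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    # Verify the snapshot is readable before relying on it
    ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-pre-upgrade.db --write-out=table

Store the snapshot off the node (object storage or a backup host); a backup that lives only on the control plane node it protects is of little use after a node failure.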

Summary Table: Operational Roles

Responsibility        Action
Platform Team         Provision HA Control Plane, perform etcd backups, manage Node upgrades via Drain/Cordon.
App Team              Define PDBs, Liveness/Readiness Probes, and ensure Deployments use RollingUpdate strategies.
Automated System      kube-scheduler places pods on healthy nodes; kube-controller-manager ensures desired replica counts are met.