Rolling Node Replacement: The Safest Way to Upgrade Kubernetes

Introduction

What if upgrading your Kubernetes cluster required no downtime at all?

Imagine if you could upgrade your Kubernetes cluster and keep everything running smoothly, with zero downtime. Sounds pretty great, right? A lot of teams worry that upgrading will mean their apps go offline, but with solid planning, it’s actually possible to have safe and totally disruption-free upgrades.

Kubernetes upgrades are basically swapping out your old nodes or cluster version to grab security patches, better performance, and support for newer APIs. Staying up-to-date matters — old nodes open the door to vulnerabilities, outdated features, and flaky workloads.

Here’s what you’ll get from this blog:

What a Kubernetes node upgrade actually is
Why upgrades matter in production
How pros handle upgrades, step-by-step
Upgrading clusters without node groups or Karpenter
Tips for true zero-downtime upgrades

What Is a Kubernetes Node Upgrade?

So, what is a Kubernetes node upgrade? It’s about replacing old worker nodes with ones running the latest OS image, Kubernetes version, or security fixes. Instead of poking at nodes in place, production setups use a rolling replacement: add new nodes, shift workloads, and remove old nodes. This keeps your apps up and running through the whole upgrade.

Why upgrade?

Patch security holes
Avoid broken APIs
Boost performance and reliability
Stay compatible with tools and add-ons
Keep your vendor support intact

Skipping upgrades? That just sets you up for headaches down the line.

Upgrade Architecture Flow

Let’s look at the upgrade flow you’d follow in a real production environment: EKS Upgrade Flow

eks_upgrade_flow

Pre-Upgrade Checklist

1. Verify Cluster Health

kubectl get nodes
kubectl get pods -A

All nodes must be ready.

2. Spot Deprecated APIs (tools like Pluto help)

pluto detect -A

If Pluto reports deprecated APIs, fix them before upgrading.

3. Write Down Your Cluster’s Details

aws eks describe-cluster –name <cluster>

Record:

Endpoint
Certificate
CIDR
Cluster name

pre_upgrade_checklist

Universal Upgrade Method (Works Everywhere)

This method works for:

Managed node groups
Self-managed nodes
Bare-metal clusters
Clusters without autoscalers

Step 1 – Add New Nodes

Create new nodes using the updated image/template.

kubectl get nodes

Wait until they show Ready.

Step 2 – Stop Scheduling on Old Node

kubectl cordon <node>

Step 3 – Validate New Nodes

Restart one deployment:

kubectl rollout restart deployment <app>

If pods start successfully → continue.

Step 4 – Validate Workloads Before Draining (Critical Step)

Check where pods are running:

kubectl get pods -o wide

Ensure:

Pods are running on new nodes
All replicas are healthy
No pods are pending
Applications are accessible

Never drain until workloads are confirmed healthy on new nodes.

Step 5 – Drain the Old Node

Now safely evict pods:

kubectl drain <node> –ignore-daemonsets –delete-emptydir-data

This command:

Evicts running pods
Reschedules them on available nodes
Skips DaemonSets (CNI, kube-proxy, etc.)

Evict pods so they get rescheduled on available nodes. DaemonSets (like CNI and kube-proxy) aren’t touched. Wait for the drain command to finish.

Step 6 – Validate After Drain

kubectl get pods -A
kubectl get events –sort-by=.metadata.creationTimestamp

Confirm:

No CrashLoopBackOff pods
No scheduling failures
No Pending workloads

Step 7 – Remove Old Node

kubectl delete node <node>

Terminate the underlying VM if required.

Step 8 – Repeat

Repeat for the remaining nodes until all are upgraded.

Golden Rule for Production Upgrades

Add capacity → Validate workloads → Drain → Validate again → Delete node

Skipping validation is the most common cause of upgrade-related downtime.

Zero Downtime Requirements

To avoid downtime during the upgrade:

Minimum 2 replicas per deployment
Readiness probes configured
PodDisruptionBudget enabled
Extra cluster capacity available

Common Mistakes to Avoid

Upgrading nodes before the control plane
Draining all nodes together
Ignoring deprecated APIs
No spare capacity
No rollback plan

Rollback Strategy

If something breaks:

Create nodes with the previous image
Cordon new nodes
Drain new nodes
Delete new nodes

This safely restores the previous state.

zero_downtime_requirements_vs_mistakes

Conclusion

In the end, Kubernetes upgrades shouldn’t keep you up at night. With rolling replacements, you can upgrade with confidence and no downtime, whether you rely on node groups, autoscalers, or manage infrastructure the old-fashioned way.

Key takeaway: Always add new nodes, migrate workloads, and then delete the old ones. Never try upgrading everything all at once.