Rolling Node Replacement: The Safest Way to Upgrade Kubernetes

15 / Mar / 2026 by Chetan Singh

Introduction

What if upgrading your Kubernetes cluster required no downtime at all?

Many teams worry that upgrading means their apps go offline, but with solid planning it’s entirely possible to run safe, fully disruption-free upgrades and keep everything running smoothly.

Kubernetes upgrades are basically swapping out your old nodes or cluster version to grab security patches, better performance, and support for newer APIs. Staying up-to-date matters — old nodes open the door to vulnerabilities, outdated features, and flaky workloads.

Here’s what you’ll get from this blog:

  • What a Kubernetes node upgrade actually is
  • Why upgrades matter in production
  • How pros handle upgrades, step-by-step
  • Upgrading clusters without node groups or Karpenter
  • Tips for true zero-downtime upgrades

What Is a Kubernetes Node Upgrade?

So, what is a Kubernetes node upgrade? It’s about replacing old worker nodes with ones running the latest OS image, Kubernetes version, or security fixes. Instead of poking at nodes in place, production setups use a rolling replacement: add new nodes, shift workloads, and remove old nodes. This keeps your apps up and running through the whole upgrade.

Why upgrade?

  • Patch security holes
  • Avoid broken APIs
  • Boost performance and reliability
  • Stay compatible with tools and add-ons
  • Keep your vendor support intact

Skipping upgrades? That just sets you up for headaches down the line.

Upgrade Architecture Flow

Let’s look at the upgrade flow you’d follow in a real production environment:

[Figure: EKS upgrade flow]

Pre-Upgrade Checklist

1. Verify Cluster Health

kubectl get nodes
kubectl get pods -A

All nodes must report Ready, and no pods should be crashing or stuck Pending.
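If you want to script this check, a small helper can parse the `kubectl get nodes` output. The `all_nodes_ready` function below is a sketch (the name is ours, not a kubectl feature); pipe the real command’s output into it:

```shell
# Parses `kubectl get nodes --no-headers` output
# (NAME STATUS ROLES AGE VERSION) and fails if any node is not Ready.
# Helper name is illustrative.
all_nodes_ready() {
  awk '$2 != "Ready" { print "Not ready:", $1; bad = 1 } END { exit bad }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | all_nodes_ready && echo "all Ready"
```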

2. Spot Deprecated APIs (tools like Pluto help)

pluto detect-all-in-cluster

If Pluto reports deprecated APIs, fix them before upgrading.

3. Write Down Your Cluster’s Details

aws eks describe-cluster --name <cluster>

Record:

  • Endpoint
  • Certificate
  • CIDR
  • Cluster name

[Figure: Pre-upgrade checklist]

Universal Upgrade Method (Works Everywhere)

This method works for:

  • Managed node groups
  • Self-managed nodes
  • Bare-metal clusters
  • Clusters without autoscalers

Step 1 – Add New Nodes

Create new nodes using the updated image/template.

kubectl get nodes

Wait until they show Ready.

Step 2 – Stop Scheduling on Old Node

kubectl cordon <node>

Step 3 – Validate New Nodes

Restart one deployment on the new capacity and watch it come up:

kubectl rollout restart deployment <app>
kubectl rollout status deployment <app>

If the rollout completes and pods start successfully → continue.

Step 4 – Validate Workloads Before Draining (Critical Step)

Check where pods are running:

kubectl get pods -o wide

Ensure:

  • Pods are running on new nodes
  • All replicas are healthy
  • No pods are pending
  • Applications are accessible

Never drain until workloads are confirmed healthy on new nodes.
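To make this check scriptable, you can scan the wide output for pods still scheduled on the node you’re about to drain. `pods_still_on` is an illustrative helper, not a kubectl subcommand:

```shell
# Parses `kubectl get pods -o wide --no-headers` output
# (NAME READY STATUS RESTARTS AGE IP NODE NOMINATED-NODE READINESS-GATES)
# and lists pods still scheduled on the given node. Helper name is illustrative.
pods_still_on() {
  awk -v node="$1" '$7 == node { print $1; found = 1 } END { exit found }'
}

# Usage: kubectl get pods -o wide --no-headers | pods_still_on <old-node>
# Exit 0 (and no output) means nothing is left on the old node.
```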

Step 5 – Drain the Old Node

Now safely evict pods:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

This command:

  • Evicts running pods
  • Reschedules them on available nodes
  • Skips DaemonSets (CNI, kube-proxy, etc.)

Wait for the drain command to finish before moving on.

Step 6 – Validate After Drain

kubectl get pods -A
kubectl get events --sort-by=.metadata.creationTimestamp

Confirm:

  • No CrashLoopBackOff pods
  • No scheduling failures
  • No Pending workloads
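These confirmations can also be scripted. The helper below (name is ours) scans the all-namespaces pod listing for the problem states listed above:

```shell
# Parses `kubectl get pods -A --no-headers` output
# (NAMESPACE NAME READY STATUS RESTARTS AGE) and flags problem states.
# Helper name is illustrative.
no_unhealthy_pods() {
  awk '$4 ~ /CrashLoopBackOff|Pending|Error|ImagePullBackOff/ {
         print $1 "/" $2 ":", $4; bad = 1
       } END { exit bad }'
}

# Usage: kubectl get pods -A --no-headers | no_unhealthy_pods && echo "all healthy"
```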

Step 7 – Remove Old Node

kubectl delete node <node>

Terminate the underlying VM if required.

Step 8 – Repeat

Repeat for the remaining nodes until all are upgraded.

Golden Rule for Production Upgrades

Add capacity → Validate workloads → Drain → Validate again → Delete node

Skipping validation is the most common cause of upgrade-related downtime.
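The per-node loop above can be sketched as a script. The node names, the `run` wrapper, and the `DRY_RUN` switch are all illustrative; with `DRY_RUN=true` (the default here) the script only prints what it would do:

```shell
#!/usr/bin/env bash
# Rolling replacement sketch: cordon, drain, delete - one node at a time.
# DRY_RUN=true prints each command instead of executing it.
set -u
DRY_RUN="${DRY_RUN:-true}"

run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

replace_node() {
  local node="$1"
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # In a real run, validate workloads here (Step 6) before deleting.
  run kubectl delete node "$node"
}

# Illustrative node names - replace with your old nodes:
for node in ip-10-0-1-10 ip-10-0-1-11; do
  replace_node "$node"
done
```

Running it with `DRY_RUN=false` would execute the commands for real, so keep the validation steps between drain and delete.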

Zero Downtime Requirements

To avoid downtime during the upgrade:

  • Minimum 2 replicas per deployment
  • Readiness probes configured
  • PodDisruptionBudget enabled
  • Extra cluster capacity available
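Putting the first three requirements together, a sketch of a Deployment plus a matching PodDisruptionBudget might look like this (names, labels, image, and probe path are all placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1          # a drain may evict only one replica at a time
  selector:
    matchLabels:
      app: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2              # at least 2, so one serves while one moves
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27        # placeholder image
          readinessProbe:          # traffic only once the pod is ready
            httpGet:
              path: /
              port: 80
```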

Common Mistakes to Avoid

  • Upgrading nodes before the control plane
  • Draining all nodes together
  • Ignoring deprecated APIs
  • No spare capacity
  • No rollback plan

Rollback Strategy

If something breaks:

  • Create nodes with the previous image
  • Cordon new nodes
  • Drain new nodes
  • Delete new nodes

This safely restores the previous state.

[Figure: Zero-downtime requirements vs. common mistakes]

Conclusion

In the end, Kubernetes upgrades shouldn’t keep you up at night. With rolling replacements, you can upgrade with confidence and no downtime, whether you rely on node groups, autoscalers, or manage infrastructure the old-fashioned way.

Key takeaway: Always add new nodes, migrate workloads, and then delete the old ones. Never try upgrading everything all at once.
