Upgrading a Proxmox cluster with nobody noticing
Here’s a thing that still feels a little magical to me: I can take a rack of servers running every important service in the house — DNS, the firewall, photos, the lot — bump the underlying hypervisor up a whole major version, and nobody using any of it notices a thing. No maintenance window. No “the internet is down” from across the house. The services just keep running while the floor gets rebuilt underneath them.
That’s the payoff of running a cluster, and it used to be reserved for enterprise data centers. You don’t need any of this to self-host — I do it because it’s satisfying, and the skills map onto the day job. This is the high-level version of how it works.
Why bother upgrading at all?
Fair question. If the cluster is running fine, why touch it?
A few honest reasons. New major versions bring a newer base OS and kernel, security fixes that eventually stop landing on the old release, and features you’ll want later. More than anything, software that never gets upgraded becomes software you’re scared to upgrade — the gap widens until you’re doing a risky big-bang rebuild instead of a calm routine bump. Staying roughly current is the cheapest insurance there is. (You don’t have to chase day-one releases, though — letting a version settle for a few weeks while braver people find the early bugs is perfectly respectable.)
What a cluster actually buys you
A single server can run a lot of virtual machines, but it’s still one server. Reboot it and everything on it goes down with it.
A cluster is several physical servers that agree to act as one system, sharing a single management view. The important part: they can hand virtual machines back and forth while those machines keep running. It’s called live migration. The VM’s memory gets copied to another host in the background, anything that changes mid-copy gets sent again, and at the end there’s a switchover so fast the OS inside never realizes it changed buildings.
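To make the "copy in the background, then switch over" idea concrete, here is a toy model of iterative pre-copy. It is a deliberately simplified sketch, not how any particular hypervisor implements it: the function name, the fixed dirty rate, and the 50 MB switchover threshold are all assumptions made for illustration.

```python
def precopy_rounds(memory_mb, dirty_mb_per_s, link_mb_per_s,
                   switchover_mb=50, max_rounds=30):
    """Toy pre-copy model (illustrative only).

    Each round sends whatever is still outstanding. While that transfer
    runs, the VM keeps dirtying memory, and the dirtied amount becomes
    the next round's workload. Returns (rounds, remaining_mb) once the
    leftover is small enough to send during a brief pause, or None if
    the VM dirties memory faster than the link can keep up.
    """
    if link_mb_per_s <= 0:
        raise ValueError("link_mb_per_s must be positive")
    remaining = float(memory_mb)
    for rounds in range(1, max_rounds + 1):
        seconds = remaining / link_mb_per_s
        # Memory dirtied during this round has to be sent again;
        # it can never exceed the VM's total memory.
        remaining = min(float(memory_mb), dirty_mb_per_s * seconds)
        if remaining <= switchover_mb:
            return rounds, remaining
    return None


if __name__ == "__main__":
    print(precopy_rounds(8192, dirty_mb_per_s=100, link_mb_per_s=1000))
    print(precopy_rounds(8192, dirty_mb_per_s=1200, link_mb_per_s=1000))
```

The takeaway from the model is the ratio: as long as the link moves memory faster than the VM changes it, each round shrinks and the final pause is tiny. A very busy VM on a slow link may never get there.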
Once you have that, maintenance stops being scary. You can drain a host — push every VM off it onto the others — take that now-empty host offline, do whatever you need, and the rest of the cluster keeps serving everything the whole time. Work that used to require a downtime window now requires patience and a free evening.
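Draining only works if the other hosts have room for what you push onto them. A back-of-the-envelope check might look like the sketch below. It only compares RAM totals, the host names and numbers are made up, and a real placement also has to fit each VM onto one specific host, so treat it as a sanity check rather than a scheduler.

```python
def can_drain(hosts, node):
    """Rough check: can the other hosts absorb this node's VM memory?

    hosts maps a host name to {"ram_gb": total RAM, "used_gb": RAM
    committed to VMs}. This compares aggregate free RAM only; it does
    not verify that each individual VM fits on a single target host.
    """
    moving = hosts[node]["used_gb"]
    free_elsewhere = sum(
        h["ram_gb"] - h["used_gb"]
        for name, h in hosts.items()
        if name != node
    )
    return moving <= free_elsewhere


if __name__ == "__main__":
    # Hypothetical three-node cluster with comfortable headroom.
    relaxed = {name: {"ram_gb": 64, "used_gb": 30}
               for name in ("pve1", "pve2", "pve3")}
    # The same cluster run at roughly 90% of its memory.
    packed = {name: {"ram_gb": 64, "used_gb": 58}
              for name in ("pve1", "pve2", "pve3")}
    print(can_drain(relaxed, "pve1"), can_drain(packed, "pve1"))
```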
The rolling upgrade, in plain terms
Upgrading a major version across a three-node cluster looks like this, repeated for each node in turn:
- Pick a node and drain it. Live-migrate every VM off it onto the others. The cluster now carries the full workload on fewer hosts — which is exactly why you don’t want to run a cluster at 90% capacity.
- Upgrade the now-empty node. Point the package sources at the new release, run the upgrade, reboot. Nothing is running on it, so the reboot costs nothing.
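- Verify it. Does it come back up, rejoin cleanly, and look healthy? Can a test VM live-migrate onto it from an old-version node? (Don’t count on moving that VM back again: going from an old version to a new one is the direction the upgrade is designed around, and the reverse is generally not something to rely on.)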
- Move on. Migrate VMs onto the freshly upgraded node, pick the next host, repeat.
You spend part of the upgrade in a mixed-version cluster — some nodes new, some old — which is expected for the duration of a rolling upgrade, but not a place to set up camp. The goal is to get all nodes onto the same version in one sitting. When the last node is done, every service has been running continuously the whole time, and the cluster is on the new version. That’s the no-downtime part.
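The shape of that loop, stripped of anything Proxmox-specific, can be sketched in a few lines. The three callables here (`drain`, `upgrade`, `is_healthy`) are hypothetical placeholders for whatever you actually do at each step, whether that is a script or you clicking through the web UI; nothing below calls a real Proxmox API.

```python
def rolling_upgrade(nodes, drain, upgrade, is_healthy):
    """Upgrade nodes strictly one at a time, stopping at the first failure.

    drain, upgrade and is_healthy are caller-supplied functions that
    each take a node name. Returns the list of nodes that were upgraded
    and verified. Raises RuntimeError as soon as a node fails
    verification, so later nodes are never touched.
    """
    done = []
    for node in nodes:
        drain(node)      # live-migrate every VM off this node
        upgrade(node)    # switch package sources, upgrade, reboot
        if not is_healthy(node):
            raise RuntimeError(
                f"{node} failed verification; already upgraded: {done}"
            )
        done.append(node)
    return done


if __name__ == "__main__":
    log = []
    rolling_upgrade(
        ["pve1", "pve2", "pve3"],          # hypothetical node names
        drain=lambda n: log.append(f"drain {n}"),
        upgrade=lambda n: log.append(f"upgrade {n}"),
        is_healthy=lambda n: True,
    )
    print(log)
```

The one design choice worth noticing is the early stop: a failed check leaves you with one broken node and the remaining nodes untouched, still on the old version and still serving.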
The cautions that actually matter
I want to be candid here, because “zero downtime” makes this sound more carefree than it should be. A few things are non-negotiable.
Back up first. Confirm your backups are current and that you’ve actually test-restored one recently. A rolling upgrade is low-risk, but “low” is not “zero,” and the point of a backup is to be there on the day the math doesn’t go your way. If you can’t restore, you don’t have a backup — you have a folder full of hope.
Read the release notes. Every major version ships an official upgrade guide and a known-issues list. Read both before you start, not after something behaves oddly: major versions occasionally change defaults, deprecate something you depend on, or insist you be fully patched on the current version first. Five minutes of reading saves an hour of confusion.
Test the migrations before you commit. Live migration depends on the hosts being compatible enough — storage reachable from both ends, CPUs presenting a common feature set, networking that lets the memory copy flow. Before you drain a production host for real, migrate a throwaway test VM around the cluster and watch it succeed. Find out with a disposable VM, not the DNS server.
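Two of those compatibility conditions are easy to reason about as plain set operations. The sketch below assumes you have already collected each host's CPU flags (on x86 Linux they are listed on the `flags` line of `/proc/cpuinfo`) and the storage names each host can see. The host names, flag sets, and storage names are invented for the example, and this is no substitute for actually migrating a test VM.

```python
def common_cpu_flags(host_flags):
    """Return the CPU flags that every host in the cluster offers."""
    flag_sets = [set(flags) for flags in host_flags.values()]
    return set.intersection(*flag_sets) if flag_sets else set()


def unreachable_storage(vm_storages, host_storages):
    """Return {host: missing storage names} for hosts that cannot see
    every storage the VM's disks live on. Hosts that see everything
    are left out of the result."""
    needed = set(vm_storages)
    missing = {}
    for host, visible in host_storages.items():
        gap = needed - set(visible)
        if gap:
            missing[host] = gap
    return missing


if __name__ == "__main__":
    # Hypothetical cluster: pve1 has a newer CPU than the other two.
    flags = {
        "pve1": {"sse4_2", "avx2", "avx512f"},
        "pve2": {"sse4_2", "avx2"},
        "pve3": {"sse4_2", "avx2"},
    }
    print(sorted(common_cpu_flags(flags)))

    storages = {"pve1": ["local", "nfs-vm"], "pve2": ["local"]}
    print(unreachable_storage(["nfs-vm"], storages))
```

A VM that only uses flags from the common set can land on any host; one that was handed a flag only the newest box has will be stuck there.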
Go one node at a time, and verify between each. The discipline is the whole technique. Don’t upgrade two nodes in parallel to save time. Finish one, confirm it’s healthy and rejoined, then touch the next — so if something’s wrong, only one node is affected and the rest of the cluster keeps holding the line.
The quietly satisfying part
The best infrastructure work is the kind nobody ever sees. There’s a specific, nerdy joy in watching a progress bar tick across as a running virtual machine relocates to another box, knowing the photo upload someone’s doing right now didn’t even hiccup. You reboot a server that an hour ago carried a third of everything you run — and the dashboards stay green.
It’s production-style change management on hardware you own, at stakes low enough that a mistake is a lesson instead of an incident. Do it a few times and the next major version stops being a project you dread and becomes a quiet evening with a backup, the release notes, and a little patience.