A cluster inside your cluster: how I test scary upgrades

June 2, 2026

proxmox, testing, homelab, virtualization

In the last post I made the no-downtime cluster upgrade sound calm — drain a node, bump it, verify, move on, nobody notices. That calm has a secret backstory, and the backstory is this: by the time I do that upgrade for real, I’ve already done it once or twice somewhere that didn’t matter. This is the post about somewhere that didn’t matter.

Here’s the honest version of how I avoid being scared of upgrades. I never run a new major version, a weird new feature, or a day-one release against the production cluster first. I run it against a couple of small throwaway clusters that live inside the main one. They’re disposable. Their entire job is to absorb the breakage so the real thing never has to.

Don’t test on prod. You knew that.

Everyone knows “don’t test in production.” Fewer people have a comfortable place to test instead, so they end up doing it in prod anyway and calling it “being careful.” The trick isn’t willpower — it’s having a sandbox that’s so cheap and quick to spin up that using it is the path of least resistance.

The good news for self-hosters: you almost certainly already own the hardware to build that sandbox. It’s the same box you’re already running everything on.

A cluster inside your cluster, in plain terms

The thing that makes this possible is nested virtualization — and the name is scarier than the idea. Your hypervisor’s whole job is to run virtual machines. Nested virtualization just means one of those virtual machines is itself allowed to run virtual machines. You flip a setting, and suddenly a VM can pretend to be a real physical server convincingly enough to host its own little world.
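To make "flip a setting" a little more concrete, here is a minimal sketch of checking whether that setting is already on. It assumes a Linux host using KVM, where the kernel modules report a `nested` parameter under `/sys/module`; the function names are made up for illustration, and other hypervisors expose this differently. The guest usually also needs a CPU type that passes the virtualization extensions through, which this check does not cover.

```python
from pathlib import Path

# Assumption: a Linux/KVM host. The kvm_intel and kvm_amd modules
# expose a "nested" parameter at these paths when they are loaded.
NESTED_PARAM_PATHS = (
    Path("/sys/module/kvm_intel/parameters/nested"),
    Path("/sys/module/kvm_amd/parameters/nested"),
)


def parse_nested_flag(raw: str) -> bool:
    """Interpret the parameter value, which kernels report as Y/N or 1/0."""
    return raw.strip().upper() in {"Y", "1"}


def nested_virt_enabled(paths=NESTED_PARAM_PATHS) -> bool:
    """Return True if any readable parameter file says nesting is on."""
    for path in paths:
        try:
            if parse_nested_flag(path.read_text()):
                return True
        except OSError:
            # Module not loaded, or this is not a Linux/KVM host at all.
            continue
    return False


if __name__ == "__main__":
    print("nested virtualization enabled:", nested_virt_enabled())
```

If this reports False on a KVM host, the nested parameter has to be turned on for the module before a VM can run VMs of its own.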

So I take a few of those VMs, install the same hypervisor I run in production onto them, and have them join hands into a miniature cluster. It behaves like the real cluster — same management screen, same live migration, same upgrade steps — except every “server” in it is really just a VM on the big cluster. A cluster inside a cluster. Tiny, slow, and completely fake in the ways that matter for safety, completely real in the ways that matter for testing.

Why two of them

One sandbox is good. Two is where it gets genuinely useful, because two lets you compare instead of just try.

I keep two of these little throwaway pools around. The usual arrangement is one running the current stable version — a stand-in for what production looks like today — and one I’m willing to drag out to the bleeding edge. That setup answers the two questions I actually care about:

  • Does the upgrade path work? I run the exact rolling upgrade on the stable pool and watch it go from old to new. If it’s going to choke on a changed default or a deprecated thing, I’d much rather find out here.
  • Is the new shiny thing worth it? I throw the experimental release or the half-baked feature at the second pool and just use it. The stable pool sits right next to it for an honest before-and-after comparison, no guessing.

And sometimes the second pool exists purely so I can break one freely while the other stays in a known-good state to compare against. Having a control group makes “huh, that’s weird” a lot easier to chase down.
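One way to put the control group to work is to diff what each pool is actually running. This is a rough sketch, not a real tool: the package names and version numbers are invented, and how you collect them from each node is left out entirely.

```python
from typing import Dict, Optional, Tuple

VersionDiff = Dict[str, Tuple[Optional[str], Optional[str]]]


def diff_versions(stable: Dict[str, str], edge: Dict[str, str]) -> VersionDiff:
    """Return every package whose version differs between the two pools.

    Each value is (version on stable, version on edge); None means the
    package is missing from that pool.
    """
    changed: VersionDiff = {}
    for name in sorted(stable.keys() | edge.keys()):
        before, after = stable.get(name), edge.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Hypothetical inventories, invented purely for illustration.
STABLE_POOL = {"hypervisor-manager": "8.0", "cluster-stack": "3.1", "kernel": "6.8"}
EDGE_POOL = {"hypervisor-manager": "9.0", "cluster-stack": "3.1", "kernel": "6.14"}

for package, (old, new) in diff_versions(STABLE_POOL, EDGE_POOL).items():
    print(f"{package}: {old} -> {new}")
```

When something looks weird on the experimental pool, a list like this narrows down where to start looking.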

Disposable is the whole point

The magic ingredient isn’t the cluster — it’s that I genuinely do not care if I destroy it.

Before I try something sketchy, I snapshot the VMs. Then I do the sketchy thing. If it works, great. If it wedges the whole pretend cluster into an unbootable mess, I roll the snapshot back and it’s like it never happened. And if I’ve truly made a hash of it, I don’t even bother repairing: I delete the VMs and rebuild the pool from scratch, which at this small a scale takes minutes, not an evening.
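The snapshot, try, roll back routine can be sketched as a small wrapper. Everything here is illustrative: `snapshot` and `rollback` are placeholder callables you would supply yourself, and the VM IDs are made up. On a Proxmox host those callables might wrap `qm snapshot` and `qm rollback`, though that mapping is an assumption rather than something this post spells out.

```python
def try_sketchy_thing(vm_ids, experiment, snapshot, rollback, name="pre-experiment"):
    """Snapshot every VM, run the experiment, and roll back if it fails.

    `experiment` should raise an exception when things go wrong, for
    example after a health check fails. Returns True if it succeeded,
    False if the pool was rolled back.
    """
    for vm in vm_ids:
        snapshot(vm, name)
    try:
        experiment()
    except Exception as err:
        print(f"experiment failed ({err}); rolling back to '{name}'")
        for vm in vm_ids:
            rollback(vm, name)
        return False
    return True


if __name__ == "__main__":
    # Dry run with stand-in callables that only print what they would do.
    def broken_upgrade():
        raise RuntimeError("cluster wedged")

    try_sketchy_thing(
        vm_ids=[101, 102, 103],  # hypothetical VM IDs
        experiment=broken_upgrade,
        snapshot=lambda vm, name: print(f"snapshot {vm} as {name}"),
        rollback=lambda vm, name: print(f"rollback {vm} to {name}"),
    )
```

Passing the snapshot and rollback steps in as callables keeps the pattern independent of any one hypervisor's tooling.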

That freedom changes how you experiment. When wrecking the environment costs nothing, you stop tiptoeing. You try the upgrade the wrong way on purpose just to see what the failure looks like, so you’ll recognize it if it ever shows up for real. You learn the shape of the disaster in a place where the disaster is free.

How this makes the real upgrade boring

This is the part that ties back to the last post. The reason I can drain a production node and bump a major version without sweating is that the scary unknowns already got spent on the sandbox. By the time I touch the real cluster, the rolling upgrade isn’t an experiment — it’s a rerun. I’ve seen where it stumbles, I’ve read the release notes and watched them come true, and I know what healthy looks like on the other side.
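For what it's worth, the rehearsal itself has a simple shape: drain a node, bump it, verify, move on, and stop the moment a node comes back unhealthy. A minimal sketch of that loop follows, with `drain`, `upgrade`, and `is_healthy` as hypothetical callables standing in for whatever your hypervisor's tooling provides.

```python
def rolling_upgrade(nodes, drain, upgrade, is_healthy):
    """Upgrade nodes one at a time, halting at the first unhealthy one.

    Returns (upgraded_nodes, failed_node). failed_node is None when every
    node passed its health check; otherwise the remaining nodes were left
    untouched on the old version.
    """
    upgraded = []
    for node in nodes:
        drain(node)      # move workloads off before touching the node
        upgrade(node)    # bump the version on the now-empty node
        if not is_healthy(node):
            return upgraded, node
        upgraded.append(node)
    return upgraded, None


if __name__ == "__main__":
    # Stand-in callables that simulate the second node failing its check.
    done, failed = rolling_upgrade(
        nodes=["sandbox-1", "sandbox-2", "sandbox-3"],  # hypothetical names
        drain=lambda n: print(f"draining {n}"),
        upgrade=lambda n: print(f"upgrading {n}"),
        is_healthy=lambda n: n != "sandbox-2",
    )
    print("upgraded:", done, "failed:", failed)
```

Running the same loop against the sandbox first is what turns the production run into a rerun: the stop-on-failure branch has already been exercised somewhere it costs nothing.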

You don’t need any of this to self-host. It’s firmly in “because it’s fun and I learn things” territory. But a little fake cluster you’re allowed to set on fire is one of the best deals in the whole hobby: it turns the most nerve-wracking maintenance you do into a quiet, well-rehearsed evening. Build the sandbox. Break it on purpose. Then go be boring in production.
