Not so resilient filesystem
Azure Stack HCI - or Azure Stack Local, as it's now called - has been quite the ride. Today we went through what should have been a fairly routine task: upgrading the firmware on our physical Dell nodes across all stacks.
Now, you'd think that with Dell and Microsoft working hand-in-hand, this would be straightforward. But chatting with Dell engineers during the process quickly showed me that even they find a lot of the maintenance, and Microsoft's choices, painful. Microsoft really has a talent for making simple things unnecessarily difficult. So no, it's not just us struggling...
23H2
And to make things more interesting: we’ve got one particular stack that’s nothing but trouble. It’s been a headache from day one, and the recent upgrade from 22H2 to 23H2 really pushed it over the edge.
I'll be honest, this was my first time maintaining an Azure Stack Local environment. I've worked with failover clusters before, and on the surface it feels similar. But the Azure sprinkles on top are really weird.
After the upgrade, we lost five Cluster Shared Volumes (CSVs). Yep, five.
And with ReFS (specifically CSVFS_Refs), that really shouldn’t happen. Especially since we were running the upgrade manually instead of through Cluster Aware Updating - carefully draining nodes, stopping services and waiting for storage repair and resilvering jobs to complete, and doing things by the book.
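For the record, the per-node routine looked roughly like the sketch below. This is a simplified reconstruction, not our exact script; the node name and polling interval are placeholders:

```powershell
# Drain roles off the node before touching firmware (placeholder node name)
Suspend-ClusterNode -Name "HCI-NODE-01" -Drain -Wait

# Wait until all Storage Spaces Direct repair/resilver jobs have finished
while (Get-StorageJob | Where-Object { $_.JobState -ne 'Completed' }) {
    Start-Sleep -Seconds 60
}

# Sanity check: every virtual disk should report healthy before moving on
Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy'

# After the firmware update and reboot, bring the node back into the cluster
Resume-ClusterNode -Name "HCI-NODE-01" -Failback Immediate
```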
But nope. The volumes just died. In Dutch we’d call that KUT.
Thankfully, we had backups. But restoring them meant losing about four hours of data. So instead, I rolled up my sleeves and tried salvaging the volumes directly. You can't keep me away from a challenge! GRR!
Refsutil
Enter refsutil.exe: Microsoft’s tool for ReFS volumes.
Here's how I used it:
- Mounted the virtual volumes using two nodes:
  - One as the owner of the broken shared volume.
  - One running a custom recovery role with the volume attached as a resource, so I could work on two volumes at a time.
- Used diskpart to find the virtual disk, select it, and assign it a drive letter.
- Created a fresh recovery volume in our storage pool:
New-Volume -FriendlyName "Recovery" -FileSystem CSVFS_ReFS -StoragePoolFriendlyName S2D* -Size 5TB
- Ran refsutil salvage:
refsutil salvage -QA D: C:\ClusterStorage\Recovery\temp C:\ClusterStorage\Recovery\recovered -x -v
The key options we used:
- -Q → Quick mode (faster scan; our data was in great shape, the corruption was more of a phantom).
- -A → Salvage everything possible.
- -x → Extract salvaged files into the destination folder.
- -v → Verbose output for progress.
- Finally, replaced each role's linked storage with the recovered volumes (the ones ending in _2) using a PowerShell script; a rough sketch of that script follows below.
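The repoint script wasn't anything fancy. Conceptually it was a loop like this sketch - my reconstruction, with hypothetical volume names and without any of the safety checks we had around it:

```powershell
# Map each broken CSV path to its recovered counterpart (hypothetical volume names)
$pathMap = @{
    'C:\ClusterStorage\Volume01' = 'C:\ClusterStorage\Volume01_2'
    'C:\ClusterStorage\Volume02' = 'C:\ClusterStorage\Volume02_2'
}

# Re-point every VM disk that still references one of the old paths
foreach ($vm in Get-VM) {
    foreach ($disk in Get-VMHardDiskDrive -VM $vm) {
        foreach ($old in $pathMap.Keys) {
            if ($disk.Path -like "$old\*") {
                $newPath = $disk.Path.Replace($old, $pathMap[$old])
                Set-VMHardDiskDrive -VMHardDiskDrive $disk -Path $newPath
            }
        }
    }
}

# For clustered VM roles you'd follow up with something like
# Update-ClusterVirtualMachineConfiguration -VMId $vm.VMId
```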
And it worked perfectly.
Firmware Upgrade
Fast-forward to today. We were upgrading firmware again, this time with a Dell engineer watching. Guess what?
We lost another volume.
Even the engineer admitted he had never seen that happen before. Just our luck.
Naturally, I thought I could show off my recovery skills. But then… surprise twist: Microsoft had silently upgraded the ReFS version when we elevated the cluster functional level to 23H2.
Refsutil simply doesn't support salvaging the new version. And the kicker? The updated refsutil isn't even included in Azure Stack Local 23H2.
I tried grabbing refsutil from a Windows Server 2025 build, only to be greeted with: “not supported.”
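If you want to see which on-disk ReFS version a volume actually ended up on after the cluster functional level bump, fsutil can tell you - assuming your build's fsutil exposes the refsinfo subcommand (the CSV mount point here is just a placeholder):

```powershell
# Query the on-disk ReFS format version of a volume (placeholder path)
fsutil fsinfo refsinfo C:\ClusterStorage\Volume01
```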
Production Is Preview, Preview Is Production
So here we are again: end users, effectively beta-testing in production. Microsoft seems to have done it again: "preview" behaves like production, and "production" feels like preview.
Honestly, I don’t know what’s going on anymore. What I do know is this: this was the last time I’ll be touching Azure Stack Local.
Proost 🍻,
P.S. Are we the only ones whose Azure Stack Local is stuck in a botched state where Windows Admin Center says we have to update in Azure, and Azure says we have to be on 23H2 before we can update?
Azure, WAC, whatever; we don't trust CAU anyway, so manual it was.