Not so resilient filesystem
Azure Stack HCI - or Azure Stack Local, as it's now called - has been quite the ride. Today we went through what should have been a fairly routine task: upgrading the firmware on our physical Dell nodes across all stacks.
Now, you'd think that with Dell and Microsoft working hand-in-hand, this would be straightforward. But chatting with Dell engineers during the process quickly showed me that even they find a lot of the maintenance, and Microsoft's choices, painful. Microsoft really has a talent for making simple things unnecessarily difficult. So no, it's not just us struggling...
23H2
And to make things more interesting: we’ve got one particular stack that’s nothing but trouble. It’s been a headache from day one, and the recent upgrade from 22H2 to 23H2 really pushed it over the edge.
I'll be honest, this was my first time maintaining an Azure Stack Local environment. I've worked with failover clusters before, and on the surface it feels similar. But the Azure sprinkles on top are really weird.
After the upgrade, we lost five Cluster Shared Volumes (CSVs). Yep, five.
And with ReFS (specifically CSVFS_Refs), that really shouldn’t happen. Especially since we were running the upgrade manually instead of through Cluster Aware Updating - carefully draining nodes, stopping services and waiting for storage repair and resilvering jobs to complete, and doing things by the book.
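For the record, the per-node routine looked roughly like the sketch below. This is a simplified reconstruction, not our exact script; the node name and polling interval are placeholders:

```powershell
# Drain roles off the node before touching firmware (placeholder node name)
Suspend-ClusterNode -Name "HCI-NODE-01" -Drain -Wait

# Wait until all Storage Spaces Direct repair/resilver jobs have finished
while (Get-StorageJob | Where-Object { $_.JobState -ne 'Completed' }) {
    Start-Sleep -Seconds 60
}

# Sanity check: every virtual disk should report healthy before moving on
Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy'

# After the firmware update and reboot, bring the node back into the cluster
Resume-ClusterNode -Name "HCI-NODE-01" -Failback Immediate
```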
But nope. The volumes just died. In Dutch we’d call that KUT.
Thankfully, we had backups. But restoring them meant losing about four hours of data. So instead, I rolled up my sleeves and tried salvaging the volumes directly. You can't keep me away from a challenge! GRR!
Refsutil
Enter refsutil.exe: Microsoft’s tool for ReFS volumes.
Here's how I used it:
- Mounted the virtual volumes using two nodes:
  - One as the owner of the broken shared volume.
  - One running a custom recovery role with the volume attached as a resource, so I could work on two volumes at a time.
- Used diskpart to find the virtual disk, select it, and assign it a drive letter.
- Created a fresh recovery volume in our storage pool:
New-Volume -FriendlyName "Recovery" -FileSystem CSVFS_ReFS -StoragePoolFriendlyName S2D* -Size 5TB
- Ran refsutil salvage:
refsutil salvage -QA D: C:\ClusterStorage\Recovery\temp C:\ClusterStorage\Recovery\recovered -x -v
The key options we used:
- -Q → Quick mode (faster scan; our data was in great shape, the corruption was more of a phantom).
- -A → Salvage everything possible.
- -x → Extract salvaged files into the destination folder.
- -v → Verbose output for progress.
- Finally, replaced each role's linked storage with the recovered volumes (the ones ending in _2) using a PowerShell script; a rough sketch of that script follows below.
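The repoint script wasn't anything fancy. Conceptually it was a loop like this sketch - my reconstruction, with hypothetical volume names and without any of the safety checks we had around it:

```powershell
# Map each broken CSV path to its recovered counterpart (hypothetical volume names)
$pathMap = @{
    'C:\ClusterStorage\Volume01' = 'C:\ClusterStorage\Volume01_2'
    'C:\ClusterStorage\Volume02' = 'C:\ClusterStorage\Volume02_2'
}

# Re-point every VM disk that still references one of the old paths
foreach ($vm in Get-VM) {
    foreach ($disk in Get-VMHardDiskDrive -VM $vm) {
        foreach ($old in $pathMap.Keys) {
            if ($disk.Path -like "$old\*") {
                $newPath = $disk.Path.Replace($old, $pathMap[$old])
                Set-VMHardDiskDrive -VMHardDiskDrive $disk -Path $newPath
            }
        }
    }
}

# For clustered VM roles you'd follow up with something like
# Update-ClusterVirtualMachineConfiguration -VMId $vm.VMId
```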
And it worked perfectly.
Firmware Upgrade
Fast-forward to today. We were upgrading firmware again, this time with a Dell engineer watching. Guess what?
We lost another volume.
Even the engineer admitted he had never seen that happen before. Just our luck.
Naturally, I thought I could show off my recovery skills. But then… surprise twist: Microsoft had silently upgraded the ReFS version when we elevated the cluster functional level to 23H2.
Refsutil simply doesn't support salvaging the new version. And the kicker? The updated refsutil isn't even included in Azure Stack Local 23H2.
I tried grabbing refsutil from a Windows Server 2025 build, only to be greeted with: “not supported.”
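If you want to see which on-disk ReFS version a volume actually ended up on after the cluster functional level bump, fsutil can tell you - assuming your build's fsutil exposes the refsinfo subcommand (the CSV mount point here is just a placeholder):

```powershell
# Query the on-disk ReFS format version of a volume (placeholder path)
fsutil fsinfo refsinfo C:\ClusterStorage\Volume01
```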
Production Is Preview, Preview Is Production
So here we are again: end users, effectively beta-testing in production. Microsoft seems to have done it again: "preview" behaves like production, and "production" feels like preview.
Honestly, I don’t know what’s going on anymore. What I do know is this: this was the last time I’ll be touching Azure Stack Local.
Proost 🍻,
P.S. Are we the only ones whose Azure Stack Local is stuck in a botched state where Windows Admin Center says we have to update in Azure, and Azure says we have to be on 23H2 before we can update?
Azure, WAC, whatever; we don't trust CAU anyway, so manual it was.