Major update: WD SN770 firmware update — zero crashes in 55+ hours
After 102 sync floods in 3.5 months, this may finally be the fix — or at least a very significant mitigation.
TL;DR
I updated the WD_BLACK SN770 1TB NVMe firmware from 731100WD to 731120WD. Since then: 55 hours and 25 minutes of cumulative awake time across 7 sessions, zero crashes. For context, I was averaging a crash every ~12–15 hours, with some days having 5 crashes.
How I got here
After 102 Data Fabric Sync Flood crashes and exhausting every software-side mitigation (3 kernels, dcdebugmask, processor.max_cstate=1, stock Ubuntu live USB — all crashed), I was running out of options. I have an open RMA with Framework support, but honestly the experience has been frustrating — slow responses, and the suggestions (run the log-helper script, try with one RAM stick at a time in each slot) felt like generic troubleshooting that didn’t account for the extensive testing I’d already done and shared with them. I’d already tried different RAM configs, different kernels, a stock live USB — the data was all there. So I kept investigating on my own, and turned to the PCIe link to the NVMe SSD.
Step 1 — Making PCIe errors visible: The Framework BIOS (via AMD AGESA) refuses to grant AER (Advanced Error Reporting) control to the OS. This means the kernel is completely blind to PCIe errors — they happen silently with no logging, no interrupts, no recovery. I added pcie_ports=native to bypass this and force the kernel’s AER driver to activate.
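If you want to reproduce this, here's roughly how to make the parameter persistent and verify that the kernel actually took over AER (GRUB-based Ubuntu assumed; adjust for your distro and bootloader):

# Add pcie_ports=native to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_ports=native"
sudo update-grub && sudo reboot
# After reboot: confirm the parameter is active and the AER service driver came up
cat /proc/cmdline
sudo dmesg | grep -i 'AER: enabled'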
Step 2 — What I found: The NVMe link was generating correctable PCIe errors continuously — about 30 per awake-hour. RxErr (receiver errors) and BadTLP (corrupted packets) on the SSD, Timeout (completion timeouts) on the root port. Errors came in correlated pairs: a corrupted packet arrives → the receiver rejects it → the sender never gets an acknowledgment → timeout. This is the signature of a marginal PCIe link.
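If you want to check your own link, this is roughly how the errors show up once AER is active (your PCI addresses and device names will differ):

# Locate the NVMe drive on the PCIe bus
lspci | grep -i 'non-volatile memory'
# Recent AER reports for the drive and its root port
sudo journalctl -k | grep -iE 'AER|RxErr|BadTLP|Timeout' | tail -n 20
# Per-device correctable-error counters (newer kernels expose these in sysfs)
grep . /sys/bus/pci/devices/*/aer_dev_correctable 2>/dev/null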
Step 3 — A community post that connected the dots: A FW16 user in the thread Framework 16 Re-occurring BSOD on this very forum reported that updating their WD SN770 firmware via WD Dashboard fixed their recurring crashes. That post is what directly led me to try this — credit where it’s due.
It made sense of the PCIe errors: the WD SN770 is a DRAM-less NVMe that uses Host Memory Buffer (HMB) — it borrows 200 MB of your system RAM via PCIe for its internal operations. WD issued a critical firmware advisory for HMB bugs causing BSODs on Windows 11 24H2, and the Proxmox/OpenZFS community confirmed HMB problems affect non-Windows OSes too. The mechanism fits perfectly: buggy HMB firmware → erratic PCIe transactions → correctable errors escalate → Data Fabric can’t recover (because AER is disabled) → Sync Flood.
A note on WD’s advisory scope: WD’s advisory only lists the 2 TB models (SN770 2TB, SN770M 2TB) as affected. My drive is a 1 TB — not mentioned in the advisory at all. Yet it uses the same 200 MB HMB (confirmed via nvme id-ctrl), and appears to have been suffering from the exact issue the advisory describes. If the crash-free streak holds, WD’s advisory is incomplete — the 1 TB SN770 should be listed as an affected model, and the issue is not limited to Windows 11 24H2 BSODs. Linux users experiencing Data Fabric Sync Floods would have no reason to think their 1 TB drive needs this update based on WD’s current documentation.
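If you want to check your own drive: the HMB sizes are reported in the controller identify data in 4 KiB units, and the kernel logs the actual allocation at boot. Roughly:

# HMB preferred / minimum size, in 4 KiB units (~51200 x 4 KiB ~= 200 MiB)
sudo nvme id-ctrl /dev/nvme0 | grep -iE 'hmpre|hmmin'
# What the kernel actually allocated for the drive
sudo dmesg | grep -i 'host memory buffer'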
Step 4 — The firmware update: On March 8, immediately after crash #102, I updated to 731120WD. No crashes since. This is by far the most significant change in my entire 3.5-month investigation.
How to update (Linux, no Windows needed)
# Download firmware
curl -k -o /tmp/731120WD.fluf \
"https://wddashboarddownloads.wdc.com/wdDashboard/firmware/WD_BLACK_SN770_1TB/731120WD/731120WD.fluf"
# Flash to firmware slot 2 (slot 1 keeps old firmware as fallback)
sudo nvme fw-download /dev/nvme0 -f /tmp/731120WD.fluf
sudo nvme fw-commit -s 2 -a 3 /dev/nvme0
# Reboot to activate
sudo reboot
References: sorend’s gist and the Framework community WD update guide. The -k flag on curl is needed because WD’s CDN SSL certificate had expired at the time of download.
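After the reboot, it's worth confirming that the new image is active and the old one is still sitting in slot 1 as a fallback, e.g.:

# Firmware slot log: shows which slot is active and the revision in each slot
sudo nvme fw-log /dev/nvme0
# Running firmware revision should now read 731120WD
sudo nvme id-ctrl /dev/nvme0 | grep -i '^fr '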
Who should try this
If you have a WD NVMe drive (SN770, SN770M, SN850X, or similar DRAM-less models) and are experiencing sync floods — check your firmware version and update if possible. These drives all use HMB, and HMB bugs can generate the kind of PCIe errors that the Data Fabric would choke on.
Check your current firmware:
sudo nvme id-ctrl /dev/nvme0 | grep -i "fr "
Caveats
- 55 hours is promising but not definitive proof. My longest previous streak was ~64 hours (before crash #19). I’ll continue monitoring and update this thread.
- The PCIe correctable errors (RxErr, BadTLP) may still be present after the update — what matters is whether the new firmware eliminates the conditions that let them escalate into the uncorrectable errors that trigger a Sync Flood (a rough way to track the error rate is sketched after this list).
- This may not explain all sync floods across all hardware configurations. But for anyone with a WD DRAM-less NVMe, the firmware is the lowest-hanging fruit to try.
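A rough way to put a number on that second caveat, assuming AER is still enabled via pcie_ports=native (counts reset each boot, and uptime is only a proxy for awake time if you suspend):

# Corrected AER events this boot, normalized against uptime
errs=$(sudo journalctl -k -b | grep -ci 'corrected error')
hours=$(awk '{printf "%.1f", $1/3600}' /proc/uptime)
echo "$errs corrected PCIe errors in ${hours}h this boot"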
Framework and AMD: this needs your attention
@Jesse_Darnley @Matt_Hartley — I’m flagging this explicitly because after 3.5 months and 102 crashes, this is the first concrete, actionable lead pointing to a specific component with a specific fix. Not a kernel parameter workaround, not a “try this and hope” — a firmware update with a plausible mechanism backed by PCIe error data and a WD critical advisory. I’d really appreciate acknowledgment and feedback from the Framework engineering team.
Specifically:
- Should this be relayed to the AMD BIOS/firmware team? AMD told us on GitHub that debugging sync floods “needs to be done by Framework BIOS team.” The PCIe AER data and NVMe firmware correlation give them something concrete to investigate — this is no longer a “random unreproducible crash.”
- Should enabling AER in the BIOS be considered? The current AMD AGESA configuration refuses to grant PCIe Advanced Error Reporting control to the OS. This means every Framework AMD laptop is completely blind to PCIe errors — they happen silently with no logging, no interrupts, no recovery. I only found the correctable error stream on my NVMe link by forcing pcie_ports=native to bypass the BIOS. Without that, I’d still be in the dark after 102 crashes. Enabling AER would immediately give every affected user — and your own support team — visibility into what’s going wrong.
- Should you add NVMe firmware version to the sync flood diagnostic workflow? When a user reports 0x08000800, the first question should be: what NVMe drive and firmware version? WD DRAM-less drives (SN770, SN770M, SN850X) use Host Memory Buffer and should be flagged for firmware updates.
- Is there internal data on NVMe models across sync flood RMAs? If WD DRAM-less drives are overrepresented, that would confirm this finding and could prevent unnecessary mainboard replacements.
This issue has been affecting users across FW13, FW16, multiple AMD CPUs, and multiple configurations for over a year. I know I’m not alone in feeling that the community response to sync floods has been lacking — on GitHub, some users have switched to Intel boards, others have expressed real disappointment in how Framework has handled this. A clear, engaged response here would matter to a lot of people who are watching these threads and wondering whether to keep trusting the platform.