FW13 AMD AI 300 (HX 370): 48 Data Fabric Sync Flood crashes in 2 months — comprehensive data

Hi everyone,

I’m sharing a detailed report of persistent Data Fabric Sync Flood crashes (0x08000800) on my Framework 13 AMD Ryzen AI 300 in the hope that the data helps Framework and AMD engineers root-cause this issue. I’ve been systematically logging every crash since December 2025.

@Jesse_Darnley mentioned finding a reproducible trigger in June 2025 (power-adapter related, since fixed), but I haven’t seen further updates. This post adds a large, methodical dataset from a different angle: my crashes happen during normal use, not just at sleep/wake, and reproduce on a stock Ubuntu live USB — ruling out custom kernels and installed software.

System Information

| Component | Value |
|---|---|
| Laptop | Framework Laptop 13 (AMD Ryzen AI 300 Series) |
| CPU | AMD Ryzen AI 9 HX 370 w/ Radeon 890M |
| RAM | 2×48 GB Crucial DDR5 (96 GB total); originally 1×48 GB (Framework stock) |
| Storage | 1 TB WD_BLACK SN770 NVMe, firmware 731100WD |
| Wi-Fi | Intel AX210 |
| BIOS | 03.05 (2025-10-30) |
| Kernel | 6.18.0-fw13 (custom built from mainline); previously 6.14-1016 (Ubuntu) |
| OS | Ubuntu 24.04.3 LTS |
| Kernel args | amdgpu.dcdebugmask=0x12 (disables PSR + Stutter mode); just changed to 0x412 (adds Panel Replay disable) |
| Power profile | Balanced |

The Problem

The dmesg message after every crash:

x86/amd: Previous system reset reason [0x08000800]: an uncorrected error caused a data fabric sync flood event

The crash is near-instantaneous — no kernel panic, no oops, no pstore data, no kdump capture. The hardware simply resets. Occasionally I notice a brief freeze (~5 seconds) before the reset, sometimes with a CPU core spiking to 100% in the system monitor. The only post-mortem evidence is the reset reason register read at next boot.
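For anyone who wants to check this on their own machine, the two standard places to look after an unexpected reboot are the kernel log of the new boot and pstore (generic commands, nothing specific to my setup):

```
# The reset reason is only logged by the kernel of the boot *after* the crash
journalctl -k -b 0 | grep -i "reset reason"

# pstore would contain a panic/oops dump if the kernel had had time to write one;
# with these sync floods it stays empty
ls -l /sys/fs/pstore/
```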

Crash Statistics: 48 Sync Floods

I log every crash with DIMM temperatures (from collectd/spd5118), awake uptime between crashes, and activity at crash time. DIMM temperature monitoring was added starting with crash #7. Here is the full table:

| # | Date | Uptime | RAM | Kernel | DIMM temps (°C) |
|---|---|---|---|---|---|
| 1 | 2025-12-02 11:58 | ? | 1×48 | 6.14 | |
| 2 | 2025-12-02 12:35 | < 1 h | 1×48 | 6.14 | |
| 3 | 2025-12-03 20:15 | ~28 h | 1×48 | 6.14 | |
| 4 | 2025-12-11 18:38 | ? | 2×48 | 6.14 | |
| 5 | 2025-12-11 19:28 | < 1 h | 2×48 | 6.14 | |
| 6 | 2025-12-11 20:13 | < 1 h | 2×48 | 6.14 | |
| 7 | 2025-12-15 15:46 | ~41 h | 2×48 | 6.18 | 56–61 |
| 8 | 2025-12-23 16:19 | ~7 h | 2×48 | 6.18 | 47–50 |
| 9 | 2025-12-24 10:24 | ~1 h | 2×48 | 6.18 | 59–67 |
| 10 | 2025-12-25 07:04 | ~21 h | 2×48 | 6.18 | 61–66 |
| 11 | 2025-12-25 14:48 | ~8 h | 2×48 | 6.18 | 50–53 |
| 12 | 2025-12-26 04:56 | ~2 h | 2×48 | 6.18 | 65–72 |
| 13 | 2025-12-26 06:36 | ~1 h 24 | 2×48 | 6.18 | 42–46 |
| 14 | 2025-12-28 05:50 | ~23 h | 2×48 | 6.18 | 47–51 |
| 15 | 2025-12-31 04:30 | ~35 h | 2×48 | 6.18 | 52–54 |
| 16 | 2025-12-31 12:09 | ~4 h | 2×48 | 6.18 | 68–73 |
| 17 | 2026-01-01 07:16 | ~10 h | 2×48 | 6.18 | 57–71 |
| 18 | 2026-01-01 10:06 | ~3 h | 2×48 | 6.18 | 60–66 |
| 19 | 2026-01-06 09:00 | ~64 h | 2×48 | 6.18 | 57–60 |
| 20 | 2026-01-06 10:39 | ~1 h 37 | 2×48 | 6.18 | 61–65 |
| 21 | 2026-01-06 11:32 | ~51 min | 2×48 | 6.18 | 52–54 |
| 22 | 2026-01-07 08:39 | ~12 h | 2×48 | 6.18 | 56–66 |
| 23 | 2026-01-10 10:24 | ~41 h | 2×48 | 6.18 | 57–64 |
| 24 | 2026-01-12 02:54 | ~23 h | 2×48 | 6.18 | 49–51 |
| 25 | 2026-01-12 15:31 | ~12 h | 2×48 | 6.18 | 54–58 |
| 26 | 2026-01-14 05:53 | ~20 h | 2×48 | 6.18 | 55–57.5 |
| 27 | 2026-01-15 10:58 | ~21 h | 2×48 | 6.18 | 57–62 |
| 28 | 2026-01-15 13:09 | ~2 h | 2×48 | 6.18 | 50–53 |
| 29 | 2026-01-17 01:14 | ~18 h | 2×48 | 6.18 | 48.5–64 |
| 30 | 2026-01-19 05:49 | ~26 h | 2×48 | 6.18 | 51.5–53.5 |
| 31 | 2026-01-20 11:36 | ~20 h | 2×48 | 6.18 | 75–81 |
| 32 | 2026-01-24 08:29 | ~54 h | 2×48 | 6.18 | 61–71 |
| 33 | 2026-01-26 04:09 | ~14 h | 2×48 | 6.18 | 56–63 |
| 34 | 2026-01-27 07:47 | ~18 h | 2×48 | 6.18 | 62–71.5 |
| 35 | 2026-01-27 10:04 | ~2 h 17 | 2×48 | 6.18 | 63–68.5 |
| 36 | 2026-01-28 03:45 | ~11 h | 2×48 | 6.18 | 54–61.5 |
| 37 | 2026-01-28 04:02 | ~15 min | 2×48 | 6.18 | 62–69 |
| 38 | 2026-01-30 13:27 | ~37 h 30 | 2×48 | 6.18 | 57.5–61.5 |
| 39 | 2026-01-31 08:15 | ~9 h 58 | 2×48 | 6.18 | 60–71 |
| 40 | 2026-01-31 08:37 | ~22 min | 2×48 | 6.18 | 60–67 |
| 41 | 2026-01-31 08:45 | ~7 min | 2×48 | 6.18 | 63.5–72.5 |
| 42 | 2026-01-31 12:44 | ~3 h 55 | 2×48 | 6.18 | 48–51 |
| 43 | 2026-02-01 10:54 | ~8 h 53 | 2×48 | 6.18 | 50–52.5 |
| 44 | 2026-02-01 16:19 | ~8 min | 2×48 | 6.18 | 60–67 |
| 45 | 2026-02-02 17:26 | ~18 h | 2×48 | 6.18 | 62.5–67.5 |
| 46 | 2026-02-03 01:41 | ~1 h 36 | 2×48 | 6.18 | 62–68 |
| 47 | 2026-02-03 01:55 | ~13 min | 2×48 | 6.18 | 64.5–72 |
| 48 | 2026-02-03 ~04:50 | ~2 h 51 | 2×48 | 6.11* | 67–69 |

* Crash #48 occurred on a stock Ubuntu 24.04.3 live USB (kernel 6.11.0-17-generic, no custom kernel args, no amdgpu.dcdebugmask, no encrypted root, no collectd/Docker). Same 0x08000800 reset code.
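For anyone who wants to log the same DIMM temperature data: recent kernels include an spd5118 hwmon driver that exposes the DDR5 SPD hub temperature sensors, which lm-sensors can then read. A minimal sketch, assuming the driver is available as a module on your kernel:

```
# Load the DDR5 SPD hub temperature sensor driver (may already be built in or autoloaded)
sudo modprobe spd5118

# Each DIMM then shows up as an spd5118-* chip with one temperature reading
sensors | grep -A 3 -i spd5118
```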

Uptime between crashes ranges from 7 minutes to 64 hours. Average is roughly 12–15 hours of awake time. Last week (Jan 27 – Feb 3): 15 crashes, average ~7 h 40 min, min 7 min, max 37 h 30 min — the frequency is increasing. Note: uptime is cumulative awake time only — suspend periods are excluded. Longer uptimes span multiple wake/suspend cycles (e.g., the 64 h entry spans 11 sessions over 5 days).

What I’ve Ruled Out

| Variable | Tested | Result |
|---|---|---|
| Temperature | Crashes at DIMM temps 42–46 °C (cold) and 75–81 °C (hot). Ran 1 h 45 min video call at 75–77 °C without crash. Ran 1 h+ session at DIMM2 83–84 °C SPD (potentially 96–101 °C hotspot) without crash. Next crash was at 57 °C. | Not the cause |
| Cooling | Used laptop cooling stand with fans for weeks — dramatic temp reduction, zero impact on crash frequency | Not the cause |
| Kernel | 6.14-1016 (Ubuntu stock), 6.18.0-fw13 (custom mainline), 6.11.0-17 (stock Ubuntu live USB). 6 crashes on 6.14, 41 on 6.18, 1 on stock 6.11 | Not the cause |
| Custom software | Live USB test: stock Ubuntu 24.04.3, no custom kernel args, no amdgpu.dcdebugmask, no encrypted root, no collectd/Docker — crashed after ~2 h 19 min | Not the cause |
| RAM config | 1×48 GB from Framework → 2×48 GB Crucial DDR5 | Not the cause |
| iGPU VRAM | BIOS: 0.5 GB → 16 GB | No effect |
| Power supply | Framework charger + third-party 100 W PSU | No effect |
| CPU load | Crashes during idle, during terminal work, during compilation, during Firefox | No correlation |
| amdgpu PSR | amdgpu.dcdebugmask=0x12 — this fixed an earlier, much worse crash pattern (crashes within minutes of boot). Sync floods still occur with it. | Mitigates a different issue |

What I Haven’t Tried Yet

  • amdgpu.dcdebugmask=0x412 — just applied, adds Panel Replay disable (DC_DISABLE_REPLAY) to my existing PSR + Stutter disable. No data yet on whether it changes crash frequency.

Key Observations

  1. Reproduces on stock Ubuntu live USB. Crash #48 occurred on an unmodified Ubuntu 24.04.3 live USB (kernel 6.11.0-17-generic) — no custom kernel args, no amdgpu.dcdebugmask, no encrypted root, no installed software. This rules out my kernel build, configuration, and software stack as contributing factors. The issue is firmware or hardware.

  2. amdgpu.dcdebugmask=0x12 mitigates a related but separate issue. Without it, my first install on the HX 370 board had crashes within minutes of boot — sometimes before the kernel fully loaded. With it, I get daily-ish crashes instead. However, the live USB crashed after ~2 h 19 min without this flag, suggesting the display controller / PSR triggers a more aggressive crash pattern, while the sync floods are a distinct underlying problem.

  3. Crashes happen during active use AND idle. Several crashes occurred while I was away from the computer (lid open, system idle, no screensaver). One notable crash (#14) happened a few minutes after I left to eat — could be a power state transition.

  4. Clustering pattern: Jan 31 had 4 crashes (08:15, 08:37, 08:45, 12:44). Once the system starts crashing, it tends to crash again soon: the gaps between the first three crashes were only 22 min and 7 min.

  5. The RDSEED bug exists on my CPU. The kernel logs “RDSEED32 is broken. Disabling the corresponding CPUID bit.” This is a known AMD hardware bug on the HX 370. While the kernel works around it for random number generation, it signals silicon-level issues on this platform.

What Would Help

  • Framework engineering: Is there any firmware/EC diagnostic I can run? I’m happy to install fw-ectool, run custom kernels, or enable any debug tracing you need. I have collectd logging temperatures, detailed Framework diagnostic logs for each crash, and can provide anything else.

  • Other FW13 AI 300 (HX 370) users: Are you seeing 0x08000800 in your dmesg? Run journalctl -b 0 | grep "reset reason" after an unexpected reboot (a small loop that scans your last few boots is sketched after this list). Please report your findings here. Also, if your FW13 HX 370 is running stable on Linux, I’d love to hear about it — I’m trying to determine whether this is a widespread platform issue or specific to my unit, and positive data points matter as I’m considering a replacement.

  • Framework team: It would help the community to know roughly how many RMAs have been filed for sync flood / 0x08000800 crashes on any Framework model (FW13/FW16) with an AMD processor. Understanding whether this affects a small batch or a significant portion of units would help owners decide whether to wait for a fix or request a replacement.
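Since the reset-reason line only appears in the boot immediately after a crash, here is the small loop mentioned above for scanning your recent boot history (assumes persistent journald storage, which is the Ubuntu default):

```
# Check the kernel log of the current boot and the five before it for the reset reason
for b in 0 -1 -2 -3 -4 -5; do
    echo "=== boot $b ==="
    journalctl -k -b "$b" --no-pager 2>/dev/null | grep -i "reset reason"
done
```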

RMA Status

I have an open support ticket with Framework. They’ve asked me to provide diagnostic logs (using their log-helper script), which I’ve done for every crash. Awaiting next steps.

Related Threads & References

Framework Community:

GitHub:

Non-Framework reports (same error):

Kernel / AMD:


What devices do you have plugged into the card slots? I have proved that devices can cause this.

The devices I tested varied enough, while the crashes stayed consistent enough, to undermine the hypothesis that they are involved. I have two setups with different screens/docks, and I also tested with nothing connected at all; I had at least one crash in each configuration. Many crashes occurred with the same setup, with no changes to it between runs, at frequencies ranging from minutes after boot to several days. A crash never happened while connecting or disconnecting a device. Some of my devices are clearly faulty (one dock developed visible issues over time) and trigger plenty of connection problems in Ubuntu or during hardware detection, but I have never been able to link them to a crash in any clear way. I used two different 4K screens over long periods, through two different docks and also connected directly, with no change in crash frequency. Crashes also happened with no dock and no device connected at all.

By any chance, do you own a Framework on AMD? Do you get random freezes?

Hi,

I started this, so yes:

For background, there have been multiple false negatives while investigating this, so nothing should be ruled out unless you have physically verified it yourself, or at least multiple people have reproduced it.

It might also be mitigated with the kernel parameter:
processor.max_cstate=1

Can you see if it helps your situation? You seem to be able to reproduce it more often than I can.

Taken from:
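On Ubuntu, one way to test this is to add the parameter to the kernel command line via GRUB (a sketch; adjust for your distro and bootloader):

```
# 1. Append the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash processor.max_cstate=1"
sudoedit /etc/default/grub

# 2. Regenerate the GRUB config and reboot
sudo update-grub
sudo reboot

# 3. Verify the running kernel picked it up
cat /proc/cmdline
```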

Update: 85 sync floods in 3 months (was 48 in 2 months)

37 more crashes since my original post. Now at 85 Data Fabric Sync Flood crashes over 3 months (Dec 2025 – Mar 2026). Here’s what I’ve tested and learned.

New mitigations tested — neither worked

amdgpu.dcdebugmask=0x412 (applied Feb 4): upgraded from 0x12, adds Panel Replay disable. 37 sync floods since. No improvement.

processor.max_cstate=1 (restricts CPU to C1 halt, no deeper C-states): tested twice.

  • First run (Feb 5–8): zero sync floods, but it caused severe suspend hangs — the system couldn’t wake from s2idle. Three consecutive suspend hangs, one iwlwifi soft lockup, one networking failure. The suspend hangs were a worse problem that may have hidden the sync floods during this period.
  • Second run (Feb 27+): re-added with suspend disabled to avoid the hang issue. 4 sync floods in ~3 days, about the same rate as without it. No effect.

Crash clustering continues

  • Feb 19: 5 crashes in one day (08:16, 10:03, 10:26, 10:43, 11:28 — four within 85 min)
  • Feb 23: 3 crashes in 73 minutes (01:24, 01:28, 02:37)
  • Uptime between crashes still ranges from 1 minute to 35 hours

New crashes (#49–85)

| # | Date | Uptime | DIMM temps (°C) | Notes |
|---|---|---|---|---|
| 49 | 2026-02-04 14:48 | ~24 h 26 | 69.5–74.5 | first with dcdebugmask=0x412 |
| 50 | 2026-02-04 15:17 | ~28 min | 65–70.5 | |
| 51 | 2026-02-04 16:09 | ~50 min | 59–72.5 | |
| 52 | 2026-02-05 08:46 | ~7 h 47 | 72.5–80.5 | |
| – | Feb 5–8 | – | – | max_cstate=1 active — 0 sync floods, 3 suspend hangs |
| 53 | 2026-02-08 06:25 | ~1 min | 51–67 | first boot after removing max_cstate=1 |
| 54 | 2026-02-12 10:13 | ~1 h 48 | 62–67 | |
| 55 | 2026-02-14 02:06 | ~21 h 15 | 58.5–64 | |
| 56 | 2026-02-14 04:55 | ~2 h 45 | 46.5–50 | |
| 57 | 2026-02-15 08:45 | ~6 h 15 | 65–69.5 | |
| 58 | 2026-02-15 11:42 | ~2 h 56 | 72–80 | |
| 59 | 2026-02-16 01:05 | ~7 h 54 | 78–85 | |
| 60 | 2026-02-17 06:36 | ~13 h 28 | 74.5–80 | |
| 61 | 2026-02-17 14:34 | ~7 h 54 | 65–73 | |
| 62 | 2026-02-18 10:38 | ~12 h 16 | 63–70 | |
| 63 | 2026-02-18 12:24 | ~1 h 46 | 69–74 | |
| 64 | 2026-02-19 08:16 | ~11 h 55 | 57–65 | |
| 65 | 2026-02-19 10:03 | ~1 h 46 | 58–66 | |
| 66 | 2026-02-19 10:26 | ~21 min | 60–66.5 | |
| 67 | 2026-02-19 10:43 | ~17 min | 60.5–66 | |
| 68 | 2026-02-19 11:28 | ~45 min | 54.5–64 | |
| 69 | 2026-02-20 01:00 | ~9 h 01 | 71–80 | |
| 70 | 2026-02-20 01:31 | ~30 min | 63–71.5 | |
| 71 | 2026-02-21 15:56 | ~27 h 38 | 56–68 | |
| 72 | 2026-02-22 07:18 | ~7 h 12 | 72.5–79.5 | |
| 73 | 2026-02-22 07:43 | ~25 min | 61–75.5 | |
| 74 | 2026-02-22 10:49 | ~3 h 06 | 63.5–70 | |
| 75 | 2026-02-23 01:24 | ~7 h 26 | 64–70.5 | |
| 76 | 2026-02-23 01:28 | ~4 min | 62–73.5 | |
| 77 | 2026-02-23 02:37 | ~1 h 08 | 61–65.5 | |
| 78 | 2026-02-24 02:37 | ~13 h 45 | 64–72 | |
| 79 | 2026-02-24 03:06 | ~27 min | 63.5–70 | |
| 80 | 2026-02-24 08:53 | ~4 h | 62–65.5 | |
| 81 | 2026-02-27 04:08 | ~35 h 25 | 62–67 | |
| 82 | 2026-02-27 14:00 | ~9 h 48 | 65–75 | max_cstate=1 re-added, suspend disabled |
| 83 | 2026-02-28 06:05 | ~16 h 03 | 51–55.5 | max_cstate=1, suspend disabled |
| 84 | 2026-03-01 11:19 | ~28 h 36 | 61–67.5 | max_cstate=1, suspend disabled |
| 85 | 2026-03-01 19:00 | ~7 h 38 | 51–54 | max_cstate=1, suspend disabled |

All crashes on 2×48 GB, kernel 6.18.0-fw13, amdgpu.dcdebugmask=0x412. DIMM temps from collectd/spd5118.

Where things stand

Everything I can change on the software side has been tried. The crash reproduces across:

  • 3 kernels (6.14, 6.18, stock 6.11 live USB)
  • With and without amdgpu.dcdebugmask (0x12, 0x412, none)
  • With and without processor.max_cstate=1
  • With and without suspend
  • At DIMM temps from 42 °C to 85 °C
  • During idle and under load

I’m running out of things to try on my end. @Jesse_Darnley, @Matt_Hartley — any update on sync flood investigation? Happy to run any firmware/EC diagnostics or test patches.

I am curious whether this also occurs on Windows, or whether it is an error somewhere in the Linux stack.
That would help distinguish a hardware/firmware issue from a software issue.

I’m sorry you’ve had such a hard time with your mainboard, my 7840U hasn’t had a single crash yet.

Major update: WD SN770 firmware update — zero crashes in 55+ hours

Updating from 102 sync floods in 3.5 months to what may be the fix — or at least a very significant mitigation.

TL;DR

I updated the WD_BLACK SN770 1TB NVMe firmware from 731100WD to 731120WD. Since then: 55 hours and 25 minutes of cumulative awake time across 7 sessions, zero crashes. For context, I was averaging a crash every ~12–15 hours, with some days having 5 crashes.

How I got here

After 102 Data Fabric Sync Flood crashes and exhausting every software-side mitigation (3 kernels, dcdebugmask, processor.max_cstate=1, stock Ubuntu live USB — all crashed), I was running out of options. I have an open RMA with Framework support, but honestly the experience has been frustrating — slow responses, and the suggestions (run the log-helper script, try with one RAM stick at a time in each slot) felt like generic troubleshooting that didn’t account for the extensive testing I’d already done and shared with them. I’d already tried different RAM configs, different kernels, a stock live USB — the data was all there. So I kept investigating on my own, and turned to the PCIe link to the NVMe SSD.

Step 1 — Making PCIe errors visible: The Framework BIOS (via AMD AGESA) refuses to grant AER (Advanced Error Reporting) control to the OS. This means the kernel is completely blind to PCIe errors — they happen silently with no logging, no interrupts, no recovery. I added pcie_ports=native to bypass this and force the kernel’s AER driver to activate.
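If you want to check this on your own machine: the result of the _OSC negotiation is in the kernel log, and after booting with pcie_ports=native the AER service should attach to the root ports. A sketch (the exact wording of the log lines varies by kernel version):

```
# What the firmware granted the OS at boot; look for AER in the _OSC lines
journalctl -k -b 0 | grep -i "_OSC"

# After adding pcie_ports=native and rebooting, the AER port service should register,
# typically logged as "AER: enabled with IRQ ..." on the root ports
journalctl -k -b 0 | grep -i "AER: enabled"
```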

Step 2 — What I found: The NVMe link was generating correctable PCIe errors continuously — about 30 per awake-hour. RxErr (receiver errors) and BadTLP (corrupted packets) on the SSD, Timeout (completion timeouts) on the root port. Errors came in correlated pairs: a corrupted packet arrives → the receiver rejects it → the sender never gets an acknowledgment → timeout. This is the signature of a marginal PCIe link.
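The rate is easy to quantify from the per-device AER counters in sysfs once the kernel owns AER (a sketch, assuming the SSD is nvme0; the aer_dev_* attributes only exist while AER is active):

```
# Resolve the NVMe controller's PCI device and dump its AER counters
NVME_PCI=$(readlink -f /sys/class/nvme/nvme0/device)
cat "$NVME_PCI/aer_dev_correctable"
cat "$NVME_PCI/aer_dev_nonfatal" "$NVME_PCI/aer_dev_fatal"

# Or watch new AER events arrive in real time
journalctl -k -f | grep -iE "AER|RxErr|BadTLP|Timeout"
```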

Step 3 — A community post that connected the dots: A FW16 user in the thread Framework 16 Re-occurring BSOD on this very forum reported that updating their WD SN770 firmware via WD Dashboard fixed their recurring crashes. That post is what directly led me to try this — credit where it’s due.

It made sense of the PCIe errors: the WD SN770 is a DRAM-less NVMe that uses Host Memory Buffer (HMB) — it borrows 200 MB of your system RAM via PCIe for its internal operations. WD issued a critical firmware advisory for HMB bugs causing BSODs on Windows 11 24H2, and the Proxmox/OpenZFS community confirmed HMB problems affect non-Windows OSes too. The mechanism fits perfectly: buggy HMB firmware → erratic PCIe transactions → correctable errors escalate → Data Fabric can’t recover (because AER is disabled) → Sync Flood.

A note on WD’s advisory scope: WD’s advisory only lists the 2 TB models (SN770 2TB, SN770M 2TB) as affected. My drive is a 1 TB — not mentioned in the advisory at all. Yet it uses the same 200 MB HMB (confirmed via nvme id-ctrl), and appears to have been suffering from the exact issue the advisory describes. If the crash-free streak holds, WD’s advisory is incomplete — the 1 TB SN770 should be listed as an affected model, and the issue is not limited to Windows 11 24H2 BSODs. Linux users experiencing Data Fabric Sync Floods would have no reason to think their 1 TB drive needs this update based on WD’s current documentation.
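If you want to check what your own drive asks for, nvme-cli reports the Host Memory Buffer fields in the controller identify data (a sketch; hmpre/hmmin are in 4 KiB units, so divide by 256 for MiB):

```
# HMPRE = preferred HMB size, HMMIN = minimum HMB size the drive will accept
sudo nvme id-ctrl /dev/nvme0 | grep -Ei "hmpre|hmmin"

# How much host memory the kernel actually handed to the controller at boot
journalctl -k -b 0 | grep -i "host memory buffer"
```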

Step 4 — The firmware update: On March 8, immediately after crash #102, I updated to 731120WD. No crashes since. This is by far the most significant change in my entire 3.5-month investigation.

How to update (Linux, no Windows needed)

```
# Download firmware
curl -k -o /tmp/731120WD.fluf \
    "https://wddashboarddownloads.wdc.com/wdDashboard/firmware/WD_BLACK_SN770_1TB/731120WD/731120WD.fluf"

# Flash to firmware slot 2 (slot 1 keeps old firmware as fallback)
sudo nvme fw-download /dev/nvme0 -f /tmp/731120WD.fluf
sudo nvme fw-commit -s 2 -a 3 /dev/nvme0

# Reboot to activate
sudo reboot
```

References: sorend’s gist, Framework community WD update guide. The -k flag on curl is needed because WD’s CDN SSL certificate was expired at the time of download.
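To confirm the flash worked after the reboot, nvme-cli can show which firmware slot is active and what each slot contains (a sketch):

```
# frs1/frs2 list the firmware revision stored in each slot; afi indicates the active one
sudo nvme fw-log /dev/nvme0

# The controller should now report the new revision
sudo nvme id-ctrl /dev/nvme0 | grep -i "^fr "
```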

Who should try this

If you have a WD NVMe drive (SN770, SN770M, SN850X, or similar DRAM-less models) and are experiencing sync floods — check your firmware version and update if possible. These drives all use HMB, and HMB bugs can generate the kind of PCIe errors that the Data Fabric would choke on.

Check your current firmware:

```
sudo nvme id-ctrl /dev/nvme0 | grep -i "fr "
```

Caveats

  • 55 hours is promising but not definitive proof. My longest previous streak was ~64 hours (before crash #19). I’ll continue monitoring and update this thread.
  • The PCIe correctable errors (RxErr, BadTLP) may still be present after the firmware update — what matters is whether the firmware update eliminates the conditions that cause them to escalate to uncorrectable errors that trigger a Sync Flood.
  • This may not explain all sync floods across all hardware configurations. But for anyone with a WD DRAM-less NVMe, the firmware is the lowest-hanging fruit to try.

Framework and AMD: this needs your attention

@Jesse_Darnley @Matt_Hartley — I’m flagging this explicitly because after 3.5 months and 102 crashes, this is the first concrete, actionable lead pointing to a specific component with a specific fix. Not a kernel parameter workaround, not a “try this and hope” — a firmware update with a plausible mechanism backed by PCIe error data and a WD critical advisory. I’d really appreciate acknowledgment and feedback from the Framework engineering team.

Specifically:

  1. Should this be relayed to the AMD BIOS/firmware team? AMD told us on GitHub that debugging sync floods “needs to be done by Framework BIOS team.” The PCIe AER data and NVMe firmware correlation give them something concrete to investigate — this is no longer a “random unreproducible crash.”

  2. Should enabling AER in the BIOS be considered? The current AMD AGESA configuration refuses to grant PCIe Advanced Error Reporting control to the OS. This means every Framework AMD laptop is completely blind to PCIe errors — they happen silently with no logging, no interrupts, no recovery. I only found the correctable error stream on my NVMe link by forcing pcie_ports=native to bypass the BIOS. Without that, I’d still be in the dark after 102 crashes. Enabling AER would immediately give every affected user — and your own support team — visibility into what’s going wrong.

  3. Should you add NVMe firmware version to the sync flood diagnostic workflow? When a user reports 0x08000800, the first question should be: what NVMe drive and firmware version? WD DRAM-less drives (SN770, SN770M, SN850X) use Host Memory Buffer and should be flagged for firmware updates.

  4. Is there internal data on NVMe models across sync flood RMAs? If WD DRAM-less drives are overrepresented, that would confirm this finding and could prevent unnecessary mainboard replacements.

This issue has been affecting users across FW13, FW16, multiple AMD CPUs, and multiple configurations for over a year. I know I’m not alone in feeling that the community response to sync floods has been lacking — on GitHub, some users have switched to Intel boards, others have expressed real disappointment in how Framework has handled this. A clear, engaged response here would matter to a lot of people who are watching these threads and wondering whether to keep trusting the platform.


@Valentin_Lab
That is a really good find.
After I found out (and proved) that PCIe devices can cause a sync flood, what you have found here makes a lot of sense. I was not aware that the FW BIOS was suppressing PCIe errors.
A sync flood causes a forced reboot. I don’t see how Windows can do a BSOD for a sync flood.
Note: it is quite normal for a BIOS to suppress PCIe errors, because not many BIOSes support PCIe error recovery methods. But the Linux kernel does support PCIe error recovery.
This might also help some oculink users track down problems.
@Mario_Limonciello In case this is useful for you.


I’ve been fighting similar crashes on a FW13 running the AMD 350. It’s made the machine unusable for work since I never know when the rug will get pulled. I’ve had nearly every piece of hardware replaced thanks to FW’s great RMA help, but no real resolution. I found a reliable repro by transcoding video in Shotcut. I also found that the problem is less frequent under Fedora 43 than it was under Ubuntu. The only piece that hasn’t been replaced yet is my WD_BLACK SN850X. Thanks for the guidance on the firmware; I’ll update mine and see if it makes any difference. This feels like a really interesting common thread. I actually found this thread while doing some research as I install Windows on an older SSD I had lying around, to see if I could repro the problem there, but now I have a new avenue to explore.

Edit - There was a new firmware available for my drive. I’ve updated and we’ll see how it goes.

@Quentin_Hartman

Please try the kernel parameters:
pcie_ports=native pcie_ecrc=on

Then see if you get any AER errors in the logs on Linux.
If you are seeing AER errors, then it should tell you what device is causing the problem.
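If it helps, a quick sanity check after rebooting is to confirm both parameters are on the running kernel’s command line and then watch the kernel log for AER events (a sketch):

```
# Confirm the parameters took effect
grep -oE "pcie_ports=native|pcie_ecrc=on" /proc/cmdline

# Watch live for AER activity (correctable errors, BadTLP, RxErr, timeouts, ...)
journalctl -k -f | grep -iE "AER|PCIe Bus Error"
```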

I’ll add that into the mix. Thanks!

OK, so after updating the firmware on the drive, I’ve now had three workdays with no rugpulls, so this is a substantial improvement.

@James3 I added those kernel parameters. When I search my logs for AER, I see a bunch of messages like this:

```
Mar 14 21:35:36 emerald kernel: acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug AER]
```
but they are all timestamped from before I updated the firmware on the drive and added these params. Am I looking for the right thing?

Before you upgraded the firmware, the AER errors in the logs would have looked something like the lines below. Since you have now upgraded, you should not see any AER errors in the logs, so it looks like the firmware update fixed the problem for you.

```
[ 7542.204821] pcieport 0000:00:04.1: AER: Correctable error message received from 0000:63:01.0
[ 7542.204837] pcieport 0000:63:01.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
[ 7542.204841] pcieport 0000:63:01.0: device [8086:15da] error status/mask=00000080/00002000
[ 7542.204846] pcieport 0000:63:01.0: [ 7] BadDLLP
```

I’ve had something like 70 powered-on hours since doing this update, including some heavy activities that previously reproduced the rugpull reliably, and haven’t had a single one. At this point I consider the SSD firmware update a solution to this problem. I would love to see the support team include this possibility in their troubleshooting process for problems like mine. If they had known to ask about this and had guidance for upgrading SSD firmware, this problem would likely have been solved for me months ago. Huge thanks to @Valentin_Lab for this post; my machine would still be essentially unusable if not for the information you shared!


I’ve reached out to Framework directly on X: link. If you’ve been affected by this issue, amplifying would help get their attention.