Strix Halo / gfx1151: gfx ring timeout under mundane GL load

Which Linux distro are you using?

Kubuntu

Which release version?

26.04 LTS (in-place upgrade from 24.04)

Which kernel are you using?

7.0.0-15.15 (stock Ubuntu)

Which BIOS version are you using?

3.0.3
3.0.4 introduced long boot times and hangs; 3.0.5 fixes the boot time but not the other issues introduced in 3.0.4.

Which Framework Desktop model are you using? (AMD Ryzen™ AI Max 300 Series)

AMD Ryzen AI Max+ 395


Reporting an ongoing ring timeout pattern on Framework Desktop (Strix Halo, gfx1151) running Kubuntu 26.04 LTS with kernel 7.0.0-15.15. The system recovers cleanly via gpu_recovery=1 → MODE2 reset — no fabric flood, no hard hang, no power-button reset. But the timeouts are happening under unremarkable GL workload (today’s instance: Slack/Electron, just a chat app), and the canonical mitigation flags from prior forum threads either do nothing on this kernel build or appear to be misapplied.

Posting in the spirit of “here’s what’s happening with full data” rather than as a help request, because some of what I found contradicts advice circulating in older threads and may save other Strix Halo owners cycles.

Caveat emptor: I am operating above my pay grade and out of my depth. I am capable of figuring these things out, but I am NOT paid to do this on a daily basis, so please review these findings and recommendations through that lens.

Hardware and software

  • Machine: Framework Desktop, Strix Halo (Ryzen AI Max+ 395), gfx1151
  • OS: Kubuntu 26.04 LTS (in-place upgrade from 24.04)
  • Kernel: 7.0.0-15.15 (stock Ubuntu)
  • GPU PCIe ID: 0000:c2:00.0
  • BIOS: 3.0.3 (rolled back from 3.0.4/3.0.5 — see header)
  • Mitigation flag set in /etc/default/grub:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.dcdebugmask=0x10 amdgpu.gpu_recovery=1 amdgpu.mes=0 pcie_ports=native pci=ecrc=on"
    

What happened

On May 1, 2026 at 16:54:23 EDT, the desktop froze briefly, screen went black, then recovered. KWin notification: “Desktop effects were restarted due to a graphics reset.” Slack (the only foreground GL client) crashed and dumped core. No reboot, no other applications affected, desktop fully usable within seconds.

Under the surface, this is the kernel sequence — clean ring timeout cascade, MES-based reset failed, MODE2 reset succeeded:

May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0: Dumping IP State
May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0: Dumping IP State Completed
May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0: [drm] AMDGPU device coredump file has been created
May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0: ring gfx_0.0.0 timeout, signaled seq=17436214, emitted seq=17436217
May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0:  Process slack pid 17365 thread slack:cs0 pid 17389
May 01 16:54:23 framework kernel: amdgpu 0000:c2:00.0: Starting gfx_0.0.0 ring reset
May 01 16:54:25 framework kernel: amdgpu 0000:c2:00.0: MES failed to respond to msg=RESET
May 01 16:54:25 framework kernel: amdgpu 0000:c2:00.0: failed to reset legacy queue
May 01 16:54:25 framework kernel: amdgpu 0000:c2:00.0: reset via MES failed and try pipe reset -110
May 01 16:54:25 framework kernel: amdgpu 0000:c2:00.0: The CPFW hasn't support pipe reset yet.
May 01 16:54:25 framework kernel: amdgpu 0000:c2:00.0: Ring gfx_0.0.0 reset failed
May 01 16:54:25 framework kernel: amdgpu 0000:c2:00.0: GPU reset begin!. Source:  1
May 01 16:54:27 framework kernel: amdgpu 0000:c2:00.0: MES failed to respond to msg=REMOVE_QUEUE
May 01 16:54:27 framework kernel: amdgpu 0000:c2:00.0: failed to unmap legacy queue
May 01 16:54:28 framework kernel: [drm:gfx_v11_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: MODE2 reset
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: GPU reset succeeded, trying to resume
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: SMU is resuming...
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: SMU is resumed successfully!
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: [drm] DMUB hardware initialized: version=0x09003F00
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: GPU reset(1) succeeded!
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: [drm] device wedged, but recovered through reset
May 01 16:54:28 framework flatpak[17365]: [65:0501/165428.406364:ERROR:ui/gl/gl_fence_android_native_fence_sync.cc:67] eglDupNativeFenceFDANDROID duplication failure. Returned error=-1
May 01 16:54:28 framework kernel: amdgpu 0000:c2:00.0: [drm] *ERROR* Failed to initialize parser -125!

The KWin compositor restarted itself, GL clients lost their contexts. Slack (Electron/Chromium) couldn’t recover its GL fence and SIGTRAP’d. Everything else continued normally.
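
The escalation in the trace above (per-ring reset attempted, ring reset failed, full GPU reset via MODE2 succeeded) can be checked mechanically against a saved kernel log. This is a minimal sketch, assuming the message wording matches the trace above; other kernel versions may phrase these lines differently, and the stage names are my own labels, not amdgpu terminology.

```python
import re

# Stage labels are hypothetical names chosen for this sketch; the patterns are
# copied from the wording of the kernel messages in the trace above.
STAGES = [
    ("ring_reset_started", re.compile(r"Starting \S+ ring reset")),
    ("ring_reset_failed", re.compile(r"Ring \S+ reset failed")),
    ("full_reset_started", re.compile(r"GPU reset begin")),
    ("mode2_reset", re.compile(r"MODE2 reset")),
    ("full_reset_succeeded", re.compile(r"GPU reset\(\d+\) succeeded")),
]


def reset_stages(lines):
    """Return the reset stages found in the log lines, in order of first appearance."""
    seen = []
    for line in lines:
        for name, pattern in STAGES:
            if pattern.search(line) and name not in seen:
                seen.append(name)
    return seen


def recovered_by_full_reset(lines):
    """True when the per-ring reset failed but the full GPU reset then succeeded."""
    stages = reset_stages(lines)
    return "ring_reset_failed" in stages and "full_reset_succeeded" in stages
```

Feeding it the lines of `journalctl -k -b 0` output (for example via `open(path).read().splitlines()` on a saved copy) should show whether a given incident followed the same path as this one or stopped at an earlier stage.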

Trigger

The kernel names the offending process directly:

Process slack pid 17365 thread slack:cs0 pid 17389

This is Slack — the chat app. Not 3D, not video encode, not AI workload, not even active video conference at the time. Just Slack’s normal Chromium GPU process doing whatever it does at idle/low-activity. If a chat client can wedge the gfx ring on this hardware, the threshold for triggering ring timeouts under kernel 7.0’s amdgpu code path on gfx1151 is low enough that essentially any GL workload is a coin flip over time.
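
For anyone who wants to tally which processes are named across many incidents, here is a minimal sketch that pulls the ring, the sequence numbers, and the process out of kernel log lines. It assumes the message format shown in the trace above; the function name and dict keys are hypothetical, chosen for illustration.

```python
import re

# Patterns follow the amdgpu message wording seen in the trace above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"Process (?P<process>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<tid>\d+)"
)


def find_ring_timeouts(lines):
    """Return one dict per ring timeout, with the process named right after it."""
    events = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            events.append({
                "ring": timeout["ring"],
                "signaled": int(timeout["signaled"]),
                "emitted": int(timeout["emitted"]),
                "process": None,
                "pid": None,
            })
            continue
        proc = PROCESS_RE.search(line)
        # Attach the process line to the most recent timeout that lacks one.
        if proc and events and events[-1]["process"] is None:
            events[-1]["process"] = proc["process"]
            events[-1]["pid"] = int(proc["pid"])
    return events
```

Run over a saved `journalctl -k` dump, the difference `emitted - signaled` for each event gives a rough idea of how many submissions were still outstanding when the ring was declared hung (3 in the incident above).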

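I have not yet tested whether disabling hardware acceleration in Slack prevents recurrence. As far as I can tell, the ELECTRON_DISABLE_HARDWARE_ACCELERATION=1 variable I had planned to use is not among Electron’s documented environment variables, so launching Slack with Chromium’s --disable-gpu switch is probably the more reliable test. Either way that’s a userspace workaround, not a fix, but it would confirm or refute the Chromium-as-trigger hypothesis.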

What didn’t help: verified non-mitigations on this kernel build

Posting these explicitly because the same flags get recommended in older forum threads and on this kernel/distro combination they are doing nothing.

pci=ecrc=on — silent no-op

After reboot with the flag set, lspci -vvv shows the ECRC bits never flip:

AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-

The hardware advertises ECRC capability but the kernel never enables it. Reason: CONFIG_PCIE_ECRC is not built into Ubuntu’s stock 7.0 kernel. (This is not a custom kernel and I did nothing to turn the option off; the support simply is not there.) The cmdline flag is parsed, no code path exists to act on it, and the kernel moves on silently. As far as I can tell, this means the flag has done nothing on Ubuntu/Mint kernels built this way for the entire time it’s been in any of our GRUB lines.

If “ECRC fixed it” reports exist for Strix Halo, I believe they’re either on Fedora/Arch (assuming those distros build ECRC support into their kernels, which I have not checked) or they’re coincidence/placebo on Ubuntu derivatives.
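
To repeat the ECRC check on another machine without reading lspci output by eye, here is a minimal sketch. It assumes AERCap lines formatted like the ones above and a kernel config dump in the usual `CONFIG_X=y` format (on Ubuntu that is normally /boot/config-$(uname -r)); the function names are hypothetical.

```python
import re

FLAG_RE = re.compile(r"\b(ECRCGenCap|ECRCGenEn|ECRCChkCap|ECRCChkEn)([+-])")


def ecrc_flags(aercap_line):
    """Map each ECRC flag in an lspci AERCap line to True (+) or False (-)."""
    return {name: sign == "+" for name, sign in FLAG_RE.findall(aercap_line)}


def ecrc_capable_but_disabled(aercap_line):
    """True when the device advertises ECRC but neither generation nor checking is on."""
    flags = ecrc_flags(aercap_line)
    capable = flags.get("ECRCGenCap", False) or flags.get("ECRCChkCap", False)
    enabled = flags.get("ECRCGenEn", False) or flags.get("ECRCChkEn", False)
    return capable and not enabled


def kernel_has_ecrc(config_text):
    """True when a kernel config dump contains the line CONFIG_PCIE_ECRC=y."""
    return any(line.strip() == "CONFIG_PCIE_ECRC=y"
               for line in config_text.splitlines())
```

If `kernel_has_ecrc` returns False for the running kernel's config while `ecrc_capable_but_disabled` is True for the devices, that matches the situation described above: capable hardware, and a kernel with nothing to act on pci=ecrc=on.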

pcie_ports=native — paired with the above, also a no-op here

Without CONFIG_PCIE_ECRC, taking AER ownership doesn’t gain anything. The flag itself parses fine, but my understanding is the intended outcome (kernel-managed AER + active ECRC) requires both halves and we only get one.

amdgpu.mes=0 — already the default, so the flag does nothing

This is the verifiable one. mes defaults to 0 on this kernel build, so setting amdgpu.mes=0 on the kernel command line just sets the parameter to the value it already has. It changes nothing.

$ cat /sys/module/amdgpu/parameters/mes
0

$ modinfo amdgpu | grep '^parm:.*mes'
parm:           mes:Enable Micro Engine Scheduler (0 = disabled (default), 1 = enabled) (int)
parm:           mes_log_enable:Enable Micro Engine Scheduler log (0 = disabled (default), 1 = enabled) (int)
parm:           mes_kiq:Enable Micro Engine Scheduler KIQ (0 = disabled (default), 1 = enabled) (int)
parm:           uni_mes:Enable Unified Micro Engine Scheduler (0 = disabled, 1 = enabled(default) (int)

So amdgpu.mes=0 is a no-op on this hardware/kernel combination. If anyone reported it “fixed” something, it was either coincidence or some other change made at the same time.
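
The same comparison can be automated for every amdgpu flag on a command line. This is a minimal sketch, assuming modinfo descriptions mark the default with "(default" next to a numeric value, as the four lines above do; parameters described differently are ignored. The function names are hypothetical, and the declared default is only what the description string claims, which the driver may override per hardware generation.

```python
import re

PARM_RE = re.compile(r"^parm:\s+(?P<name>\w+):(?P<desc>.*)$")
# Matches "0 = disabled (default" as well as "1 = enabled(default".
DEFAULT_RE = re.compile(r"(?P<value>-?\d+)\s*=\s*[^,()]*\(default")


def declared_defaults(modinfo_text):
    """Map parameter name to the integer default its modinfo description declares."""
    defaults = {}
    for line in modinfo_text.splitlines():
        parm = PARM_RE.match(line.strip())
        if not parm:
            continue
        default = DEFAULT_RE.search(parm["desc"])
        if default:
            defaults[parm["name"]] = int(default["value"])
    return defaults


def redundant_params(cmdline, defaults, module="amdgpu"):
    """List module parameters on the command line that equal their declared default."""
    redundant = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep or not key.startswith(module + "."):
            continue
        name = key[len(module) + 1:]
        try:
            number = int(value, 0)  # base 0 accepts both "1" and "0x10"
        except ValueError:
            continue
        if defaults.get(name) == number:
            redundant.append(name)
    return redundant
```

With the output of `modinfo amdgpu` and the contents of /proc/cmdline as inputs, the GRUB line from this post should yield only `mes` as redundant; gpu_recovery and dcdebugmask are not flagged here because their descriptions are not among the lines parsed.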

What I don’t know is which code path is producing the MES failed to respond to msg=RESET and msg=REMOVE_QUEUE lines in the trace above. There are four MES-related parameters listed by modinfo (mes, mes_log_enable, mes_kiq, uni_mes), and uni_mes is the only one whose modinfo string says default=enabled — though I haven’t read the kernel source carefully enough to confirm that’s actually true at runtime on Strix Halo. The upstream documentation describes MES generically as a microcontroller for queue scheduling; it doesn’t specify which parameter controls the active path on this hardware.

If someone with deeper kernel knowledge can confirm or refute whether uni_mes is what’s getting invoked here, that would be useful for narrowing the search. The next experiment in my queue is to test amdgpu.uni_mes=0 deliberately, with a rollback prepared, since this would be a real change to how the GPU schedules work and not a free try. If anyone has already tried that on Strix Halo, I’d appreciate hearing the result before I do.

What’s working

  • amdgpu.gpu_recovery=1: confirmed working. The MODE2 reset completed successfully and recovered the GPU. Without this, today’s event would most likely have been a hard freeze.
  • Persistent journals: keeping /var/log/journal/ populated across reboots was the difference between “I think the GPU froze” and “here are the 30 kernel log lines that name the offending process and ring.” Strongly recommend any Strix Halo user on Linux turn this on (as root: mkdir -p /var/log/journal && systemd-tmpfiles --create --prefix /var/log/journal).

Frequency

This has happened repeatedly. Not all of the reboots below were planned, and there were a couple of earlier ones that are not in the list. My estimate is that the GPU issue occurs as often as every couple of hours under normal use and at least daily under light use. Video seems to exacerbate it but, as illustrated, is NOT the cause of it.

Boot timeline (last 5 boots)

-4 8c9ae0890a3a43cb8b8107b87a850394 Wed 2026-04-29 10:21:49 EDT Thu 2026-04-30 00:44:15 EDT
-3 449881d11f8e4d54ab4e6daa62a60525 Thu 2026-04-30 00:50:50 EDT Thu 2026-04-30 01:28:22 EDT
-2 ad4a296eee4641079302d40eeecffa76 Thu 2026-04-30 01:28:56 EDT Thu 2026-04-30 17:32:39 EDT
-1 77475da3163e417994cebb092d63a4f0 Thu 2026-04-30 17:39:13 EDT Thu 2026-04-30 18:08:56 EDT
 0 b9302c8a9a7d46648d3e53a85ffb299d Thu 2026-04-30 18:09:17 EDT Fri 2026-05-01 17:01:56 EDT

Today’s event happened during boot 0, mid-session. The system is still up.
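
To turn the boot list into session lengths, which makes the short sessions stand out, here is a minimal sketch. It assumes the column layout shown above (index, boot ID, then first and last entry as weekday, date, time, timezone) and that both timestamps on a line share a timezone; the function name is hypothetical.

```python
from datetime import datetime

STAMP = "%Y-%m-%d %H:%M:%S"


def boot_durations(list_boots_text):
    """Map boot index to a timedelta between its first and last journal entry."""
    durations = {}
    for line in list_boots_text.splitlines():
        parts = line.split()
        if len(parts) != 10:
            continue  # header lines or unexpected layouts are skipped
        try:
            index = int(parts[0])
            first = datetime.strptime(f"{parts[3]} {parts[4]}", STAMP)
            last = datetime.strptime(f"{parts[7]} {parts[8]}", STAMP)
        except ValueError:
            continue
        durations[index] = last - first
    return durations
```

Applied to the five boots above, boots -3 and -1 each lasted well under an hour, while the others ran for most of a day.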

Questions for the community

  1. Which MES-related parameter (mes, mes_kiq, uni_mes) actually controls the scheduler code path that produces “MES failed to respond” messages on Strix Halo? I can confirm mes=0 is the default, so setting it explicitly changes nothing, but I haven’t been able to find out which parameter is actually live on this hardware.
  2. Has anyone tested amdgpu.uni_mes=0 on Strix Halo? Does the GPU still initialize, and does it change the ring timeout pattern?
  3. For people running non-stock kernels (mainline, 6.18+ HWE-edge, custom builds with CONFIG_PCIE_ECRC=y): does the ring timeout pattern under mundane GL load improve, or is this purely an amdgpu code path issue that the PCIe-error mitigations do not address?
  4. For Framework support specifically: is there a recommended kernel/firmware combination for gfx1151 that the team knows is stable today on Ubuntu LTS, or are we in “wait for upstream” territory until the kernel-side amdgpu work for Strix Halo settles?