[Framework Desktop] amdgpu SMU deadlock / system freeze on Fedora 43 – gfx1151 / DCN 3.5 (Ryzen AI MAX+ 395)
Hardware & Software
Machine: Framework Desktop
CPU/GPU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (gfx1151)
RAM: 128 GB
OS: Fedora 43
Kernel: 6.19.11-200.fc43.x86_64
Mesa: 25.3.6
amd-gpu-firmware: 20260309
Symptom
The system hard-freezes or triggers a ~2-minute GPU reset (black screen) when using Chromium-based browsers (Microsoft Edge, Google Chrome) on GPU-heavy web pages. Confirmed triggers include OneDrive (hovering over video thumbnails, scrolling) and the Microsoft sign-in page. YouTube does not trigger the issue. Bluetooth also drops during the freeze, consistent with a full GPU bus hang.
With amdgpu.gpu_recovery=1 active, the system can sometimes recover on its own with a MODE2 ASIC reset taking approximately 2–3 minutes. Without it, a hard power-off is required.
Root Cause (identified via dmesg)
This is an SMU (System Management Unit) deadlock in the dcn35_smu_enable_pme_wa function, triggered during display pipe teardown when the GPU attempts to reset. The exact sequence:
1. Edge/Chrome GPU process triggers a GFX ring workload
2. The SMU is already busy with a pending command (SMN_C2PMSG_66:0x00000032)
3. The driver attempts to disable gfxoff to prepare for GPU recovery, but the SMU cannot respond
4. ring gfx_0.0.0 times out; MES also fails to respond to reset requests
5. A full MODE2 ASIC reset is eventually triggered
Key dmesg output:
amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu: Failed to disable gfxoff!
amdgpu: ring gfx_0.0.0 timeout, signaled seq=37512, emitted seq=37514
amdgpu: MES failed to respond to msg=RESET
amdgpu: failed to reset legacy queue
amdgpu: Ring gfx_0.0.0 reset failed
amdgpu: GPU reset begin!
WARNING: dcn35_smu.c:175 at dcn35_smu_send_msg_with_param+0x166/0x190 [amdgpu]
Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
dcn35_smu_enable_pme_wa+0x23/0x60 [amdgpu]
link_set_dpms_off
dcn31_reset_back_end_for_pipe
dcn31_reset_hw_ctx_wrap
dce110_apply_ctx_to_hw
dc_commit_state_no_check
dc_commit_streams
dm_suspend
amdgpu_device_pre_asic_reset
amdgpu: MODE2 reset
amdgpu: GPU reset succeeded
[drm] device wedged, but recovered through reset
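For anyone comparing their own logs against this one, here is a minimal Python sketch that picks out the signature lines from the excerpt above and checks whether the SMU complaint shows up before the ring timeout. The labels (smu_busy, ring_timeout, and so on) and the function names are my own shorthand, not anything from the driver, and the patterns only cover the messages quoted in this post.

```python
import re

# Patterns copied from the dmesg excerpt in this post; labels are illustrative.
SIGNATURES = [
    ("smu_busy", re.compile(r"SMU: I'm not done with your previous command")),
    ("gfxoff_fail", re.compile(r"Failed to disable gfxoff")),
    ("ring_timeout", re.compile(r"ring (\S+) timeout")),
    ("mes_no_response", re.compile(r"MES failed to respond to msg=(\S+)")),
    ("reset_begin", re.compile(r"GPU reset begin")),
    ("reset_ok", re.compile(r"GPU reset succeeded")),
]


def classify(lines):
    """Return (label, line) pairs for lines matching a known signature, in log order."""
    hits = []
    for line in lines:
        for label, pattern in SIGNATURES:
            if pattern.search(line):
                hits.append((label, line.strip()))
                break  # first matching signature wins
    return hits


def smu_busy_precedes_timeout(hits):
    """True if an SMU-busy line appears before the first ring timeout."""
    labels = [label for label, _ in hits]
    return (
        "smu_busy" in labels
        and "ring_timeout" in labels
        and labels.index("smu_busy") < labels.index("ring_timeout")
    )


# A few lines from the excerpt, used here as sample input.
SAMPLE = [
    "amdgpu: SMU: I'm not done with your previous command: "
    "SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000",
    "amdgpu: Failed to disable gfxoff!",
    "amdgpu: ring gfx_0.0.0 timeout, signaled seq=37512, emitted seq=37514",
    "amdgpu: MES failed to respond to msg=RESET",
    "amdgpu: GPU reset begin!",
]

hits = classify(SAMPLE)
for label, line in hits:
    print(f"{label:16} {line}")
print("SMU busy before ring timeout:", smu_busy_precedes_timeout(hits))
```

To run it against a real log, you could replace SAMPLE with sys.stdin and pipe in the kernel messages from the previous boot, for example from journalctl -k -b -1.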
Workarounds attempted
| Parameter | Effect |
| --- | --- |
| amdgpu.runpm=0 | No effect on this bug |
| amdgpu.gpu_recovery=1 | Enables MODE2 recovery instead of hard freeze; helpful but slow |
| amdgpu.dcdebugmask=0x10 | No effect |
| amdgpu.gfxoff=0 | No effect; bug is in dcn35_smu_enable_pme_wa, which bypasses this flag |
| power_dpm_force_performance_level=high | Delays onset but does not prevent the crash |
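Since several of these had no visible effect, it may be worth confirming that each option actually reached the driver. Below is a rough Python sketch, assuming the options were set on the kernel command line (not in modprobe.d) and that the loaded module exposes its parameters under /sys/module/amdgpu/parameters. The function names are made up for illustration. A name that is missing from that directory is one the loaded module does not expose, which may mean the option was ignored.

```python
from pathlib import Path


def amdgpu_params(cmdline):
    """Extract amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params


def read_param(param_dir, name):
    """Return the value the module reports for a parameter, or None if unreadable."""
    try:
        return (Path(param_dir) / name).read_text().strip()
    except OSError:
        return None


def report(cmdline, param_dir):
    """List (name, requested value, value reported by the module) tuples."""
    return [
        (name, wanted, read_param(param_dir, name))
        for name, wanted in amdgpu_params(cmdline).items()
    ]


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""  # not on Linux, or /proc is unavailable
    for name, wanted, actual in report(cmdline, "/sys/module/amdgpu/parameters"):
        # sysfs may print a value in a different base than it was requested in
        # (0x10 can show up as 16), so the two are shown side by side, not compared.
        state = "not exposed by the module" if actual is None else f"module reports {actual}"
        print(f"amdgpu.{name}: requested {wanted}, {state}")
```

Note that power_dpm_force_performance_level is a sysfs file under the GPU's device directory, not a module parameter, so it will not show up in this check.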
Assessment
This appears to be a kernel driver bug specific to gfx1151 / DCN 3.5 hardware. The dcn35_smu_enable_pme_wa function does not handle a busy SMU gracefully — when the SMU already has a pending command, any subsequent message sent during display teardown deadlocks the entire reset path. A proper fix would need to come from AMD’s kernel team adding a busy-check or timeout-skip in that function.
Framework support had me reset my mainboard to see if that would resolve the problem. I was able to reproduce the problem at will previously, but since the reset, things have been running smoothly.
Following advice from Framework support, I performed a chipset reset on the mainboard. This appeared to help initially — I could no longer reproduce the crash on OneDrive immediately afterward. However, the system crashed again today after opening a new tab in Edge and navigating to a website, confirming the issue is not specific to OneDrive or any particular web content.
Additional workarounds tried and confirmed ineffective:
amdgpu.mes=0 — no effect; MES scheduler is not the root cause
power_dpm_force_performance_level=high — delays onset but does not prevent the crash
Chipset reset — no effect
New finding in latest crash log:
A new error now appears before the ring timeout, suggesting the hang is occurring slightly earlier in the pipeline:
amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait
The full crash sequence remains the same — dcn35_smu_enable_pme_wa deadlocks the reset path, requiring a full MODE2 ASIC reset (~2 minutes to recover).
This is clearly a kernel driver bug in dcn35_smu.c that no amount of kernel parameters or hardware resets will fix. I have also filed this upstream. If anyone has found a working workaround or has seen a kernel patch addressing dcn35_smu_enable_pme_wa, please reply.
I switched to using Firefox and updated to Kernel 7.0.0.
So far, I haven’t experienced any freezes, but I also haven’t used the PC that much over the past few days.
I had freezes on my GMKtec 395 AI MAX machine at work but could never capture logs because the system locked up. (I am posting from my home PC, a Framework Desktop.) On the work PC I had a custom notification extension for GNOME installed, and it froze the system immediately whenever a notification from Edge came in while I was on godaddy.com. That was the only website that triggered it, though I am sure other sites or situations would have done so as well. So it turned out Edge wasn't crashing my system; it was this poorly written GNOME extension, and those can somehow cause a CPU halt. I don't have logs; all I know is that godaddy.com would trigger the halt. I'm not saying that's the issue in this thread, but I did have freezes that I blamed on Edge, and without logs it was a struggle to track down. The work system runs Fedora 43 with GNOME; my home PC runs Fedora 43 KDE, and I have had no problems with my Framework since I got it.
I’m on NixOS and have been running into system lockups recently. I’ve tried these kernels: 6.18 LTS, 6.12 LTS, 6.19 and 7. The GUI freezes for 30 seconds to a minute, and during that time I can still SSH in. Once the monitors all go black I can no longer connect over SSH and have to force a reset by holding the power button or unplugging the AC cord.
I do not do much that is GPU intensive (no gaming or LLMs), but I do use Chromium for YouTube, Reddit, DuckDuckGo, forums and other complex websites. I have four 1440p monitors, including a 5120x1440 ultrawide.
Log from last crash on 6.18 LTS:
Apr 19 12:24:07 fwdesktop kernel: cros-ec-dev cros-ec-dev.2.auto: Some logs may have been dropped...
Apr 19 12:24:12 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 19 12:24:12 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VPE!
Apr 19 12:24:12 fwdesktop kernel: [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
Apr 19 12:24:16 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 19 12:24:16 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VCN instance 0!
Apr 19 12:24:16 fwdesktop kernel: [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
Apr 19 12:24:18 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: Dumping IP State
Apr 19 12:24:22 fwdesktop kernel: r8169 0000:bf:00.0 enp191s0: NETDEV WATCHDOG: CPU: 6: transmit queue 0 timed out 5306 ms
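A side note on the "ret = -62" values in this log: they look like negated Linux errno codes, and on Linux errno 62 is ETIME ("Timer expired"). If that reading is right, the driver gave up waiting on the SMU, which fits the "not done with your previous command" lines. A small Python sketch for decoding such lines follows; the decode_ret name is mine, and the lookup uses the errno table of whatever machine runs it, so it is only meaningful on Linux.

```python
import errno
import os
import re

# Matches the "ret = -62" style suffix seen in the amdgpu error lines above.
RET_PATTERN = re.compile(r"ret = (-?\d+)")


def decode_ret(line):
    """Return (code, symbolic name, description) for a 'ret = N' line, or None."""
    match = RET_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # Kernel functions return negative errno values, so look up the absolute value.
    name = errno.errorcode.get(abs(code), "unknown")
    return code, name, os.strerror(abs(code))


line = "[drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62."
print(decode_ret(line))
```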
Forgot to add, I have the 128GB model and I have tried multiple SSDs with the same result. I have tried on firmware version 0.0.3.3 and 0.0.3.4.
I too got this problem when upgrading to NixOS 25.11. After experimenting with different kernel and firmware combinations I found that it was the Mesa upgrade that triggered the freezes. I’ve been using a 25.05 overlay to pull in the old Mesa 25.0.7 as a workaround.
I’ll try that later today when I get a chance. Just in case it matters, can you check which kernel and linux-firmware versions you are using so I can try to replicate your working setup?
My kernel (6.17.9) and firmware (20251111) are still stuck on 25.05 too, as that was the last config I tried when I got it working and I was too sick of rebooting to try the Mesa downgrade isolated with a current kernel.
I’m doing this in my (probably very unidiomatic) system flake:
Seems to be working, thank you so much. If it keeps working, I’ll edit this post with the version of mesa, linux and linux-firmware tomorrow so people on other distros can fix it too.
I had this issue on the FW16, and there is an actual kernel/driver/firmware bug in 6.18 and 6.19 that causes these problems. The best option is to either try 7.0 (I have not had problems yet after hours of gaming) or roll back to an earlier version.