SMU deadlock / system freeze on Fedora 43

Best option is either try 7.0 (I have not had problems yet after hours of gaming) […]

At least on Fedora 43/44-based systems that does not work (for me). I can replicate the bug on kernel 7.0.0 within 5 minutes. A downgrade to a 25.0.x version of Mesa also did not help.

I was having this issue multiple times a day until I switched from Brave (Chromium-based) to Firefox. It was always triggered by the Chromium browser, so as a short-term workaround you can try the same thing until the root cause is found.


How is it going for you? CachyOS still crashes occasionally for me on firmware 20251111.

CachyOS with 7.0.1 and Firefox instead of Chrome is now stable for me.

SMU deadlock investigation — VPE idle power gating as root cause

My system: Framework Desktop, AMD Ryzen AI Max 300 (Strix Halo, gfx1151), BIOS 03.04, PMFW 100.6.0, Kernel 7.0.1 (CachyOS), Wayland/KDE Plasma

What I found
Using `dynamic_debug` tracing on `smu_cmn.c`, I traced the exact SMU message sequence leading to the freeze.
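
For reference, enabling that tracing is one line through the standard dynamic_debug control file:

# Turn on the pr_debug output in the SMU message helpers
echo 'module amdgpu file smu_cmn.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control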

The root cause is VPE (Video Processing Engine) idle power gating: the `PowerDownVpe` / `PowerUpVpe` SMU message cycling triggered by browser VAAPI hardware video decode (Brave/Chrome/Firefox) corrupts the PMFW 100.6.0 internal state, leading to `resp_reg` stuck at 0 and the familiar cascade.

Key observations:

  • VPE is always involved — every crash includes PowerDownVpe (msg 0x32). VCN alone cycles fine.
  • Timing doesn’t matter — I tested 3ms, 60ms, and 200ms gaps between SMU messages. All crash.
  • A single cycle seems enough — one PowerDown→PowerUp VPE cycle can corrupt the firmware state.

What I’m testing
I’m running a mitigation that disables VPE idle power gating: VPE stays powered during normal use, eliminating the PowerDownVpe/PowerUpVpe cycling entirely. Suspend/resume is not affected (hw_fini/hw_init handle that separately). Power cost is ~0.5-1W idle.

So far: 24+ hours stable with HW video decode enabled, heavy YouTube usage with rapid video switching — a workload that previously crashed within 5-10 minutes. Zero SMU errors, zero GPU resets.

Next steps
I’m still validating under different workloads (compute with llama-server, sleep cycles, multi-monitor). If stability holds I’ll share the kernel patch and full instructions. It’s a 3-line change with a module parameter (`amdgpu.no_vpe_idle_pg=1`) that can be enabled via kernel command line.

Stay tuned.

No problems for me. I downgraded the kernel and mesa too.

Workaround for SMU deadlock / GPU freeze on Strix Halo — disable VPE idle power gating

TL;DR: A 3-line kernel patch adds an amdgpu.no_vpe_idle_pg=1 module parameter that prevents VPE (Video Processing Engine) from cycling power during normal use. This eliminates the SMU deadlock that causes hard freezes during browser hardware video decode. 48+ hours stable with YouTube HW decode on a setup that previously crashed within 5-10 minutes.

System

  • Framework Desktop, AMD Ryzen AI Max 300 Series (Strix Halo, gfx1151)

  • BIOS INSYDE 03.04, PMFW 100.6.0

  • Kernel 7.0.1 (CachyOS), Wayland/KDE Plasma

  • Brave/Chrome with VAAPI hardware video decode enabled

The problem

When browsers use VAAPI hardware video decode (enabled by default since Chromium 143 / Brave 1.85 on Wayland), the amdgpu driver rapidly cycles VPE power state (PowerDownVpe / PowerUpVpe) via SMU messages every time a video starts, stops, or changes. The PMFW 100.6.0 firmware cannot handle this cycling — even a single PowerDown→PowerUp cycle can leave the SMU in a corrupted state where resp_reg gets stuck at 0. A few seconds later, the next SMU message times out, cascading into:


SMU: No response msg_reg: 32 resp_reg: 0

Failed to power gate VPE!

Failed to disable gfxoff!

ring gfx_0.0.0 timeout

GPU reset begin!

This is followed by a hard freeze requiring a power cycle.
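
Since the trigger is the VAAPI decode path, it is worth confirming your browser is actually using it: chrome://gpu in Chromium/Brave should show “Video Decode: Hardware accelerated”, and vainfo (from libva-utils) shows whether the Mesa driver exposes decode entrypoints at all. A minimal check:

# List VA-API decode entrypoints exposed by the driver (requires libva-utils)
vainfo 2>/dev/null | grep -iE "driver version|VAEntrypointVLD"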

Root cause analysis

Using dynamic_debug tracing on smu_cmn.c, I captured the exact SMU message sequence before crashes. Key findings:

  1. VPE is always involved — every crash includes PowerDownVpe (msg 0x32). VCN alone (PowerDownVcn0/Vcn1) cycles fine without crashes.

  2. Timing doesn’t matter — tested settlement delays of 3ms (stock), 60ms, and 200ms between consecutive SMU messages. All crash. The bug is not about messages arriving “too fast.”

  3. A single cycle is enough — one PowerDownVpe followed by one PowerUpVpe can corrupt the firmware state. It doesn’t require accumulation.

  4. VCN cycling without VPE is stable — with VPE idle power gating disabled, VCN0 and VCN1 cycle freely (40+ transitions in 2 minutes) with zero errors.
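
For the last finding, a rough way to quantify that cycling from the journal (assuming the smu_cmn tracing is enabled and the traced messages carry the PowerUp/PowerDown names shown later in this thread):

# Count VCN power transitions logged over the last 2 minutes
journalctl -k --since "-2min" | grep -cE "Power(Up|Down)Vcn"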

The fix

The patch adds a module parameter amdgpu.no_vpe_idle_pg. When set to 1, vpe_ring_end_use() skips scheduling the idle work handler, so VPE stays powered after its first use. Suspend/resume is NOT affected — hw_fini / hw_init handle that path separately.

Power cost: ~0.5-1W idle (VPE block stays clocked). Negligible on a desktop system.
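
If you want to sanity-check that power cost yourself, the amdgpu hwmon interface usually exposes a package power reading (an extra check of mine, not part of the original validation; on some ASICs the file is power1_input rather than power1_average, and the value is in microwatts):

# Compare idle power with the parameter off vs. on (microwatts)
cat /sys/class/hwmon/hwmon*/power1_average 2>/dev/null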

Patch (applies to kernel 7.0.x, should apply to 6.18+ with minor fuzz):


--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -248,6 +248,7 @@
 int amdgpu_umsch_mm_fwlog;
 int amdgpu_rebar = -1; /* auto */
 int amdgpu_user_queue = -1;
+int amdgpu_no_vpe_idle_pg;
 uint amdgpu_hdmi_hpd_debounce_delay_ms;
@@ -424,6 +425,20 @@
 module_param_named_unsafe(ip_block_mask, amdgpu_ip_block_mask, uint, 0444);
 
 /**
+ * DOC: no_vpe_idle_pg (int)
+ * Disable VPE (Video Processing Engine) idle power gating (1 = VPE stays
+ * powered during normal use, 0 = normal power gating). Workaround for AMD
+ * Strix Halo PMFW 100.6.0 where PowerDownVpe/PowerUpVpe cycling causes an
+ * SMU deadlock during browser hardware video decode. Suspend/resume is not
+ * affected - hw_fini/hw_init handle that path separately. The default is 0
+ * (normal power gating behavior).
+ */
+MODULE_PARM_DESC(no_vpe_idle_pg,
+	"Disable VPE idle power gating (1 = skip, 0 = normal). "
+	"Workaround for Strix Halo PMFW 100.6.0 SMU deadlock (default: 0)");
+module_param_named(no_vpe_idle_pg, amdgpu_no_vpe_idle_pg, int, 0444);
+
+/**
  * DOC: bapm (int)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -269,6 +269,7 @@
 extern int amdgpu_wbrf;
 extern int amdgpu_user_queue;
+extern int amdgpu_no_vpe_idle_pg;
 extern uint amdgpu_hdmi_hpd_debounce_delay_ms;
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -896,7 +896,8 @@
 {
 	struct amdgpu_device *adev = ring->adev;
 
-	schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
+	if (!amdgpu_no_vpe_idle_pg)
+		schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
 }

How to use (without rebuilding kernel)

If your distro provides a way to patch the kernel (DKMS, out-of-tree module rebuild, or custom kernel), apply the patch above and boot with:


amdgpu.no_vpe_idle_pg=1

Or in /etc/modprobe.d/amdgpu-vpe.conf:


options amdgpu no_vpe_idle_pg=1
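
Note that if amdgpu is loaded from the initramfs (the usual case), the modprobe.d option only takes effect after regenerating it; the command depends on your distro, for example:

# Fedora and other dracut-based distros
sudo dracut --force
# Arch-based installs using mkinitcpio
sudo mkinitcpio -P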

How to verify it’s working


# Enable SMU message tracing

echo 'module amdgpu file smu_cmn.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Watch power messages (open/close YouTube videos in browser)

journalctl -kf -o short-precise | grep -iE "PowerUp|PowerDown"

Expected: PowerUpVcn0, PowerDownVcn0, PowerUpVcn1, PowerDownVcn1 cycling normally. No PowerUpVpe or PowerDownVpe messages after initial boot.
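
As an extra check beyond the steps above, scanning the current boot for the deadlock signature should come back empty on a healthy patched system:

# Should print nothing if the SMU never wedged this boot
journalctl -k -b | grep -iE "not done with your previous command|Failed to power gate VPE"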

Request

If you have a Framework Desktop (Strix Halo) experiencing SMU deadlock / GPU freezes, please test this patch and report results. With 3-5 confirmations from different users I’ll submit it upstream to the amd-gfx mailing list.


Hey Domenico_Crupi, appreciate the post and patch. I’ve opened a BZ with Fedora and have been banging my head against this for a while. I’m compiling your patch into an f44 (6.19.13-300.fc44.x86_64) kernel at the moment. The BZ is open here: 2457514 – “6.19.11 on a Ryzen AI Max+ 395 (Strix Halo, gfx1151) running Fedora 43, the system locks up hard within a few hours of any GPU workload”.

I initially ran into this while running some large LLM workloads and browsers, so I wasn’t sure which was causing the breakdown. Claude Code diagnostics look like they were barking up the wrong tree.

Adding another data point — same SMU deadlock symptoms (Fedora 42 / kernel 6.16.8)

Hi all — contributing another data point. I’m hitting what looks like the exact same issue on a Framework Desktop. Two hard freezes today within ~30 minutes, both requiring a forced power-off.


Hardware

  • Machine: Framework Desktop, AMD Ryzen AI MAX+ 395 (Strix Halo)

  • GPU: Radeon 8060S (gfx1151)

  • RAM: 128 GB unified

  • Storage: NVMe


Software

  • OS: Fedora 42

  • Kernel: 6.16.8-200.fc42.x86_64

  • Desktop: GNOME (Wayland)

  • Browser at time of both freezes: Brave (Chromium-based)

This thread mostly references Fedora 43 / kernel 6.17+, so adding a slightly older data point to help narrow the regression window. Symptoms appear identical.


Symptoms

  • Full system freeze

  • No mouse or keyboard input

  • SSH unavailable

  • Bluetooth devices disconnect immediately at freeze

  • No recovery after several minutes

  • Forced power-off required

  • No kernel panic or shutdown logs


Evidence from journalctl

Both crashes show the journal cutting off mid-stream. On next boot:

systemd-journald[859]: File /var/log/journal/.../user-1000.journal corrupted or
uncleanly shut down, renaming and replacing.

No evidence of:

  • kernel oops

  • MCE

  • amdgpu GPU hang

  • OOM

  • soft-lockup or hung-task warnings

  • coredumps

Logging stops entirely, consistent with the SMU deadlock pattern described in-thread.
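
For pulling logs after the fact, the crashed boot can be inspected directly from the next one, e.g.:

# List retained boots, then dump the tail of kernel messages from the boot that froze
journalctl --list-boots
journalctl -k -b -1 --no-pager | tail -50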


Browser logs prior to crash

Brave was emitting this repeatedly (every few seconds):

brave-browser.desktop[4305]: [4347:4472:.../display.cc:275]
ERROR: Frame latency is negative: -0.067 ms


Crash timeline (EDT)

#   Last journal entry   Recovery (forced power-off → boot)
1   21:29:32             21:32:17
2   21:58:43             22:00:12

Between crashes: normal browsing with multiple tabs, some with video / autoplay thumbnails (consistent with VPE idle PG trigger discussed earlier).


Current kernel cmdline

BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.16.8-200.fc42.x86_64 root=UUID=... ro
rootflags=subvol=root rhgb quiet amdgpu.gfxoff=0

Notes:

  1. amdgpu.gfxoff=0
    Kernel logs:
amdgpu: unknown parameter 'gfxoff' ignored

This confirms:

  • not a valid exposed module parameter

  • effectively a no-op

  • not relevant to this issue

  2. Not currently using amdgpu.gpu_recovery=1
    Likely why freezes result in hard hangs rather than the recovery behavior others have observed.

Next steps

  • Add amdgpu.gpu_recovery=1 to allow recovery instead of hard freeze (grubby sketch after this list)

  • Switch primary browsing to Firefox as workaround

  • Defer amdgpu.no_vpe_idle_pg=1 until upstream / Fedora availability

  • Stay on current kernel for controlled comparison
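
For the first item, one way to make the cmdline change on Fedora is grubby (a sketch; amdgpu.gpu_recovery=1 itself is the documented recovery toggle):

# Append the parameter to every installed kernel's cmdline, then verify
sudo grubby --update-kernel=ALL --args="amdgpu.gpu_recovery=1"
sudo grubby --info=ALL | grep -o "amdgpu[^ ]*"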


If additional logs would help (full journalctl -b -1, dmesg, BIOS version, etc.), I can provide them. Also open to testing patched kernels if there’s a recommended build or COPR.

Appreciate the work digging into the dcn35_smu / VPE idle PG root cause.

@Myles_Baker You’re running a known-broken setup. You should upgrade to kernel 6.18.4+ and ensure that linux-firmware is at 20260110 or newer. It could be that your versions are old enough not to include the broken linux-firmware and kernel, but I’d upgrade either way.

Thanks for sharing the patch and methodology — the dynamic_debug trace approach mirrors how I got my own investigation moving on the same silicon.

One observation worth flagging: this likely addresses one trigger path, but the same end-state seems reachable via others. I’ve been chasing a similar silent hang on Framework Desktop / PMFW 100.6.0 under sustained KFD compute (vLLM / llama.cpp inference) on a headless server — no display server, no browser, no video decode. With pr_debug on smu_cmn.c I see VPE messages only at amdgpu probe init (zero runtime cycling during workload), and resp_reg stays at PPSMC_Result_OK right up to the freeze. So the trigger you’ve identified can’t apply on my path, but I still wedge with a near-identical end-state.
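
A quick way to make that “probe-init only” pattern visible is monotonic timestamps, which pin every VPE message to seconds-since-boot (assuming the smu_cmn tracing is enabled as described upthread):

# Init-only VPE messages will all cluster in the first seconds of boot
journalctl -k -o short-monotonic | grep -iE "Power(Up|Down)Vpe"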

The PMFW bug looks reachable from multiple trigger paths. Your patch should cleanly fix the display/video class; the compute class likely needs a separate firmware-side fix from AMD. Compute-side details tracked at ROCm/ROCm#6165.


Update: Crupi’s no_vpe_idle_pg patch (6.19.13-300.vpe1.fc44, amdgpu.no_vpe_idle_pg=1) does not fully fix this bug.

Reproduced 2026-04-27 22:40 EDT under pure inference load — no VAAPI, no Electron, no concurrent GPU clients. vLLM (TheRock ROCm image, Gemma 4 31B AWQ-INT4) hit a HW Exception on a compute queue (comp_1.1.1) ~23s after model load completed. Cascade:

HW Exception by GPU node-1 reason :GPU Hang
[drm] *ERROR* Failed to initialize parser -125
ring vpe test failed (-110)             # post-reset, collateral
Dpm disable jpeg failed, ret = -62
Failed to power gate VCN instance 0 / 1
SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000021
Failed to retrieve enabled ppfeatures   # repeating every 5s

Notes:

  • SMU stuck-command code is 0x21 on the patched kernel (was 0x32 unpatched).
    Same wedge state, different message ID.
  • No “Failed to power gate VPE!” in the cascade — consistent with VPE idle-pg
    being gated off by the patch. Wedge entered via JPEG/VCN power-gate path.
  • Conclusion: at least two triggers reach the same SMU mailbox wedge:
    (1) VAAPI → VPE idle power-gate [closed by Crupi patch]
    (2) Inference HW exception → MES REMOVE_QUEUE fail → MODE2 reset
    → JPEG/VCN power-gate fail → SMU wedge [still open]
  • Clean shutdown hung ~8 hours after wedge; manual power-cycle required.
    Same “hard reset only” recovery as 0x32.

@Domenico_Crupi your fix works! Will report here if the problem arises again.

A bit of trivia: NixOS unstable, triggered by videos in Brave. Qutebrowser never triggered this problem; it looks like it doesn’t use hardware acceleration at all. Until today the system just silently hung with no traces in journalctl -b -1; today I got:

May 02 18:00:35.531201 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
May 02 18:00:35.531691 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to power gate VPE!
May 02 18:00:35.531875 vglfr kernel: [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
May 02 18:00:40.824245 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
May 02 18:00:40.824704 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to export SMU metrics table!
May 02 18:00:41.021179 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
May 02 18:00:46.326174 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
May 02 18:00:46.334965 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to disable gfxoff!

The boot before this one also hung on video, but there were no AMD-related errors at all; the last AMD message was at bootup.

The problem appeared sometime in late December 2025, despite heavy use of Brave since the purchase (early October).

Confirming Crupi’s no_vpe_idle_pg patch on a Framework Desktop running Arch.

Hardware: Framework Desktop, AMD Ryzen AI Max 300 (Strix Halo), Radeon 8060S iGPU (dcn35, VCN 4.0.5), 128 GB RAM, BIOS 0.0.3.4.

Failure history before the patch: four freezes in ~4 days, same exact signature each time —

amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032
amdgpu: Failed to power gate VPE!
amdgpu: vpe_set_powergating_state ... ret = -62

…then VCN power-ungate fail and PSP gfx command fail. Timeline: 2026-04-25 16:46, 2026-04-26 21:04, 2026-04-26 21:49, 2026-04-29 19:10. Both linux-lts (6.18.23, 6.18.24) and mainline linux (6.19.14) hit it.

Defenses that did NOT prevent recurrence:

  • amdgpu.dcdebugmask=0x10
  • BIOS 0.0.3.3 → 0.0.3.4
  • Kernel swap (mainline 6.19 ↔ lts 6.18)

Best time-to-fail on linux-lts + dcdebugmask was 23h 22min.

Fix applied 2026-04-30:

  • Built linux-lts 6.18.25 with the 3-line no_vpe_idle_pg patch as a parallel package (linux-lts-strix).
  • Added amdgpu.no_vpe_idle_pg=1 to GRUB cmdline (kept dcdebugmask=0x10 alongside, belt-and-suspenders).
  • Switched default browser from Chromium to Firefox to sidestep the VAAPI trigger.

Result so far:

  • Current uptime on the patched kernel: 1d 22h 31min with zero SMU/dcn35/VPE signature hits in the journal.
  • Previous boot on the same patched kernel ran clean for ~2.7 days before a normal shutdown — also zero hits.
  • That puts cumulative clean soak at roughly 4.5 days across two boots, against a prior baseline of a freeze every 23h or sooner.

Soak is still short relative to the failure rate, but the fact that two consecutive boots on the patched kernel hit zero signature events is a strong result when every prior config produced them within a day. I will report back if anything regresses.
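
For anyone running a similar soak, the signature scan across all retained boots can be done in one line (journalctl --grep requires systemd built with PCRE2, which most distro builds include):

# Scan kernel messages from every retained boot for the wedge signature
sudo journalctl -k --grep "not done with your previous command" --no-pager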

Thanks for tracking this down.