SMU deadlock / system freeze on Fedora 43

Best option is either try 7.0 (I have not had problems yet after hours of gaming) […]

At least on Fedora 43/44-based systems that does not work (for me). I can replicate the bug on kernel 7.0.0 within 5 minutes. A downgrade to a 25.0.x version of Mesa also did not help.

I was having this issue multiple times a day until I switched from Brave (Chromium-based) to Firefox. It was always triggered by the Chromium browser, so as a short-term workaround you can try the same thing until the root cause is found.


How is it going for you? CachyOS still crashes occasionally for me on firmware 20251111.

CachyOS with 7.0.1 and Firefox instead of Chrome is now stable for me.

SMU deadlock investigation — VPE idle power gating as root cause

My system: Framework Desktop, AMD Ryzen AI Max 300 (Strix Halo, gfx1151), BIOS 03.04, PMFW 100.6.0, Kernel 7.0.1 (CachyOS), Wayland/KDE Plasma

What I found
Using `dynamic_debug` tracing on `smu_cmn.c`, I traced the exact SMU message sequence leading to the freeze.
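
For reference, enabling that tracing is one line through the standard dynamic_debug control file:

# Turn on the pr_debug output in the SMU message helpers
echo 'module amdgpu file smu_cmn.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control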

The root cause is VPE (Video Processing Engine) idle power gating: the `PowerDownVpe` / `PowerUpVpe` SMU message cycling triggered by browser VAAPI hardware video decode (Brave/Chrome/Firefox) corrupts the PMFW 100.6.0 internal state, leading to `resp_reg` stuck at 0 and the familiar cascade.

Key observations:

  • VPE is always involved — every crash includes PowerDownVpe (msg 0x32). VCN alone cycles fine.
  • Timing doesn’t matter — I tested 3ms, 60ms, and 200ms gaps between SMU messages. All crash.
  • A single cycle seems enough — one PowerDown→PowerUp VPE cycle can corrupt the firmware state.

What I’m testing
I’m running a mitigation that disables VPE idle power gating: VPE stays powered during normal use, eliminating the PowerDownVpe/PowerUpVpe cycling entirely. Suspend/resume is not affected (hw_fini/hw_init handle that separately). Power cost is ~0.5-1W idle.

So far: 24+ hours stable with HW video decode enabled, heavy YouTube usage with rapid video switching — a workload that previously crashed within 5-10 minutes. Zero SMU errors, zero GPU resets.

Next steps
I’m still validating under different workloads (compute with llama-server, sleep cycles, multi-monitor). If stability holds I’ll share the kernel patch and full instructions. It’s a 3-line change with a module parameter (`amdgpu.no_vpe_idle_pg=1`) that can be enabled via kernel command line.

Stay tuned.

No problems for me. I downgraded the kernel and mesa too.

Workaround for SMU deadlock / GPU freeze on Strix Halo — disable VPE idle power gating

TL;DR: A 3-line kernel patch adds an amdgpu.no_vpe_idle_pg=1 module parameter that prevents VPE (Video Processing Engine) from cycling power during normal use. This eliminates the SMU deadlock that causes hard freezes during browser hardware video decode. 48+ hours stable with YouTube HW decode on a setup that previously crashed within 5-10 minutes.

System

  • Framework Desktop, AMD Ryzen AI Max 300 Series (Strix Halo, gfx1151)

  • BIOS INSYDE 03.04, PMFW 100.6.0

  • Kernel 7.0.1 (CachyOS), Wayland/KDE Plasma

  • Brave/Chrome with VAAPI hardware video decode enabled

The problem

When browsers use VAAPI hardware video decode (enabled by default since Chromium 143 / Brave 1.85 on Wayland), the amdgpu driver rapidly cycles VPE power state (PowerDownVpe / PowerUpVpe) via SMU messages every time a video starts, stops, or changes. The PMFW 100.6.0 firmware cannot handle this cycling — even a single PowerDown→PowerUp cycle can leave the SMU in a corrupted state where resp_reg gets stuck at 0. A few seconds later, the next SMU message times out, cascading into:


SMU: No response msg_reg: 32 resp_reg: 0

Failed to power gate VPE!

Failed to disable gfxoff!

ring gfx_0.0.0 timeout

GPU reset begin!

This is followed by a hard freeze requiring a power cycle.
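
Since the trigger is the VAAPI decode path, it is worth confirming your browser is actually using it: chrome://gpu in Chromium/Brave should show “Video Decode: Hardware accelerated”, and vainfo (from libva-utils) shows whether the Mesa driver exposes decode entrypoints at all. A minimal check:

# List VA-API decode entrypoints exposed by the driver (requires libva-utils)
vainfo 2>/dev/null | grep -iE "driver version|VAEntrypointVLD"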

Root cause analysis

Using dynamic_debug tracing on smu_cmn.c, I captured the exact SMU message sequence before crashes. Key findings:

  1. VPE is always involved — every crash includes PowerDownVpe (msg 0x32). VCN alone (PowerDownVcn0/Vcn1) cycles fine without crashes.

  2. Timing doesn’t matter — tested settlement delays of 3ms (stock), 60ms, and 200ms between consecutive SMU messages. All crash. The bug is not about messages arriving “too fast.”

  3. A single cycle is enough — one PowerDownVpe followed by one PowerUpVpe can corrupt the firmware state. It doesn’t require accumulation.

  4. VCN cycling without VPE is stable — with VPE idle power gating disabled, VCN0 and VCN1 cycle freely (40+ transitions in 2 minutes) with zero errors.
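
For the last finding, a rough way to quantify that cycling from the journal (assuming the smu_cmn tracing is enabled and the traced messages carry the PowerUp/PowerDown names shown later in this thread):

# Count VCN power transitions logged over the last 2 minutes
journalctl -k --since "-2min" | grep -cE "Power(Up|Down)Vcn"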

The fix

The patch adds a module parameter amdgpu.no_vpe_idle_pg. When set to 1, vpe_ring_end_use() skips scheduling the idle work handler, so VPE stays powered after its first use. Suspend/resume is NOT affected — hw_fini / hw_init handle that path separately.

Power cost: ~0.5-1W idle (VPE block stays clocked). Negligible on a desktop system.
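
If you want to sanity-check that power cost yourself, the amdgpu hwmon interface usually exposes a package power reading (an extra check of mine, not part of the original validation; on some ASICs the file is power1_input rather than power1_average, and the value is in microwatts):

# Compare idle power with the parameter off vs. on (microwatts)
cat /sys/class/hwmon/hwmon*/power1_average 2>/dev/null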

Patch (applies to kernel 7.0.x, should apply to 6.18+ with minor fuzz):


--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -248,6 +248,7 @@
 int amdgpu_umsch_mm_fwlog;
 int amdgpu_rebar = -1; /* auto */
 int amdgpu_user_queue = -1;
+int amdgpu_no_vpe_idle_pg;
 uint amdgpu_hdmi_hpd_debounce_delay_ms;
@@ -424,6 +425,20 @@
 module_param_named_unsafe(ip_block_mask, amdgpu_ip_block_mask, uint, 0444);
 
 /**
+ * DOC: no_vpe_idle_pg (int)
+ * Disable VPE (Video Processing Engine) idle power gating (1 = VPE stays
+ * powered during normal use, 0 = normal power gating). Workaround for AMD
+ * Strix Halo PMFW 100.6.0 where PowerDownVpe/PowerUpVpe cycling causes an
+ * SMU deadlock during browser hardware video decode. Suspend/resume is not
+ * affected - hw_fini/hw_init handle that path separately. The default is 0
+ * (normal power gating behavior).
+ */
+MODULE_PARM_DESC(no_vpe_idle_pg,
+	"Disable VPE idle power gating (1 = skip, 0 = normal). "
+	"Workaround for Strix Halo PMFW 100.6.0 SMU deadlock (default: 0)");
+module_param_named(no_vpe_idle_pg, amdgpu_no_vpe_idle_pg, int, 0444);
+
+/**
  * DOC: bapm (int)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -269,6 +269,7 @@
 extern int amdgpu_wbrf;
 extern int amdgpu_user_queue;
+extern int amdgpu_no_vpe_idle_pg;
 extern uint amdgpu_hdmi_hpd_debounce_delay_ms;
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -896,7 +896,8 @@
 {
 	struct amdgpu_device *adev = ring->adev;
 
-	schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
+	if (!amdgpu_no_vpe_idle_pg)
+		schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
 }

How to use (without rebuilding kernel)

If your distro provides a way to patch the kernel (DKMS, out-of-tree module rebuild, or custom kernel), apply the patch above and boot with:


amdgpu.no_vpe_idle_pg=1

Or in /etc/modprobe.d/amdgpu-vpe.conf:


options amdgpu no_vpe_idle_pg=1
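
Note that if amdgpu is loaded from the initramfs (the usual case), the modprobe.d option only takes effect after regenerating it; the command depends on your distro, for example:

# Fedora and other dracut-based distros
sudo dracut --force
# Arch-based installs using mkinitcpio
sudo mkinitcpio -P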

How to verify it’s working


# Enable SMU message tracing

echo 'module amdgpu file smu_cmn.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Watch power messages (open/close YouTube videos in browser)

journalctl -kf -o short-precise | grep -iE "PowerUp|PowerDown"

Expected: PowerUpVcn0, PowerDownVcn0, PowerUpVcn1, PowerDownVcn1 cycling normally. No PowerUpVpe or PowerDownVpe messages after initial boot.
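
As an extra check beyond the steps above, scanning the current boot for the deadlock signature should come back empty on a healthy patched system:

# Should print nothing if the SMU never wedged this boot
journalctl -k -b | grep -iE "not done with your previous command|Failed to power gate VPE"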

Request

If you have a Framework Desktop (Strix Halo) experiencing SMU deadlock / GPU freezes, please test this patch and report results. With 3-5 confirmations from different users I’ll submit it upstream to the amd-gfx mailing list.


Hey Domenico_Crupi, appreciate the post and patch. I’ve opened a BZ with Fedora and have been banging my head against this for a while. I’m compiling your patch into an f44 (6.19.13-300.fc44.x86_64) kernel at the moment. The BZ is open here: 2457514 – “6.19.11 on a Ryzen AI Max+ 395 (Strix Halo, gfx1151) running Fedora 43, the system locks up hard within a few hours of any GPU workload”.

I initially ran into this while running some large LLM workloads and browsers, so I wasn’t sure which was causing the breakdown. Claude Code diagnostics look like they were barking up the wrong tree.

Adding another data point — same SMU deadlock symptoms (Fedora 42 / kernel 6.16.8)

Hi all — contributing another data point. I’m hitting what looks like the exact same issue on a Framework Desktop. Two hard freezes today within ~30 minutes, both requiring a forced power-off.


Hardware

  • Machine: Framework Desktop, AMD Ryzen AI MAX+ 395 (Strix Halo)

  • GPU: Radeon 8060S (gfx1151)

  • RAM: 128 GB unified

  • Storage: NVMe


Software

  • OS: Fedora 42

  • Kernel: 6.16.8-200.fc42.x86_64

  • Desktop: GNOME (Wayland)

  • Browser at time of both freezes: Brave (Chromium-based)

This thread mostly references Fedora 43 / kernel 6.17+, so adding a slightly older data point to help narrow the regression window. Symptoms appear identical.


Symptoms

  • Full system freeze

  • No mouse or keyboard input

  • SSH unavailable

  • Bluetooth devices disconnect immediately at freeze

  • No recovery after several minutes

  • Forced power-off required

  • No kernel panic or shutdown logs


Evidence from journalctl

Both crashes show the journal cutting off mid-stream. On next boot:

systemd-journald[859]: File /var/log/journal/.../user-1000.journal corrupted or
uncleanly shut down, renaming and replacing.

No evidence of:

  • kernel oops

  • MCE

  • amdgpu GPU hang

  • OOM

  • soft-lockup or hung-task warnings

  • coredumps

Logging stops entirely, consistent with the SMU deadlock pattern described in-thread.
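
For pulling logs after the fact, the crashed boot can be inspected directly from the next one, e.g.:

# List retained boots, then dump the tail of kernel messages from the boot that froze
journalctl --list-boots
journalctl -k -b -1 --no-pager | tail -50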


Browser logs prior to crash

Brave was emitting this repeatedly (every few seconds):

brave-browser.desktop[4305]: [4347:4472:.../display.cc:275]
ERROR: Frame latency is negative: -0.067 ms


Crash timeline (EDT)

#   Last journal entry   Recovery (forced power-off → boot)
1   21:29:32             21:32:17
2   21:58:43             22:00:12

Between crashes: normal browsing with multiple tabs, some with video / autoplay thumbnails (consistent with VPE idle PG trigger discussed earlier).


Current kernel cmdline

BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.16.8-200.fc42.x86_64 root=UUID=... ro
rootflags=subvol=root rhgb quiet amdgpu.gfxoff=0

Notes:

  1. amdgpu.gfxoff=0
    Kernel logs:
amdgpu: unknown parameter 'gfxoff' ignored

This confirms:

  • not a valid exposed module parameter

  • effectively a no-op

  • not relevant to this issue

  2. Not currently using amdgpu.gpu_recovery=1
    Likely why freezes result in hard hangs rather than the recovery behavior others have observed.

Next steps

  • Add amdgpu.gpu_recovery=1 to allow recovery instead of hard freeze (grubby sketch after this list)

  • Switch primary browsing to Firefox as workaround

  • Defer amdgpu.no_vpe_idle_pg=1 until upstream / Fedora availability

  • Stay on current kernel for controlled comparison
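
For the first item, one way to make the cmdline change on Fedora is grubby (a sketch; amdgpu.gpu_recovery=1 itself is the documented recovery toggle):

# Append the parameter to every installed kernel's cmdline, then verify
sudo grubby --update-kernel=ALL --args="amdgpu.gpu_recovery=1"
sudo grubby --info=ALL | grep -o "amdgpu[^ ]*"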


If additional logs would help (full journalctl -b -1, dmesg, BIOS version, etc.), I can provide them. Also open to testing patched kernels if there’s a recommended build or COPR.

Appreciate the work digging into the dcn35_smu / VPE idle PG root cause.

@Myles_Baker You’re running a known-broken setup. You should upgrade to kernel 6.18.4+ and ensure that linux-firmware is at 20260110 or newer. It could be that your versions are old enough not to include the broken linux-firmware and kernel, but I’d upgrade either way.

Thanks for sharing the patch and methodology — the dynamic_debug trace approach mirrors how I got my own investigation moving on the same silicon.

One observation worth flagging: this likely addresses one trigger path, but the same end-state seems reachable via others. I’ve been chasing a similar silent hang on Framework Desktop / PMFW 100.6.0 under sustained KFD compute (vLLM / llama.cpp inference) on a headless server — no display server, no browser, no video decode. With pr_debug on smu_cmn.c I see VPE messages only at amdgpu probe init (zero runtime cycling during workload), and resp_reg stays at PPSMC_Result_OK right up to the freeze. So the trigger you’ve identified can’t apply on my path, but I still wedge with a near-identical end-state.
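
A quick way to make that “probe-init only” pattern visible is monotonic timestamps, which pin every VPE message to seconds-since-boot (assuming the smu_cmn tracing is enabled as described upthread):

# Init-only VPE messages will all cluster in the first seconds of boot
journalctl -k -o short-monotonic | grep -iE "Power(Up|Down)Vpe"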

The PMFW bug looks reachable from multiple trigger paths. Your patch should cleanly fix the display/video class; the compute class likely needs a separate firmware-side fix from AMD. Compute-side details tracked at ROCm/ROCm#6165.


Update: Crupi’s no_vpe_idle_pg patch (6.19.13-300.vpe1.fc44, amdgpu.no_vpe_idle_pg=1) does not fully fix this bug.

Reproduced 2026-04-27 22:40 EDT under pure inference load — no VAAPI, no Electron, no concurrent GPU clients. vLLM (TheRock ROCm image, Gemma 4 31B AWQ-INT4) hit a HW Exception on a compute queue (comp_1.1.1) ~23s after model load completed. Cascade:

HW Exception by GPU node-1 reason :GPU Hang
[drm] *ERROR* Failed to initialize parser -125
ring vpe test failed (-110)             # post-reset, collateral
Dpm disable jpeg failed, ret = -62
Failed to power gate VCN instance 0 / 1
SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000021
Failed to retrieve enabled ppfeatures   # repeating every 5s

Notes:

  • SMU stuck-command code is 0x21 on the patched kernel (was 0x32 unpatched).
    Same wedge state, different message ID.
  • No “Failed to power gate VPE!” in the cascade — consistent with VPE idle-pg
    being gated off by the patch. Wedge entered via JPEG/VCN power-gate path.
  • Conclusion: at least two triggers reach the same SMU mailbox wedge:
    (1) VAAPI → VPE idle power-gate [closed by Crupi patch]
    (2) Inference HW exception → MES REMOVE_QUEUE fail → MODE2 reset
    → JPEG/VCN power-gate fail → SMU wedge [still open]
  • Clean shutdown hung ~8 hours after wedge; manual power-cycle required.
    Same “hard reset only” recovery as 0x32.

@Domenico_Crupi your fix works! Will report here if the problem arises again.

A bit of trivia: NixOS unstable, triggered by videos in Brave. Qutebrowser never triggered this problem; it looks like it doesn’t use hardware acceleration at all. Until today the system just silently hung with no traces in journalctl -b -1; today I got:

May 02 18:00:35.531201 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
May 02 18:00:35.531691 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to power gate VPE!
May 02 18:00:35.531875 vglfr kernel: [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
May 02 18:00:40.824245 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
May 02 18:00:40.824704 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to export SMU metrics table!
May 02 18:00:41.021179 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
May 02 18:00:46.326174 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
May 02 18:00:46.334965 vglfr kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to disable gfxoff!

The boot before this one also hung on video, but there were no AMD-related errors at all; the last AMD message was at bootup.

The problem appeared sometime in late December 2025, despite heavy use of Brave since the purchase (early October).

Confirming Crupi’s no_vpe_idle_pg patch on a Framework Desktop running Arch.

Hardware: Framework Desktop, AMD Ryzen AI Max 300 (Strix Halo), Radeon 8060S iGPU (dcn35, VCN 4.0.5), 128 GB RAM, BIOS 0.0.3.4.

Failure history before the patch: four freezes in ~4 days, same exact signature each time —

amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032
amdgpu: Failed to power gate VPE!
amdgpu: vpe_set_powergating_state ... ret = -62

…then VCN power-ungate fail and PSP gfx command fail. Timeline: 2026-04-25 16:46, 2026-04-26 21:04, 2026-04-26 21:49, 2026-04-29 19:10. Both linux-lts (6.18.23, 6.18.24) and mainline linux (6.19.14) hit it.

Defenses that did NOT prevent recurrence:

  • amdgpu.dcdebugmask=0x10
  • BIOS 0.0.3.3 → 0.0.3.4
  • Kernel swap (mainline 6.19 ↔ lts 6.18)

Best time-to-fail on linux-lts + dcdebugmask was 23h 22min.

Fix applied 2026-04-30:

  • Built linux-lts 6.18.25 with the 3-line no_vpe_idle_pg patch as a parallel package (linux-lts-strix).
  • Added amdgpu.no_vpe_idle_pg=1 to GRUB cmdline (kept dcdebugmask=0x10 alongside, belt-and-suspenders).
  • Switched default browser from Chromium to Firefox to sidestep the VAAPI trigger.

Result so far:

  • Current uptime on the patched kernel: 1d 22h 31min with zero SMU/dcn35/VPE signature hits in the journal.
  • Previous boot on the same patched kernel ran clean for ~2.7 days before a normal shutdown — also zero hits.
  • That puts cumulative clean soak at roughly 4.5 days across two boots, against a prior baseline of a freeze every 23h or sooner.

Soak is still short relative to the failure rate, but the fact that two consecutive boots on the patched kernel hit zero signature events is a strong result when every prior config produced them within a day. I will report back if anything regresses.
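
For anyone running a similar soak, the signature scan across all retained boots can be done in one line (journalctl --grep requires systemd built with PCRE2, which most distro builds include):

# Scan kernel messages from every retained boot for the wedge signature
sudo journalctl -k --grep "not done with your previous command" --no-pager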

Thanks for tracking this down.