SMU deadlock / system freeze on Fedora 43

Workaround for SMU deadlock / GPU freeze on Strix Halo — disable VPE idle power gating

TL;DR: A small kernel patch adds an amdgpu.no_vpe_idle_pg module parameter. Setting it to 1 prevents the VPE (Video Processing Engine) from power cycling during normal use, which eliminates the SMU deadlock that causes hard freezes during browser hardware video decode. The system has been stable for 48+ hours with YouTube HW decode, where it previously crashed within 5-10 minutes.

System

  • Framework Desktop, AMD Ryzen AI Max 300 Series (Strix Halo, gfx1151)

  • BIOS INSYDE 03.04, PMFW 100.6.0

  • Kernel 7.0.1 (CachyOS), Wayland/KDE Plasma

  • Brave/Chrome with VAAPI hardware video decode enabled

The problem

When browsers use VAAPI hardware video decode (enabled by default since Chromium 143 / Brave 1.85 on Wayland), the amdgpu driver rapidly cycles VPE power state (PowerDownVpe / PowerUpVpe) via SMU messages every time a video starts, stops, or changes. The PMFW 100.6.0 firmware cannot handle this cycling — even a single PowerDown→PowerUp cycle can leave the SMU in a corrupted state where resp_reg gets stuck at 0. A few seconds later, the next SMU message times out, cascading into:


SMU: No response msg_reg: 32 resp_reg: 0
Failed to power gate VPE!
Failed to disable gfxoff!
ring gfx_0.0.0 timeout
GPU reset begin!

This is followed by a hard freeze that requires a power cycle.
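To check whether an earlier freeze matches this signature, the previous boot's kernel log can be searched for the same messages. This is a minimal sketch, assuming a persistent journal so that `journalctl -b -1` has data; the function name `smu_freeze_signature` is made up for illustration. If the machine locks up before the journal is flushed to disk, the last lines may be missing.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints every line
# that matches one of the failure messages quoted above.
# Exit status is grep's: 0 if at least one line matched, 1 if none did.
smu_freeze_signature() {
    grep -E 'SMU: No response|Failed to power gate VPE|Failed to disable gfxoff|ring gfx_[0-9.]+ timeout'
}

# Usage (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | smu_freeze_signature
```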

Root cause analysis

Using dynamic_debug tracing on smu_cmn.c, I captured the exact SMU message sequence before crashes. Key findings:

  1. VPE is always involved — every crash includes PowerDownVpe (msg 0x32). VCN alone (PowerDownVcn0/Vcn1) cycles fine without crashes.

  2. Timing doesn’t matter — tested settlement delays of 3ms (stock), 60ms, and 200ms between consecutive SMU messages. All crash. The bug is not about messages arriving “too fast.”

  3. A single cycle is enough — one PowerDownVpe followed by one PowerUpVpe can corrupt the firmware state. It doesn’t require accumulation.

  4. VCN cycling without VPE is stable — with VPE idle power gating disabled, VCN0 and VCN1 cycle freely (40+ transitions in 2 minutes) with zero errors.
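Findings 1 and 4 come down to counting which power messages show up in the trace. One way to tally them from a captured log is sketched here. It assumes only that the message names (PowerUpVpe, PowerDownVcn0, and so on) appear verbatim in the log lines; `count_power_msgs` is a hypothetical name, not part of any tool.

```shell
# Hypothetical helper: reads log text on stdin and prints one line per
# power message name, followed by how many times it occurred.
count_power_msgs() {
    grep -oE 'Power(Up|Down)(Vpe|Vcn[0-9]+)' \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}

# Usage (current boot):
#   journalctl -k -b --no-pager | count_power_msgs
```

With VPE idle power gating disabled, the Vpe rows should stop growing after boot while the Vcn0/Vcn1 counts keep increasing.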

The fix

The patch adds a module parameter amdgpu.no_vpe_idle_pg. When set to 1, vpe_ring_end_use() skips scheduling the idle work handler, so VPE stays powered after its first use. Suspend/resume is NOT affected — hw_fini / hw_init handle that path separately.

Power cost: ~0.5-1W idle (VPE block stays clocked). Negligible on a desktop system.

Patch (applies to kernel 7.0.x, should apply to 6.18+ with minor fuzz):


--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -248,6 +248,7 @@
 int amdgpu_umsch_mm_fwlog;
 int amdgpu_rebar = -1; /* auto */
 int amdgpu_user_queue = -1;
+int amdgpu_no_vpe_idle_pg;
 uint amdgpu_hdmi_hpd_debounce_delay_ms;
@@ -424,6 +425,20 @@
 module_param_named_unsafe(ip_block_mask, amdgpu_ip_block_mask, uint, 0444);
 /**
+ * DOC: no_vpe_idle_pg (int)
+ * Disable VPE (Video Processing Engine) idle power gating (1 = VPE stays
+ * powered during normal use, 0 = normal power gating). Workaround for AMD
+ * Strix Halo PMFW 100.6.0 where PowerDownVpe/PowerUpVpe cycling causes an
+ * SMU deadlock during browser hardware video decode. Suspend/resume is not
+ * affected - hw_fini/hw_init handle that path separately. The default is 0
+ * (normal power gating behavior).
+ */
+MODULE_PARM_DESC(no_vpe_idle_pg,
+	"Disable VPE idle power gating (1 = skip, 0 = normal). "
+	"Workaround for Strix Halo PMFW 100.6.0 SMU deadlock (default: 0)");
+module_param_named(no_vpe_idle_pg, amdgpu_no_vpe_idle_pg, int, 0444);
+
+/**
  * DOC: bapm (int)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -269,6 +269,7 @@
 extern int amdgpu_wbrf;
 extern int amdgpu_user_queue;
+extern int amdgpu_no_vpe_idle_pg;
 extern uint amdgpu_hdmi_hpd_debounce_delay_ms;
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -896,7 +896,8 @@
 {
 	struct amdgpu_device *adev = ring->adev;
-	schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
+	if (!amdgpu_no_vpe_idle_pg)
+		schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
 }

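How to use (requires a patched amdgpu module)

The parameter only exists once the patch is applied; on an unpatched kernel, setting it has no effect. If your distro provides a way to patch the kernel (DKMS, out-of-tree module rebuild, or custom kernel), apply the patch above and add this to the kernel command line: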


amdgpu.no_vpe_idle_pg=1

Or in /etc/modprobe.d/amdgpu-vpe.conf:


options amdgpu no_vpe_idle_pg=1
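Because the parameter is registered with mode 0444, a patched module exposes it read-only under /sys/module/amdgpu/parameters/. Below is a small sketch for confirming the option actually took effect after boot. The helper name and its optional path argument are illustrative only.

```shell
# Hypothetical helper: reports whether the no_vpe_idle_pg parameter is set.
# The optional first argument overrides the sysfs path (handy for testing).
# Returns 0 if set to 1, 1 if present but not 1, 2 if the file is missing.
check_no_vpe_idle_pg() {
    param_file="${1:-/sys/module/amdgpu/parameters/no_vpe_idle_pg}"
    if [ ! -r "$param_file" ]; then
        echo "parameter not present (unpatched amdgpu module?)"
        return 2
    fi
    if [ "$(cat "$param_file")" = "1" ]; then
        echo "VPE idle power gating disabled"
    else
        echo "VPE idle power gating still enabled"
        return 1
    fi
}

# Usage:
#   check_no_vpe_idle_pg
```

If the modprobe.d route reports the parameter as still 0, one possible cause is that amdgpu is loaded from the initramfs, in which case the initramfs may need to be regenerated so it picks up the new file.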

How to verify it’s working


# Enable SMU message tracing
echo 'module amdgpu file smu_cmn.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Watch power messages (open/close YouTube videos in browser)
journalctl -kf -o short-precise | grep -iE "PowerUp|PowerDown"

Expected: PowerUpVcn0, PowerDownVcn0, PowerUpVcn1, PowerDownVcn1 cycling normally. No PowerUpVpe or PowerDownVpe messages after initial boot.
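The check can be made less manual by capturing the trace for a fixed window and failing if any VPE power message shows up. This is a sketch that assumes the message names appear in the log exactly as listed above; `vpe_msgs_absent` is a hypothetical helper.

```shell
# Hypothetical helper: reads log text on stdin, prints how many VPE power
# messages it saw, and succeeds only if that count is zero.
vpe_msgs_absent() {
    # grep -c exits non-zero when the count is 0, so mask its exit status.
    n=$(grep -cE 'Power(Up|Down)Vpe' || true)
    echo "VPE power messages seen: $n"
    [ "$n" -eq 0 ]
}

# Usage: capture two minutes of new kernel messages (-n 0 skips history)
# while opening and closing videos in the browser.
#   timeout 120 journalctl -kf -n 0 -o short-precise --no-pager | vpe_msgs_absent
#
# Turn the tracing back off when done:
#   echo 'module amdgpu file smu_cmn.c -p' | sudo tee /sys/kernel/debug/dynamic_debug/control
```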

Request

If you have a Framework Desktop (Strix Halo) experiencing SMU deadlock / GPU freezes, please test this patch and report results. With 3-5 confirmations from different users I’ll submit it upstream to the amd-gfx mailing list.
