[Framework Desktop] amdgpu SMU deadlock / system freeze on Fedora 43 – gfx1151 / DCN 3.5 (Ryzen AI MAX+ 395)


Hardware & Software

  • Machine: Framework Desktop
  • CPU/GPU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (gfx1151)
  • RAM: 128 GB
  • OS: Fedora 43
  • Kernel: 6.19.11-200.fc43.x86_64
  • Mesa: 25.3.6
  • amd-gpu-firmware: 20260309

Symptom

The system hard-freezes or triggers a ~2-minute GPU reset (black screen) when using Chromium-based browsers (Microsoft Edge, Google Chrome) on GPU-heavy web pages. Confirmed triggers include OneDrive (hovering over video thumbnails, scrolling) and the Microsoft sign-in page. YouTube does not trigger the issue. Bluetooth also drops during the freeze, consistent with a full GPU bus hang.

With amdgpu.gpu_recovery=1 active, the system can sometimes recover on its own with a MODE2 ASIC reset taking approximately 2–3 minutes. Without it, a hard power-off is required.


Root Cause (identified via dmesg)

This is an SMU (System Management Unit) deadlock in the dcn35_smu_enable_pme_wa function, triggered during display pipe teardown when the GPU attempts to reset. The exact sequence:

  1. Edge/Chrome GPU process triggers a GFX ring workload
  2. The SMU is already busy with a pending command (SMN_C2PMSG_66:0x00000032)
  3. The driver attempts to disable gfxoff to prepare for GPU recovery — the SMU cannot respond
  4. ring gfx_0.0.0 times out; MES also fails to respond to reset requests
  5. A full MODE2 ASIC reset is eventually triggered

Key dmesg output:

amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu: Failed to disable gfxoff!
amdgpu: ring gfx_0.0.0 timeout, signaled seq=37512, emitted seq=37514
amdgpu: MES failed to respond to msg=RESET
amdgpu: failed to reset legacy queue
amdgpu: Ring gfx_0.0.0 reset failed
amdgpu: GPU reset begin!
WARNING: dcn35_smu.c:175 at dcn35_smu_send_msg_with_param+0x166/0x190 [amdgpu]
Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
 dcn35_smu_enable_pme_wa+0x23/0x60 [amdgpu]
 link_set_dpms_off
 dcn31_reset_back_end_for_pipe
 dcn31_reset_hw_ctx_wrap
 dce110_apply_ctx_to_hw
 dc_commit_state_no_check
 dc_commit_streams
 dm_suspend
 amdgpu_device_pre_asic_reset
amdgpu: MODE2 reset
amdgpu: GPU reset succeeded
[drm] device wedged, but recovered through reset
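For anyone comparing their own logs against this sequence, here is a minimal Python sketch that tags the signature lines in order. The message fragments are copied from the excerpt above; the stage labels and the `classify` helper are names made up for this sketch, not driver terminology.

```python
import re

# Message fragments taken from the dmesg excerpt in this post.
# The stage labels on the left are invented for illustration only.
SIGNATURES = [
    ("smu_busy", re.compile(r"SMU: I'm not done with your previous command")),
    ("gfxoff_fail", re.compile(r"Failed to disable gfxoff")),
    ("ring_timeout", re.compile(r"ring \S+ timeout")),
    ("reset_begin", re.compile(r"GPU reset begin")),
    ("pme_wa_frame", re.compile(r"dcn35_smu_enable_pme_wa")),
    ("mode2_reset", re.compile(r"MODE2 reset")),
]


def classify(log_text):
    """Return the stage labels found in log_text, in the order they appear."""
    stages = []
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES:
            if pattern.search(line):
                stages.append(name)
                break  # at most one label per line
    return stages
```

Feeding it the output of `dmesg` or `journalctl -k` saved to a file should show whether the SMU busy message precedes the ring timeout, as it does in the log above.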

Workarounds attempted

Parameter                                | Effect
amdgpu.runpm=0                           | No effect on this bug
amdgpu.gpu_recovery=1                    | Enables MODE2 recovery instead of hard freeze; helpful but slow
amdgpu.dcdebugmask=0x10                  | No effect
amdgpu.gfxoff=0                          | No effect; bug is in dcn35_smu_enable_pme_wa, which bypasses this flag
power_dpm_force_performance_level=high   | Delays onset but does not prevent the crash
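Unlike the `amdgpu.*` entries, `power_dpm_force_performance_level` is a sysfs file rather than a boot parameter. A small sketch for checking that the setting actually took effect, assuming the GPU is `card1` (as in the devcoredump path that appears in logs later in this thread; the index may differ on other systems). `read_perf_level` is a hypothetical helper name.

```python
from pathlib import Path


def read_perf_level(card="card1", sysfs_root="/sys/class/drm"):
    """Return the current forced performance level, or None if the file is missing.

    The card index is an assumption; check /sys/class/drm on your own machine.
    """
    node = Path(sysfs_root) / card / "device" / "power_dpm_force_performance_level"
    try:
        return node.read_text().strip()
    except OSError:
        return None
```

If the level was set to `high` as described above, `read_perf_level()` should return `"high"`.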

Assessment

This appears to be a kernel driver bug specific to gfx1151 / DCN 3.5 hardware. The dcn35_smu_enable_pme_wa function does not handle a busy SMU gracefully — when the SMU already has a pending command, any subsequent message sent during display teardown deadlocks the entire reset path. A proper fix would need to come from AMD’s kernel team adding a busy-check or timeout-skip in that function.

I am also filing this upstream with the full log output.

Is anyone else on the Framework Desktop hitting this? Any additional workarounds welcome.

Hi there… yes, I’m seeing similar logs with my FW Desktop 64GB

Apr 14 00:52:05 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 14 00:52:05 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to power gate VCN instance 1!
Apr 14 00:52:05 SwagAI kernel: [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.

Apr 14 00:52:20 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 14 00:52:20 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to disable gfxoff!

Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to disable gfxoff!
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 timeout, signaled seq=101884, emitted seq=101886
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu:  Process chrome pid 326559 thread chrome:cs0 pid 326575
Apr 14 00:52:36 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: Starting vcn_unified_0 ring reset

Apr 14 00:52:47 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 14 00:52:47 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to retrieve enabled ppfeatures!

Apr 14 00:53:07 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Apr 14 00:53:07 SwagAI kernel: amdgpu 0000:c3:00.0: amdgpu: failed to reg_write_reg_wait
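A side note on the `ret = -62` in the log above: kernel functions return negated errno values, and 62 is `ETIME` ("Timer expired") on Linux, which fits a message that timed out waiting for the SMU. A small sketch of the lookup, with a hand-written table so it does not depend on the platform it runs on (`describe_kernel_ret` is a hypothetical helper name):

```python
# A few Linux errno values relevant to timeouts and busy devices.
# This table is intentionally tiny; it is not a complete errno list.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    62: ("ETIME", "Timer expired"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def describe_kernel_ret(ret):
    """Translate a negative kernel return code into (name, message)."""
    return LINUX_ERRNO.get(-ret, ("UNKNOWN", "not in this table"))
```

`describe_kernel_ret(-62)` gives `("ETIME", "Timer expired")`.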

I’m on CachyOS with 6.19.12-1-cachyos or 6.18-lts … happens on both

EDIT:

Operating System: CachyOS Linux
KDE Plasma Version: 6.6.4
KDE Frameworks Version: 6.25.0
Qt Version: 6.11.0
Kernel Version: 6.19.12-1-cachyos (64-bit)
Graphics Platform: Wayland
Processors: 32 × AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Memory: 64 GiB of RAM (62.1 GiB usable)
Graphics Processor: Radeon 8060S Graphics
Manufacturer: Framework
Product Name: Desktop (AMD Ryzen AI Max 300 Series)
System Version: A4

Framework support had me reset my mainboard to see if that would resolve the problem. I was able to reproduce the problem at will previously, but since the reset, things have been running smoothly.

Interesting! How do I do a mainboard reset?

Here are the instructions: Mainboard Reset - Framework Guides

Ahh, so that’s how it works! I was planning to install another fan in the next few days anyway, so I can just do that at the same time.

See Mainboard Reset - Framework Guides

Update – April 15

Following advice from Framework support, I performed a chipset reset on the mainboard. This appeared to help initially — I could no longer reproduce the crash on OneDrive immediately afterward. However, the system crashed again today after opening a new tab in Edge and navigating to a website, confirming the issue is not specific to OneDrive or any particular web content.

Additional workarounds tried and confirmed ineffective:

  • amdgpu.mes=0 — no effect; MES scheduler is not the root cause
  • power_dpm_force_performance_level=high — delays onset but does not prevent the crash
  • Chipset reset — no effect

New finding in latest crash log:

A new error now appears before the ring timeout, suggesting the hang is occurring slightly earlier in the pipeline:

amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait

The full crash sequence remains the same — dcn35_smu_enable_pme_wa deadlocks the reset path, requiring a full MODE2 ASIC reset (~2 minutes to recover).

Active kernel parameters during this crash:

amdgpu.runpm=0 amdgpu.gpu_recovery=1 amdgpu.dcdebugmask=0x10 amdgpu.gfxoff=0 amdgpu.mes=0
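To confirm which of these options the running kernel actually received, the command line can be parsed. A minimal sketch (`amdgpu_params` is a hypothetical helper name; it only reports what is on the command line and cannot tell whether the driver recognises or honours a given option):

```python
def amdgpu_params(cmdline):
    """Extract amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params
```

On a live system this could be called as `amdgpu_params(open("/proc/cmdline").read())`.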

Firmware versions (from amdgpu_firmware_info):

  • SMC: program 10, version 100.6.0 (0x0a640600)
  • MES_KIQ: version 6, firmware 0x6f
  • MES: version 1, firmware 0x86
  • DMCUB: 0x09003e00
  • VCN: 0x09118010
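The SMC line shows how the raw hex value relates to the human-readable version: `0x0a640600` splits into the bytes `0x0a`, `0x64`, `0x06`, `0x00`, i.e. program 10, version 100.6.0. A short sketch of that decoding, where the one-byte-per-field layout is inferred from this single value and is an assumption rather than a documented format:

```python
def decode_smc_version(raw):
    """Split a 32-bit SMC version word into (program, 'major.minor.debug').

    Assumes one byte per field, most significant byte first.
    """
    program = (raw >> 24) & 0xFF
    major = (raw >> 16) & 0xFF
    minor = (raw >> 8) & 0xFF
    debug = raw & 0xFF
    return program, f"{major}.{minor}.{debug}"
```

`decode_smc_version(0x0a640600)` returns `(10, "100.6.0")`, matching the line above. This is handy when comparing firmware versions between machines in this thread.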

This is clearly a kernel driver bug in dcn35_smu.c that no amount of kernel parameters or hardware resets will fix. I have also filed this upstream. If anyone has found a working workaround or has seen a kernel patch addressing dcn35_smu_enable_pme_wa, please reply.

The evidence I’m seeing on my machine agrees with your hypothesis. Here’s a recent trace that implicates dcn35_smu_enable_pme_wa:

[  137.007305] ------------[ cut here ]------------
[  137.007312] WARNING: CPU: 6 PID: 12 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn35/dcn35_smu.c:175 dcn35_smu_send_msg_with_param+0x166/0x190 [amdgpu]
[  137.007876] Modules linked in: uinput snd_seq_dummy snd_hrtimer des3_ede_x86_64 des_generic libdes md4 nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables qrtr bnep sunrpc binfmt_misc vfat fat intel_rapl_msr amd_atl intel_rapl_common snd_hda_codec_alc269 snd_hda_scodec_component snd_hda_codec_realtek_lib snd_hda_codec_generic mt7925e snd_hda_codec_atihdmi mt7925_common snd_hda_codec_hdmi btusb edac_mce_amd mt792x_lib snd_hda_intel btrtl mt76_connac_lib snd_hda_codec uvcvideo btintel kvm_amd mt76 btbcm snd_hda_core uvc btmtk videobuf2_vmalloc leds_cros_ec cros_ec_sysfs cros_ec_chardev cros_ec_hwmon gpio_cros_ec led_class_multicolor snd_usb_audio bluetooth snd_intel_dspcfg spd5118 videobuf2_memops kvm gpio_keys cros_ec_dev snd_intel_sdw_acpi mac80211 xpad snd_usbmidi_lib videobuf2_v4l2 videobuf2_common snd_ump ff_memless snd_seq rapl snd_hwdep
[  137.007926]  joydev snd_rawmidi videodev snd_seq_device libarc4 mc wmi_bmof snd_pcm pcspkr cfg80211 snd_timer r8169 i2c_piix4 amd_pmf snd amdxdna i2c_smbus soundcore amdtee rfkill amd_sfh realtek cros_ec_lpcs tee cros_ec amd_pmc cros_ec_proto soc_button_array platform_profile tcp_bbr tun loop nfnetlink zstd zram lz4hc_compress lz4_compress overlay erofs netfs dm_crypt amdgpu ucsi_acpi typec_ucsi typec amdxcp drm_panel_backlight_quirks gpu_sched drm_buddy drm_ttm_helper ttm drm_exec i2c_algo_bit drm_suballoc_helper drm_display_helper nvme polyval_clmulni thunderbolt nvme_core ghash_clmulni_intel cec nvme_keyring video nvme_auth wmi sp5100_tco vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd ntsync pkcs8_key_parser fuse i2c_dev uhid vhba kvmfr gcadapter_oc
[  137.007979] CPU: 6 UID: 0 PID: 12 Comm: kworker/u128:0 Not tainted 6.17.7-ba29.fc43.x86_64 #1 PREEMPT(lazy) 
[  137.007984] Hardware name: Framework Desktop (AMD Ryzen AI Max 300 Series)/FRANMFCP04, BIOS 03.04 11/19/2025
[  137.007987] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[  137.007995] RIP: 0010:dcn35_smu_send_msg_with_param+0x166/0x190 [amdgpu]
[  137.008412] Code: 00 be 9b 62 01 00 48 c7 c1 50 c9 85 c1 e8 12 3f dc ff 48 8b 03 48 8b 40 10 48 8b 38 48 85 ff 0f 84 91 47 33 00 e9 88 47 33 00 <0f> 0b 48 89 df e8 30 fe ff ff 48 8b 13 48 8b 52 10 48 8b 3a 48 85
[  137.008415] RSP: 0018:ffffcf054018f830 EFLAGS: 00010246
[  137.008418] RAX: 0000000000000000 RBX: ffff8ab38e178400 RCX: 0000000000000006
[  137.008420] RDX: 0000000000007ddc RSI: 00000000000074fa RDI: ffff8ab388cb0600
[  137.008422] RBP: 00000000ffffffff R08: ffff8ab382303a80 R09: ffffcf054018f7c8
[  137.008423] R10: ffffcf054018f7c0 R11: 0000000000000001 R12: 000000000000000d
[  137.008425] R13: 0000000000000000 R14: ffff8ab3822fdda0 R15: 0000000000000338
[  137.008426] FS:  0000000000000000(0000) GS:ffff8ac306546000(0000) knlGS:0000000000000000
[  137.008428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.008430] CR2: 00002c9c006ff020 CR3: 0000000d65c2c000 CR4: 0000000000f50ef0
[  137.008432] PKRU: 55555554
[  137.008434] Call Trace:
[  137.008437]  <TASK>
[  137.008441]  dcn35_smu_enable_pme_wa+0x23/0x50 [amdgpu]
[  137.008820]  link_set_dpms_off+0x111/0x400 [amdgpu]
[  137.009262]  dcn31_reset_back_end_for_pipe.isra.0+0x126/0x300 [amdgpu]
[  137.009681]  dcn31_reset_hw_ctx_wrap+0xf0/0x220 [amdgpu]
[  137.010069]  dce110_apply_ctx_to_hw+0x68/0x360 [amdgpu]
[  137.010463]  dc_commit_state_no_check+0x38e/0xe40 [amdgpu]
[  137.010824]  dc_commit_streams+0x2eb/0x650 [amdgpu]
[  137.011155]  dm_suspend+0x270/0x320 [amdgpu]
[  137.011561]  amdgpu_ip_block_suspend+0x24/0x50 [amdgpu]
[  137.011801]  amdgpu_device_ip_suspend_phase1+0x92/0xf0 [amdgpu]
[  137.012055]  amdgpu_device_ip_suspend+0x2c/0x80 [amdgpu]
[  137.012300]  amdgpu_device_pre_asic_reset+0xed/0x510 [amdgpu]
[  137.012551]  amdgpu_device_asic_reset+0x52/0x205 [amdgpu]
[  137.013023]  amdgpu_device_gpu_recover.cold+0x22b/0x234 [amdgpu]
[  137.013426]  amdgpu_job_timedout.cold+0x111/0x24c [amdgpu]
[  137.013846]  drm_sched_job_timedout+0x7a/0x160 [gpu_sched]
[  137.013852]  process_one_work+0x18f/0x350
[  137.013860]  worker_thread+0x25a/0x3a0
[  137.013864]  ? __pfx_worker_thread+0x10/0x10
[  137.013867]  kthread+0xf9/0x240
[  137.013871]  ? __pfx_kthread+0x10/0x10
[  137.013873]  ? __pfx_kthread+0x10/0x10
[  137.013875]  ret_from_fork+0xf1/0x110
[  137.013881]  ? __pfx_kthread+0x10/0x10
[  137.013883]  ret_from_fork_asm+0x1a/0x30
[  137.013888]  </TASK>
[  137.013889] ---[ end trace 0000000000000000 ]---
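For comparing traces between machines, it can help to reduce a dump like this to the bare call chain. A rough Python sketch that pulls the function names out of the timestamped frame lines and skips the speculative `?` entries; the regex is written against the line format shown above and `call_chain` is a hypothetical helper name:

```python
import re

# Matches "[  137.008441]  func_name+0x23/0x50 [module]" style frame lines.
# Group 1 is set for unreliable frames that the kernel prefixes with "? ".
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_chain(trace_text):
    """Return the reliable frames of a kernel call trace, innermost first."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME.match(line)
        if match and not match.group(1):
            frames.append(match.group(2))
    return frames
```

Applied to the trace above, the first entry is `dcn35_smu_enable_pme_wa` and the chain runs back through `dm_suspend` and `amdgpu_device_pre_asic_reset` to `drm_sched_job_timedout`, the same path as in the original post.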

Same bug for the past 2 days, same distro.

I disabled hardware video decode in Chromium.

I think it’s new; I didn’t have it before.

I switched to using Firefox and updated to Kernel 7.0.0.
So far, I haven’t experienced any freezes, but I also haven’t used the PC that much over the past few days.

Same issue for me, but on Ubuntu.

  • Ubuntu 25.10 x86_64
  • GNOME 49.0
  • Mutter (Wayland)
  • AMD RYZEN AI MAX+ 395

[ 3536.625642] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3536.625654] amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VPE!
[ 3536.625658] [drm:amdgpu_dpm_enable_vpe [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
[ 3541.439018] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3541.439025] amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VCN instance 1!
[ 3541.439027] [drm:amdgpu_dpm_enable_vcn [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
[ 3544.902422] amdgpu 0000:c2:00.0: amdgpu: Dumping IP State
[ 3549.702445] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3549.702453] amdgpu 0000:c2:00.0: amdgpu: Failed to disable gfxoff!
[ 3554.505681] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3554.505693] amdgpu 0000:c2:00.0: amdgpu: Failed to disable gfxoff!
[ 3559.308523] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3559.308535] amdgpu 0000:c2:00.0: amdgpu: Failed to disable gfxoff!
[ 3564.111256] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3564.111268] amdgpu 0000:c2:00.0: amdgpu: Failed to disable gfxoff!
[ 3564.111361] amdgpu 0000:c2:00.0: amdgpu: Dumping IP State Completed
[ 3564.111451] amdgpu 0000:c2:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ 3564.111458] amdgpu 0000:c2:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[ 3564.111461] amdgpu 0000:c2:00.0: amdgpu: ring sdma0 timeout, signaled seq=74533, emitted seq=74537
[ 3564.111465] amdgpu 0000:c2:00.0: amdgpu: GPU reset begin!
[ 3568.703656] amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
[ 3568.703665] amdgpu 0000:c2:00.0: amdgpu: Failed to disable gfxoff!
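The timestamps in this excerpt show the SMU busy message repeating at fairly regular intervals; several consecutive repeats are about 4.8 s apart. That would be consistent with a fixed per-message wait before the driver gives up, though nothing in this thread confirms it. A small sketch for measuring the gaps in a dmesg dump (`smu_busy_gaps` is a hypothetical helper name):

```python
import re

# Captures the dmesg timestamp of every SMU busy message.
STAMP = re.compile(
    r"^\[\s*(\d+\.\d+)\].*SMU: I'm not done with your previous command"
)


def smu_busy_gaps(dmesg_text):
    """Return the seconds elapsed between consecutive SMU busy messages."""
    times = []
    for line in dmesg_text.splitlines():
        match = STAMP.match(line)
        if match:
            times.append(float(match.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
```

A steadily repeating gap suggests the driver keeps retrying against an SMU that never answers, rather than the SMU recovering between attempts.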

I’ve created a bug entry on Ubuntu Launchpad: Bug #2148686 “FrameworkDesktop: System randomly hangs / freezes ...” (xserver-xorg-video-amdgpu package)

I had freezing on my GMKtec AI MAX 395 machine at work but could never get logs because of the freezes. (I am posting from my home PC, a Framework Desktop.) On the work PC I had a custom notification extension for GNOME installed, and it would freeze the system immediately whenever a notification from Edge came in while I was visiting godaddy.com. That was the only website that triggered it, though I am sure other sites or situations would have done so too.

So it turned out Edge wasn’t crashing my system; it was this poorly written GNOME extension, and those can somehow cause a CPU halt. I don’t have logs; the only thing I know is that godaddy.com would trigger the halt. I’m not saying that’s the issue in this thread, but I did have freezing that I thought was caused by Edge, and with no logs it was a struggle to track down. The work system is Fedora 43 with GNOME; my home PC is Fedora 43 with KDE, and I have had no problems at all with my Framework since I got it.

I’m on NixOS. I’ve been running into system lockups recently and I’ve tried these kernels: 6.18 LTS, 6.12 LTS, 6.19 and 7. The GUI seems to freeze for 30 seconds to a minute, and during that time I can still SSH in. Once the monitors all go black I can no longer connect over SSH and have to force a reset by holding the power button or unplugging the AC cord.

I do not do much that is extremely GPU intensive, no gaming or LLMs, but I do use Chromium for YouTube, Reddit, DuckDuckGo, forums and other complicated websites. I do have four 1440p monitors, including a 5120x1440 ultrawide.

Log from last crash on 6.18 LTS:

Apr 19 12:24:07 fwdesktop kernel: cros-ec-dev cros-ec-dev.2.auto: Some logs may have been dropped...
Apr 19 12:24:12 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 19 12:24:12 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VPE!
Apr 19 12:24:12 fwdesktop kernel: [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
Apr 19 12:24:16 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Apr 19 12:24:16 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VCN instance 0!
Apr 19 12:24:16 fwdesktop kernel: [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
Apr 19 12:24:18 fwdesktop kernel: amdgpu 0000:c2:00.0: amdgpu: Dumping IP State
Apr 19 12:24:22 fwdesktop kernel: r8169 0000:bf:00.0 enp191s0: NETDEV WATCHDOG: CPU: 6: transmit queue 0 timed out 5306 ms

Forgot to add, I have the 128GB model and I have tried multiple SSDs with the same result. I have tried on firmware version 0.0.3.3 and 0.0.3.4.

I too got this problem when upgrading to NixOS 25.11. After experimenting with different kernel and firmware combinations I found that it was the Mesa upgrade that triggered the freezes. I’ve been using a 25.05 overlay to pull in the old Mesa 25.0.7 as a workaround.

I’ll try that later today when I get a chance. Just in case it matters can you check what Kernel and linux-firmware version you are using so I can try and replicate your working setup?

My kernel (6.17.9) and firmware (20251111) are still stuck on 25.05 too, as that was the last config I tried when I got it working and I was too sick of rebooting to try the Mesa downgrade isolated with a current kernel.

I’m doing this in my (probably very unidiomatic) system flake:

{
  inputs = {
    ...
    nixpkgs-2505.url = "github:NixOS/nixpkgs/df3793a20f2e3f62359faf5eb084d3bb88347682";
    ...
  };
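  # Assumed: the `outputs` header was elided in the original post.
  outputs = inputs@{ nixpkgs, ... }:
  let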
    ...
    pkgs2505 = import inputs.nixpkgs-2505 {
      inherit system;
      config.allowUnfree = true;
    };
    kernelPackages_2505 = pkgs2505.linuxPackages_latest;
    ...
  in {
    nixosConfigurations.frametop = nixpkgs.lib.nixosSystem {
      inherit system;

      specialArgs = {
        inherit inputs outputs kernelPackages_2505;
      };
      modules = baseModules ++ [
        ({ kernelPackages_2505, ... }: {
          boot.kernelPackages = kernelPackages_2505;
          hardware.firmware = [ pkgs2505.linux-firmware ];

          hardware.graphics = {
            enable = true;
            package = pkgs2505.mesa;
          };
        })
        ...
      ];
    };
  };
}

Seems to be working, thank you so much. If it keeps working, I’ll edit this post with the version of mesa, linux and linux-firmware tomorrow so people on other distros can fix it too.

Sounds like the issue I’ve been encountering. I documented my findings here: [gfx1151] Silent hard hang under sustained vLLM inference with MES 0x86 — amdgpu hangcheck never fires · Issue #6165 · ROCm/ROCm · GitHub

I had this issue on the FW16, and there is an actual kernel/driver/firmware bug in 6.18 and 6.19 that causes these problems. The best option is to either try 7.0 (I have not had problems yet after hours of gaming) or roll back to an earlier version.

This forum post has a bit more info: Attn: critical bugs in amdgpu driver included with kernel 6.18.x / 6.19.x!