AMD GPU MES Timeouts Causing System Hangs on Framework Laptop 13 (AMD AI 300 Series)

I think I am reproducing this problem on my Framework Desktop. I was able to stop my frequent hangs / freezes while using Ollama by disabling PSR. My TV is connected to the onboard HDMI port.

Another thing to add: the issue only showed up on the DP-connected monitor. But after testing and seeing that my issues went away once I disconnected the TV, I realized that the issue started for me when I connected the TV!

Thank you for this @Jan_Theofel !

Well, the HDMI output is definitely part of my issue. I still frequently see things like this, but with only a “blackout” before the display comes back. When the HDMI is connected, I see the freeze. I guess disabling PSR had no effect on my issue, however similar it seems to me…

[ +2.693634] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ +0.000008] amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ +0.000002] amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ +0.000005] amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
[ +0.000004] amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
[ +0.000106] amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
[ +0.000031] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
[ +0.000922] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
[ +2.083439] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ +0.000009] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[ +0.001985] amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
[ +0.026373] amdgpu: Freeing queue vital buffer 0x7f5c97200000, queue evicted
[ +0.000023] amdgpu: Freeing queue vital buffer 0x7f5ca4a00000, queue evicted
[ +0.000006] amdgpu: Freeing queue vital buffer 0x7f5d1ca00000, queue evicted
[ +0.000004] amdgpu: Freeing queue vital buffer 0x7f630d400000, queue evicted
[ +0.000004] amdgpu: Freeing queue vital buffer 0x7f6324400000, queue evicted
[ +0.000782] amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
[ +0.000527] [drm] PCIE GART of 512M enabled (table at 0x0000008001300000).
[ +0.000036] amdgpu 0000:c3:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ +0.000004] amdgpu 0000:c3:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[ +0.000004] amdgpu 0000:c3:00.0: amdgpu: SMU is resuming…
[ +0.010470] amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
[ +0.013701] amdgpu 0000:c3:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09002E00
[ +0.067299] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ +0.000007] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ +0.000000] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ +0.000000] amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 4 on hub 8
[ +0.000000] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 6 on hub 8
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring vpe uses VM inv eng 7 on hub 8
[ +0.004846] amdgpu 0000:c3:00.0: amdgpu: GPU reset(53) succeeded!
[ +0.000015] amdgpu 0000:c3:00.0: [drm] device wedged, but recovered through reset
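For anyone who wants to compare how often this happens on their own machine, here is a rough sketch that counts the MES timeouts and successful resets in a saved kernel log. The two patterns are copied from the excerpt above and may be worded differently on other kernels. The number in parentheses in the "GPU reset(53) succeeded!" line appears to be the driver's running reset counter, but treat that as an assumption. The function name `summarize` is just made up for illustration.

```python
import re

# Patterns taken from the log excerpt in this thread; other kernel versions
# may phrase these messages differently.
MES_TIMEOUT = re.compile(r"MES failed to respond to msg=(\w+)")
RESET_OK = re.compile(r"GPU reset\((\d+)\) succeeded")


def summarize(log_text):
    """Return (MES message names that timed out, numbers from 'GPU reset(N) succeeded')."""
    timeouts = MES_TIMEOUT.findall(log_text)
    resets = [int(n) for n in RESET_OK.findall(log_text)]
    return timeouts, resets
```

To try it, save the log with something like `sudo dmesg > dmesg.txt` and call `summarize(open("dmesg.txt").read())`. If the second list keeps growing between checks, the GPU is still being reset even when the desktop only blacks out for a moment.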

In my case it was exactly the opposite:
When I had a screen attached, the system never crashed.
It only crashed when NOT connected to a screen…

Does this help:

The “cwsr_enable=0” bit.

Yeah this actually fixed it for me!!!
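In case it helps others confirm the setting took effect: the usual ways to pass a module parameter are `amdgpu.cwsr_enable=0` on the kernel command line or an `options amdgpu cwsr_enable=0` line in a file under /etc/modprobe.d/. The sketch below checks both the command line and what the loaded module reports. The sysfs path assumes the driver exposes the parameter as readable on your kernel, and the function names are hypothetical.

```python
from pathlib import Path


def cwsr_from_cmdline(cmdline):
    """Return the value of amdgpu.cwsr_enable on a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "amdgpu.cwsr_enable" and sep:
            return value
    return None


def cwsr_runtime(param=Path("/sys/module/amdgpu/parameters/cwsr_enable")):
    """Return the value the loaded module reports, or None if it cannot be read.

    Assumption: the parameter is exported under /sys/module/amdgpu/parameters/.
    """
    try:
        return param.read_text().strip()
    except OSError:
        return None
```

Calling `cwsr_from_cmdline(Path("/proc/cmdline").read_text())` shows what was requested at boot, and `cwsr_runtime()` shows what the driver is actually using. If the first is "0" and the second is still "1", the initramfs probably needs to be regenerated.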

Instead of the module parameter, there is a kernel fix in 6.17.2 and 6.18-rc1, provided your GPU microcode is new enough.

What kernel are you on, and what version of microcode? (The debugfs amdgpu_firmware_info file will tell you.)
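A quick way to answer the kernel half of that question is to compare the output of `uname -r` against 6.17.2. This is only a sketch: it assumes a plain numeric comparison is good enough, so it cannot tell whether a distro backported the fix to an older kernel, and it treats any 6.18 build as newer. The names `FIXED_IN`, `parse_release` and `has_fix` are made up for this example.

```python
import re

# Stable release named in this thread; 6.18-rc1 and later compare as newer.
FIXED_IN = (6, 17, 2)


def parse_release(release):
    """Turn a string like '6.17.2-arch1-1' or '6.18-rc1' into (major, minor, patch)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (as in '6.18-rc1') is treated as 0.
    return tuple(int(g) if g else 0 for g in m.groups())


def has_fix(release):
    """True if the release is numerically at or above the fixed stable version."""
    return parse_release(release) >= FIXED_IN
```

On the running system, `import platform; has_fix(platform.release())` gives the answer for the booted kernel.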

Do I have a recent enough firmware? What version do I need to find?
Caveat: This is output from a FW16 7840HS APU
My firmware info:
cat /sys/kernel/debug/dri/0000:c1:00.0/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x00000063
PFP feature version: 35, firmware version: 0x00000067
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x0000008a
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x0000000f
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x00000043
IMU feature version: 0, firmware version: 0x0b012d00
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648378, firmware version: 0x210000fa
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000049
TA DTM feature version: 0x00000000, firmware version: 0x1200001a
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x004c6000 (76.96.0)
SDMA0 feature version: 60, firmware version: 0x00000017
VCN feature version: 0, firmware version: 0x09118016
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x08005300
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x00000106
MES feature version: 1, firmware version: 0x00000080
VPE feature version: 0, firmware version: 0x00000000
VBIOS version: 113-PHXGENERIC-001

0x7f is the minimum version. Yours is new enough. I don’t think you have the same issue.
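For others who want to run the same check without reading the dump by eye, here is a small sketch that parses the amdgpu_firmware_info text and compares one component against a minimum. The thread does not say which line the 0x7f refers to. The default of "MES" below is my assumption (the dump above shows MES at 0x80), and the minimum may differ on other APU families. The function names are hypothetical.

```python
import re

# Matches lines such as 'MES feature version: 1, firmware version: 0x00000080'.
# Lines without a hex firmware version (for example the VBIOS line) are skipped.
LINE = re.compile(
    r"^(.+?) feature version: .*?firmware version: (0x[0-9a-fA-F]+)", re.M
)


def firmware_versions(text):
    """Map each component name in the dump to its firmware version as an int."""
    return {name.strip(): int(ver, 16) for name, ver in LINE.findall(text)}


def new_enough(text, component="MES", minimum=0x7F):
    """True if the component is listed and its firmware version is >= minimum.

    Assumption: the 0x7f minimum from the thread applies to the MES line.
    """
    versions = firmware_versions(text)
    return component in versions and versions[component] >= minimum
```

Feed it the contents of the debugfs file (reading it needs root), for example the text printed by the `cat` command earlier in the thread.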