I think I am reproducing this problem on my Framework Desktop. I was able to stop my frequent hangs / freezes while using Ollama by disabling PSR. My TV is connected to the onboard HDMI port.
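For anyone who wants to double-check that PSR is really off on their own machine, here is a rough sketch. It assumes PSR was disabled through the amdgpu.dcdebugmask kernel parameter, where the 0x10 bit is the PSR-disable flag. The function name psr_disabled is made up for illustration, and the check will not notice the option being set through modprobe.d instead of the kernel command line.

```python
import re

# Bit in amdgpu.dcdebugmask that turns PSR off (assumption: PSR was
# disabled this way, not through some other mechanism).
DC_DISABLE_PSR = 0x10


def psr_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line sets the PSR-disable bit."""
    match = re.search(r"\bamdgpu\.dcdebugmask=(\S+)", cmdline)
    if match is None:
        return False
    try:
        # Base 0 accepts both "0x10" and "16".
        mask = int(match.group(1), 0)
    except ValueError:
        return False
    return bool(mask & DC_DISABLE_PSR)


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel booted with.
    with open("/proc/cmdline") as f:
        print("PSR disabled via dcdebugmask:", psr_disabled(f.read()))
```

If this prints False even though the parameter was added to the bootloader config, the new command line has probably not been picked up yet (for example, no reboot since the change).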
One more thing to add: the issue only showed up on the DP-connected monitor. But after testing, my issues went away once I disconnected the TV, and I realized that they had started when I first connected it!
Well, the HDMI output is definitely part of my issue. I still frequently see things like this happen, but with only a “blackout” before the display comes back. When the HDMI is connected, I see the freeze. I guess disabling PSR had no effect on my issue, however similar it seems to me…
[ +2.693634] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ +0.000008] amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ +0.000002] amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ +0.000005] amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
[ +0.000004] amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
[ +0.000106] amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
[ +0.000031] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
[ +0.000922] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
[ +2.083439] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ +0.000009] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[ +0.001985] amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
[ +0.026373] amdgpu: Freeing queue vital buffer 0x7f5c97200000, queue evicted
[ +0.000023] amdgpu: Freeing queue vital buffer 0x7f5ca4a00000, queue evicted
[ +0.000006] amdgpu: Freeing queue vital buffer 0x7f5d1ca00000, queue evicted
[ +0.000004] amdgpu: Freeing queue vital buffer 0x7f630d400000, queue evicted
[ +0.000004] amdgpu: Freeing queue vital buffer 0x7f6324400000, queue evicted
[ +0.000782] amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
[ +0.000527] [drm] PCIE GART of 512M enabled (table at 0x0000008001300000).
[ +0.000036] amdgpu 0000:c3:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ +0.000004] amdgpu 0000:c3:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[ +0.000004] amdgpu 0000:c3:00.0: amdgpu: SMU is resuming…
[ +0.010470] amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
[ +0.013701] amdgpu 0000:c3:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09002E00
[ +0.067299] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ +0.000007] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ +0.000000] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ +0.000000] amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 4 on hub 8
[ +0.000000] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 6 on hub 8
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ +0.000001] amdgpu 0000:c3:00.0: amdgpu: ring vpe uses VM inv eng 7 on hub 8
[ +0.004846] amdgpu 0000:c3:00.0: amdgpu: GPU reset(53) succeeded!
[ +0.000015] amdgpu 0000:c3:00.0: [drm] device wedged, but recovered through reset
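In case it helps anyone compare logs, here is a small sketch that pulls the interesting bits out of dmesg output like the above: which MES messages timed out, the number printed in the "GPU reset(N) succeeded" line (which appears to be a running reset counter), and whether the driver reported recovering. The function name summarize_amdgpu_log and the dictionary keys are my own invention, and the patterns are taken only from the message wording in this log, so other kernel versions may phrase things differently.

```python
import re
import sys

# Patterns based on the message wording seen in the log above.
MES_TIMEOUT = re.compile(r"MES failed to respond to msg=(\w+)")
RESET_DONE = re.compile(r"GPU reset\((\d+)\) succeeded")
RECOVERED = "device wedged, but recovered through reset"


def summarize_amdgpu_log(text: str) -> dict:
    """Collect MES timeouts and GPU reset results from dmesg text."""
    return {
        # One entry per timed-out MES message, in log order.
        "mes_timeouts": MES_TIMEOUT.findall(text),
        # The number inside "GPU reset(N) succeeded" for each reset.
        "reset_counts": [int(n) for n in RESET_DONE.findall(text)],
        # True if the driver reported recovering from the wedge.
        "recovered": RECOVERED in text,
    }


if __name__ == "__main__":
    # Example: sudo dmesg | python3 this_script.py
    print(summarize_amdgpu_log(sys.stdin.read()))
```

Run against the log above, this should report two REMOVE_QUEUE timeouts, a reset number of 53, and a successful recovery.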
Do I have a recent enough firmware? What version do I need to find?
Caveat: This is output from a FW16 7840HS APU
My firmware info: cat /sys/kernel/debug/dri/0000:c1:00.0/amdgpu_firmware_info