FW13 amdgpu resets on heavy load?

Hello all,

Recently, I have been testing my laptop. When I stress-test its GPU, it keeps going into a soft reset.
During an amdgpu reset, the screen freezes for a moment, goes black, then resumes, and the program that was stressing the GPU crashes. The test I used was the Unigine Heaven test run through the Phoronix Test Suite. The reset happens almost every time the camera moves close to the pebbles in the starting scene or moves around in the dragon scenes. These two scenes cause most of the amdgpu resets. Also, when playing Kerbal Space Program, the GPU resets when I click on the in-flight sidebar, which shouldn’t be a very GPU-intensive task. I suspect a driver issue, and I hope it is not a hardware issue.

Each time it crashes and resets, it produces the same logs:

May 25 01:11:29 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
May 25 01:11:29 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
May 25 01:11:29 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=8272, emitted seq=8274
May 25 01:11:29 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: Process information: process heaven_x64 pid 6252 thread heaven_x64:cs0 pid 6342
May 25 01:11:29 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
May 25 01:11:31 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
May 25 01:11:31 localhost kernel: [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
May 25 01:11:31 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failure
May 25 01:11:31 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
May 25 01:11:33 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
May 25 01:11:33 localhost kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
May 25 01:11:33 localhost kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
May 25 01:11:33 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
May 25 01:11:33 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
May 25 01:11:33 localhost kernel: [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
May 25 01:11:33 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
May 25 01:11:33 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
May 25 01:11:33 localhost kernel: [drm] DMUB hardware initialized: version=0x08004E00
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
May 25 01:11:34 localhost kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) succeeded!
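
For anyone comparing their own logs against the ones above, here is a rough Python sketch that pulls the ring timeout, the offending process, and the recovery result out of kernel log lines. The regular expressions are written only against the message formats shown in this post, so treat them as assumptions; other kernel versions may word these messages differently. The function name and the SAMPLE list are made up for illustration.

```python
import re

# Patterns derived from the log lines quoted in this post (an assumption:
# other kernel versions may format these messages differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"Process information: process (?P<name>\S+) pid (?P<pid>\d+)"
)
RESET_OK_RE = re.compile(r"GPU reset\(\d+\) succeeded!")


def summarize_resets(lines):
    """Return one dict per ring timeout found in the given log lines."""
    events = []
    current = None
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # A ring timeout starts a new reset event.
            current = {
                "ring": match["ring"],
                "signaled": int(match["signaled"]),
                "emitted": int(match["emitted"]),
                "process": None,
                "pid": None,
                "recovered": False,
            }
            events.append(current)
            continue
        if current is None:
            continue
        match = PROCESS_RE.search(line)
        if match and current["process"] is None:
            current["process"] = match["name"]
            current["pid"] = int(match["pid"])
        elif RESET_OK_RE.search(line):
            # The final "GPU reset(N) succeeded!" line closes the event.
            current["recovered"] = True
            current = None
    return events


# A few lines copied from the log above, with the journal prefix trimmed.
SAMPLE = [
    "amdgpu: ring gfx_0.0.0 timeout, signaled seq=8272, emitted seq=8274",
    "amdgpu: Process information: process heaven_x64 pid 6252 thread heaven_x64:cs0 pid 6342",
    "amdgpu: Ring gfx_0.0.0 reset failure",
    "amdgpu: GPU reset succeeded, trying to resume",
    "amdgpu: GPU reset(2) succeeded!",
]

events = summarize_resets(SAMPLE)
for event in events:
    print(event)
```

Feeding it the output of a kernel log dump (for example, the lines read from a saved text file) should make it easier to see whether the same ring and process show up on every reset.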

Coincidentally, I found that booting with amdgpu.vm_update_mode=3 reduces, but does not completely eliminate, the probability of an amdgpu reset. However, this comes at the cost of frame rendering time. The fact that this kernel parameter lowers the chance of a GPU reset probably points to a problem in GPU memory management.
I wonder if anyone with the same experience or relevant expertise can chime in.
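
To make the amdgpu.vm_update_mode value mentioned above easier to interpret, here is a small Python sketch that decodes it. As far as I understand the amdgpu module parameter description, the value is a bitmask selecting which VMs have their page tables updated by the CPU instead of the GPU (0 = never, 1 = graphics only, 2 = compute only, 3 = both), and a negative value means the driver default. The sysfs path is my assumption about where the live value can be read, and the function names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the live module parameter; it may not exist on
# systems where amdgpu is not loaded.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/vm_update_mode")


def describe_vm_update_mode(value: int) -> str:
    """Decode the vm_update_mode bitmask into a readable description."""
    if value < 0:
        return "driver default"
    targets = []
    if value & 1:  # bit 0: graphics VMs
        targets.append("graphics")
    if value & 2:  # bit 1: compute VMs
        targets.append("compute")
    if not targets:
        return "CPU never updates page tables"
    return "CPU updates page tables for: " + ", ".join(targets)


def read_current_mode(path: Path = PARAM_PATH):
    """Return the current parameter value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = read_current_mode()
    if mode is None:
        print("vm_update_mode could not be read")
    else:
        print(f"vm_update_mode={mode}: {describe_vm_update_mode(mode)}")
```

With the value 3 used here, both graphics and compute page-table updates go through the CPU, which would be consistent with the slower frame times I observed.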

This sounds like a Mesa bug. Can you reproduce it with the latest Mesa? If so, please report it to the Mesa bug tracker.

Yes, I am using Mesa 25.0.4.

It seems like others with different AMD GPUs are reporting the same issue here and here. I guess I’ll just have to wait for it to get the attention of the Mesa developers.

GPU bugs aren’t always the same across GPUs. Sometimes games expose differences from one GPU to another that need workarounds in the codebase.

I suggest you open your own bug report. Bugs are cheap; if it turns out to be the same, the Mesa developers will mark it as a duplicate.