AMD GPU Crash under heavy load

OS: Arch
Framework 16
AMD Radeon™ RX 7700S
Hello. I’ve been experiencing GPU-related crashes over the last few months. Whenever the GPU load gets high, the entire system freezes, though audio keeps playing for a few minutes before it stops as well. This happens with both Linux-native software and Proton, and it is very consistent: about 10 minutes under load is enough to trigger a crash.

I thought it might be a thermal issue, but I have seen it happen while nvtop reported 80 degrees, so that seems unlikely.
I have tried:

  • Rolling back the amdgpu Linux firmware
  • Switching to the LTS kernel
  • Applying kernel parameters

Looking at the output of journalctl before the crash, I see the following:

Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
Dec 14 13:52:10 soul kernel: pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to get fan speed(PWM)!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
Dec 14 13:52:11 soul kernel: snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
Dec 14 13:52:11 soul kernel: snd_hda_intel 0000:03:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to get fan speed(PWM)!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
Dec 14 13:52:14 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:14 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:17 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:17 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:20 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:20 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:22 soul kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_job_timedout - device unplugged skipping recovery on scheduler:sdma0
Dec 14 13:52:22 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:22 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:25 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:25 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:28 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:28 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:31 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:31 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:33 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE

I’ve followed multiple threads here and on the Arch Linux forum, but none of the suggestions have worked for me.
What I have yet to try is a different operating system, though I would like to avoid that if possible.
Thanks in advance for your insight.

Hi,

It appears to be losing the PCIe link to the GPU.
It might be worth reseating the GPU interposer and checking that none of the 6 screws are loose.

If that does not help, raise a Framework support request via their website.
This forum is mostly other users like you.
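
To check whether the link drop shows up again after reseating, something like the following might help. It is only a rough sketch: it scans kernel log lines for the same pciehp "Link Down" / "Card not present" messages that appear in the log above. The function names are made up for illustration, and reading the previous boot with `journalctl -k -b -1` assumes the journal is stored persistently.

```python
import re
import subprocess

# Matches pciehp hotplug messages such as
# "pcieport 0000:00:01.1: pciehp: Slot(0): Link Down"
PCIEHP_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f:.]+): pciehp: Slot\((?P<slot>\d+)\): "
    r"(?P<event>Link Down|Card not present)"
)


def find_link_loss(lines):
    """Return (port, slot, event) tuples for every pciehp link-loss line."""
    events = []
    for line in lines:
        match = PCIEHP_RE.search(line)
        if match:
            events.append(
                (match.group("port"), match.group("slot"), match.group("event"))
            )
    return events


def previous_boot_kernel_log():
    """Read kernel messages from the previous boot (assumes a persistent journal)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def main():
    """Print every link-loss event found in the previous boot's kernel log."""
    for port, slot, event in find_link_loss(previous_boot_kernel_log()):
        print(f"{port} slot {slot}: {event}")
```

Calling `main()` after a crash and reboot prints one line per event. If it still reports the same port (0000:00:01.1 in the log above), that is worth including in the support request.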

Thanks for the tip. I checked the screws and reseated the GPU for good measure. It has not really helped, I’m afraid.
I’ll contact support.