OS: Arch
Hardware: Framework 16, AMD Radeon™ RX 7700S
Hello. I’ve been experiencing GPU-related crashes over the last few months. Whenever the GPU load gets high, the entire system freezes, though audio keeps playing for a few minutes before it stops too. This happens with both native Linux applications and Proton, and it is very consistent: about 10 minutes of GPU work results in a crash.
I thought it might be a thermal issue, but I have observed it happen while nvtop reported 80 degrees, so that seems unlikely.
I have tried:
- Rolling back the linux-firmware amdgpu package
- Switching to the LTS kernel
- Applying kernel parameters
Looking at the output of journalctl before the crash, I see the following:
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
Dec 14 13:52:10 soul kernel: pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to get fan speed(PWM)!
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:10 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:10 soul kernel: snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
Dec 14 13:52:11 soul kernel: snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
Dec 14 13:52:11 soul kernel: snd_hda_intel 0000:03:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to get fan speed(PWM)!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Dec 14 13:52:11 soul kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
Dec 14 13:52:14 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:14 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:17 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:17 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:20 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:20 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:22 soul kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_job_timedout - device unplugged skipping recovery on scheduler:sdma0
Dec 14 13:52:22 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:22 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:25 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:25 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:28 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:28 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:31 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Dec 14 13:52:31 soul kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
Dec 14 13:52:33 soul kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
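Since the log above is mostly the same few messages repeated, here is a rough Python sketch that collapses repeats into one line per distinct message, with a count and the time it first appeared. This is only an illustration: the script name and the prefix regex are my own assumptions, and the regex only matches the default journalctl line format shown above ("Mon DD HH:MM:SS host kernel: ...").

```python
import re
import sys
from collections import Counter

# Assumed prefix format, taken from the log lines above:
# "Dec 14 13:52:10 soul kernel: <message>"
PREFIX = re.compile(r"^\w{3}\s+\d+\s+(\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+")


def summarize(lines):
    """Return (first_seen_time, count, message) per distinct kernel message,
    in the order each message first appeared. Non-kernel lines are skipped."""
    counts = Counter()
    first_seen = {}  # dicts keep insertion order, so this preserves first-seen order
    for line in lines:
        match = PREFIX.match(line)
        if not match:
            continue
        message = line[match.end():].strip()
        counts[message] += 1
        first_seen.setdefault(message, match.group(1))
    return [(first_seen[msg], counts[msg], msg) for msg in first_seen]


# Only read stdin when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for when, count, message in summarize(sys.stdin):
        print(f"{when}  x{count:<3} {message}")
```

Saved as summarize_klog.py (a made-up name), it could be fed the kernel log of the previous boot with: journalctl -k -b -1 | python summarize_klog.py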
I’ve tried following multiple threads I have seen here and on the Arch Linux forum, but none of the suggestions have really worked for me.
What I have yet to try is a different operating system, though I would like to avoid that if possible.
Thanks in advance for your insight.