Hi All,
I’m wondering how this is all going?
Since my post a couple of months ago, we’ve seen some Linux kernel (and therefore AMD GPU driver) updates, which seem to have greatly improved graphics stability.
What I am now finding (and I’m not sure if this should be a new thread) is that the AMD GPU driver resets the GPU hardware, closing all application windows and restarting the UI, which in my case is Wayland.
It does seem to be a memory thing, as I mainly work with a browser and a few terminal windows using ssh.
The pattern I’ve seen is this: with a number of browser tabs open, maybe 10-15, and while working via the CLI over ssh to servers, everything suddenly goes black/blank without warning and then returns to either the GDM login or a desktop screen with no application windows.
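One way to test the memory theory would be to log available system memory at intervals and compare the samples with the time of the next reset. The following is only a rough sketch: the function names, the 60-second interval and the sample count are assumptions made for illustration, and it only reads system RAM figures from /proc/meminfo, so it says nothing about GPU-side allocations.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = meminfo.get("MemTotal")
    available = meminfo.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total


def log_memory(path="/proc/meminfo", interval_s=60, samples=3):
    """Print a timestamped memory sample every interval_s seconds (hypothetical helper)."""
    for i in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, info.get("MemAvailable"), available_fraction(info))
        if i + 1 < samples:
            time.sleep(interval_s)
```

Calling `log_memory()` prints wall-clock timestamps, so `dmesg -T` (human-readable timestamps) would make the comparison with the kernel log easier. If available memory is still high when a reset happens, the memory theory becomes less likely.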
dmesg shows:
[653723.632174] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[653897.788061] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[653912.815342] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[654076.162672] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[654091.381333] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[654117.063877] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[654203.486573] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[654462.866202] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[655525.281967] usb 3-1: USB disconnect, device number 5
[655623.531384] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[655703.681285] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[655964.993631] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[655972.063972] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[655981.833375] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656139.458248] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[656139.458509] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[656192.653202] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656319.320017] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656329.733425] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656448.799002] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[656462.800265] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656475.626385] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656513.306617] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656593.816364] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[656651.151653] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[656753.797088] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[656806.629060] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[657086.685545] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[657122.544965] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[657472.055380] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[657644.117357] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[657733.382291] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[657924.159076] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[658116.520939] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[658199.105193] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[658339.210908] i2c_hid_acpi i2c-FRMW0005:00: i2c_hid_get_input: incomplete report (7/65535)
[658356.361229] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[658384.413485] i2c_designware AMDI0010:00: i2c_dw_handle_tx_abort: lost arbitration
[658406.711407] i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
[658641.746664] [drm:gfx_v11_0_priv_reg_irq [amdgpu]] ERROR Illegal register access in command stream
[658641.757000] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=8072741, emitted seq=8072742
[658641.757204] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xwayland pid 41047 thread Xwayland:cs0 pid 41058
[658641.757330] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[658641.839808] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Failed to initialize parser -125!
[658641.913434] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658641.913612] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.028328] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.028442] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.143113] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.143224] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.257835] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.257943] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.372496] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.372619] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.487318] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.487451] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.602141] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.602245] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.716935] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.717039] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658642.831764] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] ERROR MES failed to response msg=3
[658642.831876] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[658643.065000] [drm:gfx_v11_0_hw_fini [amdgpu]] ERROR failed to halt cp gfx
[658643.066478] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[658643.076536] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[658643.077089] [drm] PCIE GART of 512M enabled (table at 0x0000008000500000).
[658643.077281] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming…
[658643.079117] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[658643.080964] [drm] DMUB hardware initialized: version=0x08000500
[658643.086250] [drm] Watermarks table not configured properly by SMU
[658643.535578] [drm] kiq ring mec 3 pipe 1 q 0
[658643.538210] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[658643.538373] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[658643.539035] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[658643.539037] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[658643.539037] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[658643.539038] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[658643.539039] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[658643.539039] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[658643.539040] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[658643.539040] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[658643.539041] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[658643.539041] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[658643.539042] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 1
[658643.539043] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 1
[658643.539043] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[658643.542753] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow start
[658643.542756] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow done
[658643.542759] [drm] Skip scheduling IBs!
[658643.544000] [drm] ring gfx_32777.1.1 was added
[658643.545420] [drm] ring compute_32777.2.2 was added
[658643.547072] [drm] ring sdma_32777.3.3 was added
[658643.547085] [drm] ring gfx_32777.1.1 test pass
[658643.547213] [drm] ring gfx_32777.1.1 ib test pass
[658643.547223] [drm] ring compute_32777.2.2 test pass
[658643.547247] [drm] ring compute_32777.2.2 ib test pass
[658643.547781] [drm] ring sdma_32777.3.3 test pass
[658643.547852] [drm] ring sdma_32777.3.3 ib test pass
[658643.550427] amdgpu 0000:c1:00.0: amdgpu: GPU reset(4) succeeded!
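To keep track of how often this happens and which process was on the ring at the time, the log above can be reduced to one record per reset. This is a minimal sketch, assuming the message wording matches the lines quoted above (other kernel versions may phrase them differently). The function name `summarize_gpu_resets` and the record fields are made up for illustration, and the number in parentheses in the final line is treated as a running reset count, which is my reading of the log rather than something confirmed.

```python
import re

# Patterns are based only on the dmesg lines quoted in this post.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")
TIMEOUT_RE = re.compile(r"ring (\S+) timeout")
PROCESS_RE = re.compile(r"Process information: process (\S+) pid (\d+)")
RESET_DONE_RE = re.compile(r"GPU reset\((\d+)\) succeeded")


def summarize_gpu_resets(lines):
    """Return one dict per amdgpu ring timeout found in the dmesg lines."""
    resets = []
    current = None
    for line in lines:
        ts = TIMESTAMP_RE.match(line)
        stamp = float(ts.group(1)) if ts else None

        m = TIMEOUT_RE.search(line)
        if m:
            # A ring timeout opens a new reset record.
            current = {"start": stamp, "ring": m.group(1), "process": None,
                       "pid": None, "end": None, "count": None}
            resets.append(current)
            continue
        if current is None:
            continue  # ignore unrelated lines (i2c, usb, ...)

        m = PROCESS_RE.search(line)
        if m:
            current["process"] = m.group(1)
            current["pid"] = int(m.group(2))
            continue

        m = RESET_DONE_RE.search(line)
        if m:
            # The "GPU reset(N) succeeded!" line closes the record.
            current["end"] = stamp
            current["count"] = int(m.group(1))
            current = None
    return resets
```

Passing it the lines of saved `dmesg` output gives, for the reset above, ring gfx_0.0.0, process Xwayland (pid 41047) and count 4, with the whole reset taking a little under two seconds.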
It kinda looks like the reset is a planned operation by the driver, but the recovery is a complete failure from a user’s point of view, and it’s very frustrating having to re-open all your windows and log in to everything again.
So before everyone asks: yes, I’ve got the amdgpu kernel flag set, and the BIOS is the latest version. I understand Debian 12 is not “officially” supported by FW, but it’s the Linux kernel, right? Same same but different. Oh, and all Debian security and other updates are installed.
Kernel version:
Linux xxxx 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Running Gnome 45
Debian 12
64GB RAM
Oh, FYI, this happens on the latest BPO kernel too, but battery life there sucks, lasting something like 75% less time, which I guess is to be expected for a non-optimised kernel build.
Thanks for your time, everyone, and keep up the great product, FW people.
Paul.