[RESPONDED] VRAM is lost due to GPU reset! (followed by a crash)

  • Debian Bookworm. Gnome.
  • Batch 11 of Framework Laptop 13 (AMD Ryzen™ 7040 Series)

My Framework laptop keeps crashing. I took a look at journalctl, and it seems to be a problem with the GPU. Have others had this issue? How did you fix it?

Apr 05 01:06:32 df rtkit-daemon[1226]: Supervising 18 threads of 12 processes of 2 users.
Apr 05 01:06:32 df rtkit-daemon[1226]: Supervising 18 threads of 12 processes of 2 users.
Apr 05 01:06:48 df kernel: i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
Apr 05 01:07:02 df rtkit-daemon[1226]: Supervising 18 threads of 12 processes of 2 users.
Apr 05 01:07:02 df rtkit-daemon[1226]: Supervising 18 threads of 12 processes of 2 users.
Apr 05 01:07:12 df kernel: i2c_hid_acpi i2c-FRMW0005:00: failed to set a report to device: -121
Apr 05 01:07:17 df kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=74141, emitted seq=74143
Apr 05 01:07:17 df kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Apr 05 01:07:17 df kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:18 df kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 05 01:07:18 df kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 05 01:07:19 df kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
Apr 05 01:07:19 df kernel: [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
Apr 05 01:07:19 df kernel: [drm] VRAM is lost due to GPU reset!
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
Apr 05 01:07:19 df kernel: [drm] DMUB hardware initialized: version=0x08000500
Apr 05 01:07:19 df kernel: [drm] Watermarks table not configured properly by SMU
Apr 05 01:07:19 df kernel: [drm] kiq ring mec 3 pipe 1 q 0
Apr 05 01:07:19 df kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 1
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 1
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow start
Apr 05 01:07:19 df kernel: amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow done
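
For anyone who wants to check their own journal for the same pattern, here is a minimal sketch of a filter. The function name gpu_reset_events is made up for illustration, and the patterns are simply copied from the messages in the log above, so other kernel versions may word them differently.

```shell
# Keep only the kernel lines that mark an amdgpu ring timeout and the
# reset that follows. Reads journal text on stdin.
gpu_reset_events() {
    grep -E 'ring [a-z0-9_.]+ timeout|GPU reset begin|MODE2 reset|GPU reset succeeded|VRAM is lost'
}

# Example (not run here): scan the kernel log of the previous boot.
# This only works if the journal is stored persistently.
#   journalctl -k -b -1 --no-pager | gpu_reset_events
```

If the filter prints nothing for the boot that crashed, the crash probably has a different cause than the one discussed in this thread.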

Upgrade your GPU firmware.

My firmware appears to have the latest updates. I try to keep on top of it, but maybe I’m missing something:

$ fwupdmgr get-updates
Devices with no available firmware updates: 
 • Fingerprint Sensor
 • UEFI dbx
 • WD BLACK SN850X 1000GB
Devices with the latest available firmware version:
 • System Firmware
No updates available

I had this issue as well.

I’d first try upgrading the GPU firmware, as Mario suggested: extract the amdgpu folder of upstream linux-firmware into /lib/firmware, then regenerate the initramfs with sudo update-initramfs -c -k $(uname -r).
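
A rough sketch of the firmware step, in case it helps. The function name install_amdgpu_firmware and the checkout location ./linux-firmware are placeholders, and the clone URL is the kernel.org linux-firmware repository, which I am assuming is the upstream meant here. Both paths are arguments so the copy can be tried on a scratch directory before touching /lib/firmware.

```shell
# Copy the amdgpu firmware files from a linux-firmware checkout into a
# firmware directory.
install_amdgpu_firmware() {
    src=$1    # linux-firmware checkout, e.g. ./linux-firmware (placeholder)
    dest=$2   # firmware root, e.g. /lib/firmware
    [ -d "$src/amdgpu" ] || { echo "no amdgpu folder in $src" >&2; return 1; }
    mkdir -p "$dest/amdgpu" && cp -- "$src"/amdgpu/* "$dest/amdgpu/"
}

# Example (not run here; needs network access and a root shell):
#   git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
#   install_amdgpu_firmware ./linux-firmware /lib/firmware
#   update-initramfs -c -k "$(uname -r)"
```

Files copied this way are not tracked by apt, so a later update of firmware-amd-graphics can overwrite them.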

Myself, I ultimately had to upgrade Mesa (by apt-pinning trixie and installing from there) to fix it, which caused enough dependency conflicts with other packages that I ended up upgrading to Debian Trixie fully. Setting the kernel parameter amdgpu.sg_display=0 and the GPU mode to UMA_GAME_OPTIMIZED in BIOS settings also helped.
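
For the kernel parameter, here is a minimal sketch that assumes the machine boots through GRUB and that /etc/default/grub has a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, as on a default Debian install. The helper name add_kernel_param is made up. It takes the file as an argument so the edit can be tried on a copy first, and it relies on GNU sed for the -i option.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already there. The parameter must not contain "/" or "&", because it
# is pasted into a sed replacement.
add_kernel_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}

# Example (not run here; needs a root shell):
#   add_kernel_param /etc/default/grub amdgpu.sg_display=0
#   update-grub
# After a reboot, the active parameters can be checked with:
#   cat /proc/cmdline
```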

Thanks for the tips and letting me know others are having this issue!

Would you happen to know if there is a way to compare my current firmware version with the version available in the link you provided?

I’m not sure. You can check the version of the equivalent Debian package, firmware-amd-graphics, with apt info firmware-amd-graphics - should be 20230210 on bookworm or 20230625 on trixie - and if necessary downgrade back to Debian’s version with sudo apt install --reinstall firmware-amd-graphics.
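
One possible way to compare, sketched under the assumption that both the Debian package version and the upstream linux-firmware release start with an eight-digit date stamp (as in 20230210 above). The helper names fw_datestamp and fw_is_older are made up for illustration.

```shell
# Print the leading 8-digit date stamp of a version string such as
# "20230210-5"; prints nothing if the string does not start with one.
fw_datestamp() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9]\{8\}\).*/\1/p'
}

# Succeeds when the first version's date stamp is older than the second's.
fw_is_older() {
    [ "$(fw_datestamp "$1")" -lt "$(fw_datestamp "$2")" ]
}

# Example (not run here): compare the installed package with the trixie
# version mentioned above.
#   installed=$(dpkg-query -W -f='${Version}' firmware-amd-graphics)
#   fw_is_older "$installed" 20230625 && echo "installed firmware is older"
```

This only compares the package version. If firmware files were copied into /lib/firmware by hand, the package version no longer says what is actually loaded.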

Can you guys please get a bug filed to fix this in Debian stable? This keeps coming up and they’re doing nothing about it.

I’ll try to put a bug report together and post the link when I’m done.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1068467

Thank you for helping us by helping Mario by filing the bug report, @northivanastan - much appreciated.

@grant2 (or anyone else with this problem): could you upgrade your BIOS to version 3.05 and see if the GPU resets still occur? I’ve done so myself and disabled a few other workarounds I had in place (aside from upgraded mesa) and I haven’t been able to replicate the bug yet.
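
To see which BIOS version is currently installed before deciding, something like the sketch below may help. It assumes the firmware reports its version in the same dotted format on both sides of the comparison (for example 03.05, which is my guess at the format) and that GNU sort with -V is available, as it is on Debian. The helper name version_ge is made up.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Both strings should use the same format (e.g. both with a leading zero).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example (not run here): read the running BIOS version from sysfs.
#   current=$(cat /sys/class/dmi/id/bios_version)
#   version_ge "$current" 03.05 && echo "BIOS is already at 3.05 or newer"
```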

Thank you for bringing this to my attention.

I am going to wait for the BIOS upgrade to get mainlined and integrated into some kind of process (such as apt or fwupdmgr) instead of doing a manual upgrade. I’m trying to use my machine in a production setting, and I want to fiddle with it as little as possible.

That said, I did adjust the BIOS settings so that GPU mode now has the value UMA_GAME_OPTIMIZED. However, I have not had a chance to have an hour-long conference call to see if my machine dies.

As an update, toggling the BIOS setting so that GPU mode has the value UMA_GAME_OPTIMIZED did not solve my problem. My computer still dies after about 40 minutes of being on a conference call.

I am going to take another look at the BIOS update mentioned by @northivanastan. I will provide further updates about the success or failure of that approach.

Unfortunately, Debian wouldn’t update stable for a bugfix like this, only for security bugs. In Debian, “stable” means “not changing”, not “not buggy”. So stable doesn’t get new packages; it gets security patches backported to old packages as needed.

A better approach would be to see whether a newer mesa package could be backported to stable and then offered for upload to https://backports.debian.org/

My GNOME session crashed, and I am wondering if it is related to this bug. I was able to press Ctrl+Alt+F1 (C-M F1) to log out and switch users to check out the log file:

Apr 30 15:33:53 df gnome-shell[2775]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
Apr 30 15:35:48 df gnome-shell[2775]: Window manager warning: last_user_time (142244507) is greater than comparison timestamp (142244506).  This most likely represents a buggy client sending inaccurate timestamps in messages such as _NET_ACTIVE_WINDOW.  Trying to work around...
Apr 30 15:35:48 df gnome-shell[2775]: Window manager warning: W1016 appears to be one of the offending windows with a timestamp of 142244507.  Working around...
Apr 30 15:44:38 df gnome-shell[2775]: Window manager warning: WM_TRANSIENT_FOR window 0x81032b for 0x81034b window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x80004c.
Apr 30 15:44:40 df gnome-shell[2775]: Window manager warning: Window 0x81035a sets an MWM hint indicating it isn't resizable, but sets min size 1 x 1 and max size 2147483647 x 2147483647; this doesn't make much sense.
Apr 30 15:44:40 df gnome-shell[2775]: Window manager warning: Window 0x81035a sets an MWM hint indicating it isn't resizable, but sets min size 1 x 1 and max size 2147483647 x 2147483647; this doesn't make much sense.
Apr 30 15:47:23 df gnome-shell[2775]: Window manager warning: WM_TRANSIENT_FOR window 0x810a96 for 0x810ad3 window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x80004c.
Apr 30 15:47:23 df gnome-shell[2775]: Window manager warning: WM_TRANSIENT_FOR window 0x810a96 for 0x810adf window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x80004c.
Apr 30 15:47:24 df gnome-shell[2775]: Window manager warning: WM_TRANSIENT_FOR window 0x810a96 for 0x810aed window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x80004c.
Apr 30 15:48:32 df gnome-shell[2775]: amdgpu: amdgpu_cs_query_fence_status failed.
Apr 30 15:48:32 df gnome-shell[17171]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
Apr 30 15:48:32 df gnome-shell[17171]: amdgpu: The process will be terminated.
Apr 30 15:48:32 df gnome-shell[2775]: Connection to xwayland lost
Apr 30 15:48:32 df gnome-shell[2775]: X Wayland crashed; attempting to recover
Apr 30 15:48:32 df systemd[2613]: Stopped target gnome-session-x11-services-ready.target - GNOME session X11 services.
░░ Subject: A stop job for unit UNIT has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A stop job for unit UNIT has finished.
░░ 
░░ The job identifier is 1197 and the job result is done.
Apr 30 15:48:32 df systemd[2613]: Stopping org.gnome.SettingsDaemon.XSettings.service - GNOME XSettings service...
░░ Subject: A stop job for unit UNIT has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A stop job for unit UNIT has begun execution.
░░ 
░░ The job identifier is 1198.
Apr 30 15:48:32 df gnome-shell[2775]: Using public X11 display :0, (using :1 for managed services)
Apr 30 15:48:32 df gnome-shell[2775]: amdgpu: amdgpu_cs_query_fence_status failed.
Apr 30 15:48:32 df gnome-shell[2775]: amdgpu: The CS has been rejected (-125). Recreate the context.
Apr 30 15:48:32 df gnome-shell[2775]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
Apr 30 15:48:32 df gnome-shell[2775]: amdgpu: The process will be terminated.
Apr 30 15:48:32 df systemd[2613]: org.gnome.SettingsDaemon.XSettings.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit UNIT has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 30 15:48:32 df nautilus[124600]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df gnome-clocks[43651]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df gnome-calendar[3559]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df gnome-terminal-[8989]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df xdg-desktop-por[3341]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df kdeconnectd[3150]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df evolution-alarm[3084]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df gsd-keyboard[3059]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df gsd-wacom[3083]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df xdg-desktop-por[3306]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df unknown[3047]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df gsd-power[3062]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df unknown[3061]: Error reading events from display: Broken pipe
Apr 30 15:48:32 df systemd[2613]: org.gnome.SettingsDaemon.Color.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit UNIT has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.