I have 64 GB of official Framework RAM, with 4 GB allocated to the AMD GPU (via the UMA optimized option in the BIOS; I don’t remember the exact name anymore).
USB-C dock: ThinkPad USB-C Dock Gen 2, with one 1920x1080@60Hz screen connected over HDMI.
I was just browsing the web with about 30 tabs open when the laptop froze a few times. The GPU recovered at first, but then I tried to play a video fullscreen and it froze completely; no GPU recovery seemed to be attempted after that.
kwin_wayland_drm: Pageflip timed out! This is a kernel bug
I have more logs on my phone, because some were never written to disk: I had to hard shut down the laptop since I had no SSH session running and the keyboard was unresponsive. I will enable the SysRq keys and hope I can reproduce it next time.
Switching TTY did work, and I saw a lot of DRM errors (flip_done timed out, and CRTC:79:crtc-01/CONNECTOR:93eDP-1/PLANE:50:plane-3: commit wait timed_out), but neither the built-in keyboard nor an external USB one was usable. It wasn’t that they didn’t work at all; the laptop was just extremely slow: if I typed one key, the letter appeared 10 minutes later.
Unplugging or replugging the USB-C dock did not change anything during the issue.
I’m very concerned about this issue because I have no idea how to reproduce it.
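For the SysRq plan: the kernel.sysrq sysctl takes a bitmask, where 1 enables every function. As a sketch, assuming the bit values listed in the kernel’s admin-guide SysRq documentation, this computes the narrower value needed for the classic Alt+SysRq R-E-I-S-U-B emergency reboot, which might have avoided the hard shutdown (S syncs the filesystems, so pending logs could get written). The dictionary keys are my own labels, not kernel names.

```python
# Bit values for /proc/sys/kernel/sysrq as listed in the kernel's
# admin-guide SysRq documentation (a value of 1 enables every function).
# The dictionary keys are illustrative labels, not kernel identifiers.
SYSRQ_BITS = {
    "loglevel": 2,      # control of console logging level
    "keyboard": 4,      # control of keyboard (SAK, unraw)
    "dumps": 8,         # debugging dumps of processes etc.
    "sync": 16,         # sync command
    "remount_ro": 32,   # remount read-only
    "signal": 64,       # signalling of processes (term, kill, oom-kill)
    "reboot": 128,      # reboot/poweroff
    "nice_rt": 256,     # nicing of all RT tasks
}

# Keys of the R-E-I-S-U-B sequence mapped to the bit each one needs.
REISUB = {
    "R": "keyboard",    # unRaw: take the keyboard back from the compositor
    "E": "signal",      # tErminate all processes
    "I": "signal",      # kIll all processes
    "S": "sync",        # Sync filesystems
    "U": "remount_ro",  # remoUnt filesystems read-only
    "B": "reboot",      # reBoot immediately
}


def sysrq_mask(functions):
    """OR together the bits for the given function labels."""
    mask = 0
    for name in functions:
        mask |= SYSRQ_BITS[name]
    return mask


if __name__ == "__main__":
    value = sysrq_mask(set(REISUB.values()))
    print(f"kernel.sysrq = {value}")  # prints: kernel.sysrq = 244
```

The resulting value can be set through the kernel.sysrq sysctl (on NixOS, via boot.kernel.sysctl); using 1 instead is simpler but also enables process dumps and the other functions.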
Tagging:
@Mario_Limonciello: if you know whether the next BIOS release bumps any of the bundled firmware, or whether a similar bug has already been reported upstream, please let us know.
@Kieran_Levin: please share any news on BIOS release 3.0.7. I won’t be testing the beta, however.
@Matt_Hartley: could Framework work more closely with AMD on squashing these edge cases? Is there anything we as users could do to help pin down these issues, given how hard they are to reproduce?
I have now rebooted into kernel 6.13.4 #1-NixOS SMP PREEMPT_DYNAMIC Fri Feb 21 13:11:21 UTC 2025 x86_64 GNU/Linux
linux-firmware: /nix/store/n58qg129fcnbp52gwblp2rg00p1v0z3x-linux-firmware-20250211-zstd
[11493.749110] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[11493.751769] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[11493.761816] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=2246503, emitted seq=2246505
[11493.761819] amdgpu 0000:c1:00.0: amdgpu: Process information: process .firefox-wrappe pid 3642 thread .firefox-w:cs0 pid 3730
[11493.761822] amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[11495.765672] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[11495.765679] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[11495.765852] amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failure
[11495.765854] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[11495.766076] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[11497.977385] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[11497.977401] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[11498.246347] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[11498.248326] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[11498.285771] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[11498.286476] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[11498.286560] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[11498.289240] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[11498.296513] [drm] DMUB hardware initialized: version=0x08004B01
[11498.957456] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[11498.957464] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[11498.957466] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[11498.957469] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[11498.957470] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[11498.957472] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[11498.957473] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[11498.957474] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[11498.957476] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[11498.957478] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[11498.957480] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[11498.957481] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[11498.957483] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[11498.959553] amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) succeeded!
[11499.028123] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
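Since reproducing this is so hard, it may help to pull the relevant amdgpu events out of a saved kernel log automatically, so every report carries the same key facts: which ring timed out, how many fences were still pending (emitted minus signaled seq, 2 in the excerpt above), which process submitted the work, and how the reset went. A rough sketch, assuming the message formats match the excerpt above; the regular expressions and the summarize helper are my own and only cover those lines.

```python
import re

# Patterns written against the amdgpu messages in the excerpt above;
# other kernel versions may word these lines differently.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>[\d.]+)\].*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(r"Process information: process (?P<proc>\S+) pid (?P<pid>\d+)")


def _message(line):
    """Drop the leading '[timestamp] ' prefix and return the rest of the line."""
    return line.split("] ", 1)[-1].strip()


def summarize(lines):
    """Group GPU ring timeouts with the process and reset messages that follow."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            events.append({
                "time": float(m["ts"]),
                "ring": m["ring"],
                # Fences emitted to the ring but not signaled when it timed out.
                "pending": int(m["emitted"]) - int(m["signaled"]),
                "process": None,
                "resets": [],
            })
            continue
        if not events:
            continue  # ignore anything before the first timeout
        m = PROCESS_RE.search(line)
        if m:
            events[-1]["process"] = (m["proc"], int(m["pid"]))
        elif "reset" in line.lower():
            events[-1]["resets"].append(_message(line))
    return events
```

Usage: run it over saved dmesg output, or over journalctl -k -b -1 -o short-monotonic for the previous boot (that output format also uses bracketed timestamps), provided the journal was flushed to disk before the freeze.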
EDIT (02/03/2025):
Added KDE Plasma version
Added an observation of slowdown in Firefox when sweeping (quickly hovering across multiple YouTube tabs)
Not able to reproduce yet.
I suspect Firefox’s tab preview on pinned YouTube tabs. Sweeping the cursor across many of them very quickly slows the laptop down quite a bit, as if it is struggling to render frames, but only in the Firefox window. It doesn’t trigger the freeze, though; I haven’t been able to reproduce that yet.
Without the preview, sweeping across the tabs causes no slowdown (the tabs are just highlighted).
The slowdown I can reproduce only happens in Firefox (whether or not a video is playing); the other windows keep refreshing properly.
I can’t reproduce it on non-pinned tabs, but I haven’t tried unpinning the YouTube tabs and reproducing the slowdown that way. Nothing shows up in dmesg. I’ll try launching Firefox from the command line to see if it prints anything. I also have some Firefox crash dumps, but no idea how to investigate them.
I’ll also check whether a Chromium-based browser behaves the same way with YouTube tabs and its tab preview feature, sweeping the cursor across multiple tabs.
Have you tried adding amdgpu.dcdebugmask=0x10 as a kernel parameter? I’ve needed it on basically every kernel since (I think) something changed in 6.10 or 6.11, to stop completely random errors that slow the laptop’s input to a crawl until I either reboot or force an AMD GPU reset.
Edit: some bug reports that have been filed about the PSR issue:
I’ve actually had more luck with amdgpu.dcdebugmask=0x12 (2+ weeks without the bug so far). If I recall correctly, I tried just 0x10 first and something was still broken.
PS: just to demystify the number a little, it comes from this struct in the kernel code and is a sum of hex flag values. 0x12 (0x10 + 0x02) disables both memory stutter mode and PSR, while 0x10 disables only PSR.
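To make the hex sum concrete, here is a small decoder. It only knows the two flags mentioned above: PSR (0x10) and memory stutter (0x02, the difference between 0x12 and 0x10). The constant names follow the DC_DEBUG_MASK enum in the kernel’s amd_shared.h as I remember it, so treat them as an assumption and check them (and any other bits) against your kernel’s source.

```python
# Only the two flags discussed in this thread; the kernel's enum
# (DC_DEBUG_MASK in amd_shared.h) defines more bits than these.
DC_DISABLE_STUTTER = 0x02
DC_DISABLE_PSR = 0x10

FLAG_NAMES = {
    DC_DISABLE_STUTTER: "memory stutter mode disabled",
    DC_DISABLE_PSR: "PSR (panel self refresh) disabled",
}


def decode_dcdebugmask(value):
    """Return (descriptions of known flags set in value, leftover unknown bits)."""
    known = [desc for bit, desc in sorted(FLAG_NAMES.items()) if value & bit]
    unknown = value & ~(DC_DISABLE_STUTTER | DC_DISABLE_PSR)
    return known, unknown


if __name__ == "__main__":
    for mask in (0x10, 0x12):
        flags, rest = decode_dcdebugmask(mask)
        print(hex(mask), flags, f"unknown bits: {hex(rest)}")
```

Because each flag is a distinct bit, "sum" and bitwise OR give the same result here, which is why 0x10 + 0x02 and 0x10 | 0x02 are both 0x12.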
I do have the kernel parameter. To be precise, here is my kernel command line: amdgpu.dcdebugmask=0x10 amd_pstate=active rtc_cmos.use_acpi_alarm=1 loglevel=4
And I did have it on kernel 6.13.1 when I triggered the issue.
I’ll try to reproduce the issue with amdgpu.dcdebugmask=0x12
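When switching between 0x10 and 0x12, it is easy to lose track of which value a given boot actually used, and the two differ only in the stutter bit. A minimal sketch that parses /proc/cmdline and reports both bits; the helper names are hypothetical, and it does not handle quoted values containing spaces.

```python
from pathlib import Path

STUTTER_BIT = 0x02  # the bit that separates 0x12 from 0x10 in this thread
PSR_BIT = 0x10


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def dcdebugmask(params):
    """Return amdgpu.dcdebugmask as an int (hex or decimal), or None if unset."""
    raw = params.get("amdgpu.dcdebugmask")
    return int(raw, 0) if raw is not None else None


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        mask = dcdebugmask(parse_cmdline(cmdline.read_text()))
        if mask is None:
            print("amdgpu.dcdebugmask is not set")
        else:
            print(f"dcdebugmask={hex(mask)} PSR off: {bool(mask & PSR_BIT)}, "
                  f"stutter off: {bool(mask & STUTTER_BIT)}")
```

On the command line quoted above this reports PSR off but stutter still on, which matches the difference between the two suggestions.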