AMD Framework 13 with Debian still crashing a year later

Hi,
I’m posting this as a sort of meta issue, since I can’t really pin it down to a single problem or cause. I’ll try to keep this updated, since the Debian 13 release is getting closer (the first freeze stage starts in a month) and more people are likely to upgrade and perhaps run into the same issues.

The past year

I got the laptop last year around this time, and although I’m happy with the device overall, the GPU keeps giving me trouble, seemingly with no end in sight.

It started very badly: weekly crashes that led to filesystem corruption and an unbootable system. But it turned out the problems were caused by old amdgpu firmware in Debian 12. After sorting that out and updating to BIOS 3.05, it got much better, so I summarized the important steps regarding firmware updates on the Debian wiki.

Then I had about 6 perfect months with no crashes (which suggests my hardware is unlikely to be defective; memtest86+ also finishes clean) – until one day the machine froze again. And then again a few weeks later. I must have done some partial upgrades through backports in the meantime (I don’t remember for sure), possibly introducing a regression. As the hardware is still relatively new, I thought more bugs were probably being resolved, and it would be better to do a full upgrade.

So a month ago I updated to Debian 13 (trixie / testing, kernel 6.12.12, amdgpu firmware 20241210, Mesa 24.3.4-3, Xfce 4.20, xserver-xorg 1:7.7+24) – and it got even worse.

Current problems

Sometimes it takes days between crashes, but on the other extreme, two days ago I experienced 3 to 4 different types of crashes in a single day:

  • sudden freeze: happens most often while watching video or when the machine is under load; the screen simply freezes (in one case followed by image corruption – a few large blocks of the now static image were moved to a different place). Only the mouse cursor still moves. As if to make a point, I got one as I was preparing this post – luckily saved in a text editor and not in a web form. :)

  • black screen: after blindly typing my password into xscreensaver while an external display wakes up, I find that only the cursor is visible (and still moves) and everything else is black. Possibly the same as “sudden freeze”, but happening while the screen is locked and therefore black? I ran umrgui over SSH and the only thing out of the ordinary was on the Buffer Objects tab: normally it shows around 4 copies of my desktop and one smaller bitmap with the cursor, but now it had only the desktop bitmap, also black. Triggering amdgpu_gpu_recover did not help (after recovery the screen stayed black).

  • white screen: external screen is frozen, laptop screen turned white. Happened to me only once, 40 minutes after I rebooted the laptop from the “black screen” freeze. Unlike the “black screen” crash, triggering GPU recovery did help and everything worked fine after that.

  • HW accelerated video playback (4K AV1, possibly others) now also results in a GPU hang – here I wanted to write “but it automatically recovers a few seconds later”, but when I went to confirm that, the recovery triggered but failed to recover. So “usually recovers”, but can also hang completely.

  • Apart from all that, I found out that my NVMe drive also randomly drops PCIe speed a few days after boot (sudo lspci -vv -s 02:00.0|grep LnkSta: shows LnkSta: Speed 2.5GT/s (downgraded), Width x2 (downgraded)). But that may or may not be an issue on the drive’s side, I have no way to tell.

Workarounds

Starting with the SSD issue, I found that instead of a full reboot, you can force the link speed to be renegotiated by resetting only the NVMe drive:

# cd /sys/devices/pci0000:00/0000:00:02.4/0000:02:00.0/
# echo bus > reset_method
# echo 1 > reset

I did not find any less “brute force” way to do it, but it seems to work well for me with no side effects (though your mileage may vary – I have no way to tell if all drives can take this gracefully). If it works well for you, you could set it up as an hourly cron job. (Don’t forget to change the PCI device path if you came here through search and have a different laptop.)
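For the cron idea, a minimal sketch of a conditional reset could look like the following. It only touches the drive when lspci reports the link as downgraded. The function names link_is_downgraded and reset_if_downgraded are made up for illustration, the PCI address and sysfs path are simply the ones from the commands above, and the whole thing assumes it runs as root on a drive that tolerates a bus reset.

```shell
#!/bin/sh
# Hypothetical helper: prints "yes" if an lspci LnkSta line
# reports a downgraded link speed or width, "no" otherwise.
link_is_downgraded() {
    case "$1" in
        *"(downgraded)"*) echo yes ;;
        *) echo no ;;
    esac
}

# Hypothetical helper: reset the device only when its link is downgraded.
#   $1 = PCI address as used by lspci (e.g. 02:00.0)
#   $2 = sysfs directory of the same device
reset_if_downgraded() {
    dev="$1"
    sysfs="$2"
    # "LnkSta:" with the colon matches the status line but not LnkSta2
    status=$(lspci -vv -s "$dev" | grep 'LnkSta:')
    if [ "$(link_is_downgraded "$status")" = yes ]; then
        echo bus > "$sysfs/reset_method"
        echo 1 > "$sysfs/reset"
    fi
}
```

Called from a root cron job, this would be something like reset_if_downgraded 02:00.0 /sys/devices/pci0000:00/0000:00:02.4/0000:02:00.0 (again, adjust both arguments for your own device).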

For the GPU side, I tried some older known workarounds (booting with amdgpu.sg_display=0 and setting VRAM allocation to “gaming mode”), with no difference.

Sometimes the automatic GPU recovery kicks in (mainly with the accelerated video crash), but if it doesn’t, you may try to trigger it manually (as root) over SSH, or perhaps by switching to TTY (if it still works):
# cat /sys/kernel/debug/dri/0000\:c1\:00.0/amdgpu_gpu_recover

If it does not help, kiss your unsaved data goodbye and reboot (having all sysrq flags required for SysRq+REISUB enabled may be helpful – it’s a safer way to hard-reboot than just holding the power button).
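Whether all of REISUB actually works depends on the value in /proc/sys/kernel/sysrq. A rough sketch of a check, based on the bitmask values listed in the kernel’s SysRq documentation (the function name sysrq_allows_reisub is hypothetical):

```shell
#!/bin/sh
# Prints "yes" if the given kernel.sysrq value permits every key in REISUB.
sysrq_allows_reisub() {
    value="$1"
    #   4 = keyboard control (R, unraw)
    #  64 = signalling of processes (E = term, I = kill)
    #  16 = sync (S)
    #  32 = remount read-only (U)
    # 128 = reboot/poweroff (B)
    needed=$((4 + 64 + 16 + 32 + 128))
    # A value of exactly 1 means "all functions enabled";
    # anything larger is treated as a bitmask.
    if [ "$value" -eq 1 ] || [ $((value & needed)) -eq "$needed" ]; then
        echo yes
    else
        echo no
    fi
}
```

Run it as sysrq_allows_reisub "$(cat /proc/sys/kernel/sysrq)". With a value such as 438 it prints "no", because bit 64 (needed for E and I) is missing; 502 or 1 would print "yes".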

Trying to reset the GPU the same way as the SSD just breaks things even further, so do not bother. :)

Fixes and outlook

I found several threads with similar problems (one even popped up today, before I finished this post, and the log seems similar to the “white screen” freeze), but they rarely reach a conclusion, or end up suspecting a hardware issue that is never confirmed. The only fix I can think of is downgrading to an older kernel or firmware release, but with time between crashes ranging from minutes to weeks, it’s pretty hard to conclusively find a combination that works (and won’t be obsolete in a few months due to lack of security updates).

Although most of the bugs seem to be related to the GPU and I have a netconsole pointed to another machine, I have no idea if the cause lies with the kernel, amdgpu driver, Mesa, BIOS, or something else. So while I would like to at least report the issues, I’m not sure where to send my logs (except perhaps here – see following post). I also have umr installed, but apart from running umrgui and looking at pretty graphs I don’t really know how to get anything useful from it. If anyone could give me some pointers in this regard, I can try to collect more data when the next crash comes.
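To make the netconsole capture easier to share, a small sketch for trimming it after a crash might help. It drops the systemd-journald “Failed to chase symlinks” lines that flood the capture during a GPU reset and keeps only a chosen range of kernel timestamps. The function names strip_journald_noise and log_window, and the file name netconsole.log in the usage line, are assumptions for illustration.

```shell
#!/bin/sh
# Drop the repetitive journald sd-device lines from a kernel log on stdin.
strip_journald_noise() {
    grep -v 'sd-device: Failed to chase symlinks'
}

# Keep only lines whose leading [seconds.micros] timestamp lies
# between $1 and $2 (inclusive). Lines are printed unmodified.
log_window() {
    awk -v lo="$1" -v hi="$2" '{
        t = $0
        sub(/^\[ */, "", t)   # strip the opening bracket and padding
        sub(/\].*/, "", t)    # strip everything from the closing bracket on
        if (t + 0 >= lo + 0 && t + 0 <= hi + 0) print
    }'
}
```

For example, strip_journald_noise < netconsole.log | log_window 45930 45950 would keep just the lines around a hang that started at about 45934 seconds of uptime.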

I generally try to be patient with Linux hardware support (especially considering Debian’s “always out of date” status), but at this point the 7640U/7840U is almost two years old, and even with a new kernel and firmware it’s seemingly not getting any better. Is there an expectation that the iGPU will eventually be stable, or is the platform inherently “fragile” in some way that makes it hard to add support for newer platforms (these days probably Zen 5) without breaking the older ones?

Thanks!


Logs

Splitting logs to separate posts, as it’s a lot of text that makes the submission form unhappy (“An error occurred: Body is limited to 32000 characters; you entered 50929.”)

I had a netconsole pointed to a file on another machine, but I only managed to match up the HW accelerated video crash and the “white screen” crash with anything useful (I did not think of cutting out relevant data after each crash so I had to look for it based on timestamps):

  • accelerated video crash:
[45934.185818] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[45934.190090] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.190144] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.190201] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[45934.190275] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 timeout, signaled seq=211118, emitted seq=211120
[45934.190283] amdgpu 0000:c1:00.0: amdgpu: Process information: process mpv pid 2002076 thread mpv:cs0 pid 2002089
[45934.190291] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[45934.190660] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.190728] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.190962] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.191130] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.191242] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.191265] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.246493] systemd-journald[441]: Successfully sent stream file descriptor to service manager.
[45934.512347] [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002n
[45934.702777] [drm] Register(0) [regUVD_RB_RPTR] failed to reach value 0x00000080 != 0x00000000n
[45934.893090] [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002n
[45934.897886] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[45934.903262] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.903521] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.928678] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[45934.929874] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.929887] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[45934.929904] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.929929] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[45934.930088] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.930127] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.930480] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[45934.934018] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.934087] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.934529] gmc_v11_0_process_interrupt: 21 callbacks suppressed
[45934.934534] amdgpu 0000:c1:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:40 vmid:3 pasid:32782)
[45934.934551] amdgpu 0000:c1:00.0: amdgpu:  in process mpv pid 2002076 thread mpv:cs0 pid 2002089)
[45934.934559] amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x0000000000fff000 from client 18
[45934.934566] amdgpu 0000:c1:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00343850
[45934.934572] amdgpu 0000:c1:00.0: amdgpu: 	 Faulty UTCL2 client ID: unknown (0x1c)
[45934.934578] amdgpu 0000:c1:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[45934.934583] amdgpu 0000:c1:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[45934.934589] amdgpu 0000:c1:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
[45934.934594] amdgpu 0000:c1:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[45934.934599] amdgpu 0000:c1:00.0: amdgpu: 	 RW: 0x1
[45934.934872] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.934900] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.935025] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.935048] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45934.935162] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45934.936082] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45935.125588] amdgpu 0000:c1:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_unified_0 test failed (-110)
[45935.125719] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v4_0> failed -110
[45935.125843] amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) failed
[45935.125879] amdgpu 0000:c1:00.0: amdgpu: GPU reset end with ret = -110
[45935.125896] amdgpu 0000:c1:00.0: amdgpu: GPU Recovery Failed: -110
[45935.126151] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45935.126215] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45935.127002] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45935.127025] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45935.657786] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45935.785768] [drm] Fence fallback timer expired on ring sdma0
[45936.135726] [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002n
[45936.169771] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45936.326113] [drm] Register(0) [regUVD_RB_RPTR] failed to reach value 0x00000040 != 0x00000000n
[45936.516520] [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002n
[45936.681842] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45937.193803] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45937.709801] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45938.218015] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45938.729799] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45939.242503] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45939.753781] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45940.265803] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45940.777956] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45941.289792] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45941.801954] [drm] Fence fallback timer expired on ring gfx_0.0.0
[45945.193838] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[45945.194471] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[45945.194489] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 timeout, signaled seq=211120, emitted seq=211120
[45945.194497] amdgpu 0000:c1:00.0: amdgpu: Process information: process mpv pid 2002076 thread mpv:cs0 pid 2002089
[45945.194505] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[45945.198303] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45945.198378] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45945.198817] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45945.199243] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[45945.199618] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[45945.199670] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[46158.954022] INFO: task kworker/u48:2:1883862 blocked for more than 120 seconds.
[46158.954043]       Tainted: G        W          6.12.12-amd64 #1 Debian 6.12.12-1
[46158.954050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46158.954055] task:kworker/u48:2   state:D stack:0     pid:1883862 tgid:1883862 ppid:2      flags:0x00004000
[46158.954067] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[46158.954084] Call Trace:
[46158.954090]  <TASK>
[46158.954099]  __schedule+0x3e6/0xbf0
[46158.954115]  schedule+0x27/0xf0
[46158.954123]  schedule_preempt_disabled+0x15/0x30
[46158.954131]  __mutex_lock.constprop.0+0x3d0/0x6d0
[46158.954140]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.954151]  dm_suspend+0xfd/0x260 [amdgpu]
[46158.954574]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.954581]  ? hdp_v5_2_update_clock_gating+0x219/0x370 [amdgpu]
[46158.954873]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.954880]  ? soc21_common_set_clockgating_state+0x80/0xb0 [amdgpu]
[46158.955154]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.955163]  amdgpu_device_ip_suspend_phase1+0x70/0xd0 [amdgpu]
[46158.955270]  amdgpu_device_ip_suspend+0x29/0x70 [amdgpu]
[46158.955363]  amdgpu_device_pre_asic_reset+0xe9/0x2c0 [amdgpu]
[46158.955457]  amdgpu_device_gpu_recover.cold+0x5cf/0xc17 [amdgpu]
[46158.955614]  amdgpu_job_timedout.cold+0x284/0x2c9 [amdgpu]
[46158.955769]  drm_sched_job_timedout+0x73/0x100 [gpu_sched]
[46158.955775]  process_one_work+0x174/0x330
[46158.955780]  worker_thread+0x252/0x390
[46158.955783]  ? __pfx_worker_thread+0x10/0x10
[46158.955786]  kthread+0xcf/0x100
[46158.955789]  ? __pfx_kthread+0x10/0x10
[46158.955792]  ret_from_fork+0x31/0x50
[46158.955796]  ? __pfx_kthread+0x10/0x10
[46158.955798]  ret_from_fork_asm+0x1a/0x30
[46158.955804]  </TASK>
[46158.955807] INFO: task kworker/u48:6:1933961 blocked for more than 120 seconds.
[46158.955810]       Tainted: G        W          6.12.12-amd64 #1 Debian 6.12.12-1
[46158.955817] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46158.955819] task:kworker/u48:6   state:D stack:0     pid:1933961 tgid:1933961 ppid:2      flags:0x00004000
[46158.955824] Workqueue: events_unbound commit_work [drm_kms_helper]
[46158.955833] Call Trace:
[46158.955835]  <TASK>
[46158.955837]  __schedule+0x3e6/0xbf0
[46158.955842]  schedule+0x27/0xf0
[46158.955845]  schedule_timeout+0x12f/0x160
[46158.955849]  wait_for_completion+0x8a/0x160
[46158.955854]  __flush_workqueue+0x155/0x420
[46158.955859]  amdgpu_dm_atomic_commit_tail+0x134d/0x3a00 [amdgpu]
[46158.956028]  ? __entry_text_end+0x101e86/0x101e89
[46158.956034]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956037]  ? dma_fence_default_wait+0x8c/0x260
[46158.956040]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[46158.956043]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956046]  ? kvfree_call_rcu+0x227/0x380
[46158.956049]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956051]  ? wait_for_completion_timeout+0x13b/0x170
[46158.956057]  commit_tail+0x91/0x130 [drm_kms_helper]
[46158.956065]  process_one_work+0x174/0x330
[46158.956069]  worker_thread+0x252/0x390
[46158.956072]  ? __pfx_worker_thread+0x10/0x10
[46158.956075]  kthread+0xcf/0x100
[46158.956078]  ? __pfx_kthread+0x10/0x10
[46158.956082]  ret_from_fork+0x31/0x50
[46158.956085]  ? __pfx_kthread+0x10/0x10
[46158.956087]  ret_from_fork_asm+0x1a/0x30
[46158.956092]  </TASK>
[46158.956098] INFO: task kworker/u48:5:1962342 blocked for more than 120 seconds.
[46158.956101]       Tainted: G        W          6.12.12-amd64 #1 Debian 6.12.12-1
[46158.956103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46158.956104] task:kworker/u48:5   state:D stack:0     pid:1962342 tgid:1962342 ppid:2      flags:0x00004000
[46158.956108] Workqueue: events_unbound commit_work [drm_kms_helper]
[46158.956115] Call Trace:
[46158.956117]  <TASK>
[46158.956119]  __schedule+0x3e6/0xbf0
[46158.956124]  schedule+0x27/0xf0
[46158.956127]  schedule_timeout+0x12f/0x160
[46158.956131]  wait_for_completion+0x8a/0x160
[46158.956135]  __flush_workqueue+0x155/0x420
[46158.956140]  amdgpu_dm_atomic_commit_tail+0x134d/0x3a00 [amdgpu]
[46158.956290]  ? __entry_text_end+0x101e86/0x101e89
[46158.956297]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956299]  ? queue_delayed_work_on+0x6f/0x80
[46158.956303]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956305]  ? kvfree_call_rcu+0x227/0x380
[46158.956307]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956310]  ? wait_for_completion_timeout+0x13b/0x170
[46158.956315]  commit_tail+0x91/0x130 [drm_kms_helper]
[46158.956323]  process_one_work+0x174/0x330
[46158.956327]  worker_thread+0x252/0x390
[46158.956330]  ? __pfx_worker_thread+0x10/0x10
[46158.956333]  kthread+0xcf/0x100
[46158.956336]  ? __pfx_kthread+0x10/0x10
[46158.956339]  ret_from_fork+0x31/0x50
[46158.956342]  ? __pfx_kthread+0x10/0x10
[46158.956344]  ret_from_fork_asm+0x1a/0x30
[46158.956349]  </TASK>
[46158.956351] INFO: task kworker/u48:1:1981087 blocked for more than 120 seconds.
[46158.956353]       Tainted: G        W          6.12.12-amd64 #1 Debian 6.12.12-1
[46158.956356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46158.956357] task:kworker/u48:1   state:D stack:0     pid:1981087 tgid:1981087 ppid:2      flags:0x00004000
[46158.956361] Workqueue: dm_vblank_control_workqueue amdgpu_dm_crtc_vblank_control_worker [amdgpu]
[46158.956505] Call Trace:
[46158.956507]  <TASK>
[46158.956509]  __schedule+0x3e6/0xbf0
[46158.956514]  schedule+0x27/0xf0
[46158.956517]  schedule_preempt_disabled+0x15/0x30
[46158.956520]  __mutex_lock.constprop.0+0x3d0/0x6d0
[46158.956523]  ? try_to_wake_up+0x7d/0x680
[46158.956527]  amdgpu_dm_crtc_vblank_control_worker+0x25/0x1f0 [amdgpu]
[46158.956666]  process_one_work+0x174/0x330
[46158.956670]  worker_thread+0x252/0x390
[46158.956674]  ? __pfx_worker_thread+0x10/0x10
[46158.956683]  kthread+0xcf/0x100
[46158.956686]  ? __pfx_kthread+0x10/0x10
[46158.956688]  ret_from_fork+0x31/0x50
[46158.956691]  ? __pfx_kthread+0x10/0x10
[46158.956693]  ret_from_fork_asm+0x1a/0x30
[46158.956698]  </TASK>
[46158.956716] INFO: task cat:2003107 blocked for more than 120 seconds.
[46158.956719]       Tainted: G        W          6.12.12-amd64 #1 Debian 6.12.12-1
[46158.956721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46158.956723] task:cat             state:D stack:0     pid:2003107 tgid:2003107 ppid:2003103 flags:0x00000002
[46158.956726] Call Trace:
[46158.956728]  <TASK>
[46158.956730]  __schedule+0x3e6/0xbf0
[46158.956735]  schedule+0x27/0xf0
[46158.956738]  schedule_timeout+0x12f/0x160
[46158.956742]  wait_for_completion+0x8a/0x160
[46158.956746]  __flush_work+0x269/0x350
[46158.956749]  ? __pfx_wq_barrier_func+0x10/0x10
[46158.956753]  gpu_recover_get+0x88/0x90 [amdgpu]
[46158.956852]  simple_attr_read+0x6b/0x120
[46158.956857]  debugfs_attr_read+0x3e/0x70
[46158.956861]  full_proxy_read+0x4e/0x90
[46158.956864]  vfs_read+0xe8/0x370
[46158.956868]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956870]  ? __handle_mm_fault+0xb1a/0xfa0
[46158.956876]  ksys_read+0x6d/0xf0
[46158.956879]  do_syscall_64+0x82/0x190
[46158.956884]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956886]  ? __count_memcg_events+0x53/0xf0
[46158.956889]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956892]  ? count_memcg_events.constprop.0+0x1a/0x30
[46158.956894]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956897]  ? handle_mm_fault+0x1bb/0x2c0
[46158.956900]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956903]  ? do_user_addr_fault+0x36c/0x620
[46158.956907]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956909]  ? srso_alias_return_thunk+0x5/0xfbef5
[46158.956912]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[46158.956916] RIP: 0033:0x7f81df41b59d
[46158.956940] RSP: 002b:00007ffd41916fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[46158.956944] RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f81df41b59d
[46158.956946] RDX: 0000000000040000 RSI: 00007f81defbf000 RDI: 0000000000000003
[46158.956948] RBP: 0000000000040000 R08: 00007f81df570480 R09: 0000000000000000
[46158.956950] R10: 0000000000000003 R11: 0000000000000246 R12: 00007f81defbf000
[46158.956953] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000
[46158.956958]  </TASK>
  • white screen crash:
[ 2286.937404] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[ 2286.941460] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2286.941528] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2287.164640] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[ 2287.165466] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2287.165535] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2287.391910] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[ 2287.393330] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2287.393357] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2297.449877] amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:79:crtc-0] flip_done timed out
[ 2297.454134] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2297.454430] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2307.690251] amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
[ 2307.690276] amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:79:crtc-0] commit wait timed out
[ 2307.693525] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2307.693609] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2307.694018] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2307.694078] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2317.929659] amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
[ 2317.929692] amdgpu 0000:c1:00.0: [drm] *ERROR* [CONNECTOR:93:eDP-1] commit wait timed out
[ 2317.933366] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2317.933410] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2317.933647] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2317.933729] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2328.169897] amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
[ 2328.169920] amdgpu 0000:c1:00.0: [drm] *ERROR* [PLANE:58:plane-3] commit wait timed out
[ 2328.170003] ------------[ cut here ]------------
[ 2328.170009] WARNING: CPU: 5 PID: 1431 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:9205 amdgpu_dm_atomic_commit_tail+0x392f/0x3a00 [amdgpu]
[ 2328.170439] Modules linked in: tun netconsole snd_seq_dummy snd_seq_midi snd_seq_midi_event snd_hrtimer snd_seq ccm algif_aead crypto_null des3_ede_x86_64 des_generic libdes md4 qrtr rfcomm cmac algif_hash algif_skcipher af_alg bnep amd_atl intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76 mac80211 snd_sof_amd_rembrandt btusb snd_sof_amd_acp btrtl snd_sof_pci aesni_intel btintel snd_sof_xtensa_dsp gf128mul libarc4 hid_sensor_als btbcm crypto_simd snd_sof snd_hda_codec_realtek hid_sensor_trigger btmtk snd_hda_codec_generic hid_sensor_iio_common cryptd cros_usbpd_charger leds_cros_ec snd_sof_utils bluetooth cfg80211 snd_hda_scodec_component snd_hda_codec_hdmi rapl led_class_multicolor cros_ec_sysfs cros_usbpd_logger industrialio_triggered_buffer cros_kbd_led_backlight pcspkr snd_hda_intel cros_usbpd_notify cros_ec_debugfs cros_ec_hwmon kfifo_buf cros_ec_chardev cros_charge_control
[ 2328.170492]  snd_soc_core binfmt_misc snd_intel_dspcfg wmi_bmof industrialio snd_usb_audio snd_intel_sdw_acpi spd5118 snd_compress snd_pcm_dmaengine snd_hda_codec k10temp sp5100_tco snd_pci_ps watchdog snd_rpl_pci_acp6x snd_pci_acp6x snd_usbmidi_lib snd_pci_acp5x snd_hda_core snd_rawmidi snd_rn_pci_acp3x snd_acp_config snd_seq_device snd_soc_acpi mc snd_pci_acp3x snd_hwdep snd_pcm snd_timer snd soundcore rfkill ucsi_acpi typec_ucsi nls_ascii nls_cp437 typec vfat roles fat amd_pmf ac amdtee ccp amd_sfh tee platform_profile amd_pmc hid_waltop joydev serio_raw evdev pkcs8_key_parser msr parport_pc ppdev lp nvme_fabrics parport configfs nvme_keyring efi_pstore nfnetlink efivarfs ip_tables x_tables autofs4 ext4 mbcache jbd2 crc32c_generic hid_logitech ff_memless hid_logitech_hidpp hid_logitech_dj r8153_ecm cdc_ether usbnet r8152 mii usbhid libphy amdgpu amdxcp drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_display_helper hid_multitouch hid_sensor_hub cec rc_core drm_ttm_helper hid_generic ttm
[ 2328.170559]  i2c_hid_acpi xhci_pci i2c_hid cros_ec_dev xhci_hcd drm_kms_helper hid cros_ec_lpcs nvme cros_ec thunderbolt usbcore nvme_core crc32_pclmul crc32c_intel drm i2c_piix4 i2c_smbus usb_common crc16 nvme_auth video button battery wmi
[ 2328.170583] CPU: 5 UID: 0 PID: 1431 Comm: Xorg Tainted: G        W          6.12.12-amd64 #1  Debian 6.12.12-1
[ 2328.170587] Tainted: [W]=WARN
[ 2328.170589] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP05, BIOS 03.05 03/29/2024
[ 2328.170592] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x392f/0x3a00 [amdgpu]
[ 2328.170754] Code: d0 80 5e c1 e8 02 ca 88 ff e9 20 fe ff ff 49 8d 87 40 31 04 00 c6 85 38 fe ff ff 00 48 89 85 48 fe ff ff e9 f6 cc ff ff 0f 0b <0f> 0b e9 64 f3 ff ff 0f 0b e9 28 cd ff ff 0f 0b e9 75 f3 ff ff 48
[ 2328.170757] RSP: 0018:ffffb43f42987848 EFLAGS: 00010002
[ 2328.170760] RAX: 0000000000000246 RBX: 0000000000000246 RCX: ffff9cf3b89a3118
[ 2328.170762] RDX: 0000000000000001 RSI: 0000000000000297 RDI: ffff9cf391c80178
[ 2328.170764] RBP: ffffb43f42987a90 R08: ffffb43f42987734 R09: 0000000000000000
[ 2328.170766] R10: ffffb43f429877a0 R11: ffffb43f429877a4 R12: 0000000000000002
[ 2328.170768] R13: 0000000000000000 R14: ffff9cf5e4f68200 R15: ffff9cf3b89a3000
[ 2328.170770] FS:  00007f5cda61ab00(0000) GS:ffff9cfa5e680000(0000) knlGS:0000000000000000
[ 2328.170773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2328.170775] CR2: 000012cc05caf000 CR3: 0000000168ca6000 CR4: 0000000000f50ef0
[ 2328.170777] PKRU: 55555554
[ 2328.170779] Call Trace:
[ 2328.170783]  <TASK>
[ 2328.170785]  ? amdgpu_dm_atomic_commit_tail+0x392f/0x3a00 [amdgpu]
[ 2328.170923]  ? __warn.cold+0x93/0xf6
[ 2328.170928]  ? amdgpu_dm_atomic_commit_tail+0x392f/0x3a00 [amdgpu]
[ 2328.171067]  ? report_bug+0xff/0x140
[ 2328.171071]  ? handle_bug+0x58/0x90
[ 2328.171074]  ? exc_invalid_op+0x17/0x70
[ 2328.171077]  ? asm_exc_invalid_op+0x1a/0x20
[ 2328.171082]  ? amdgpu_dm_atomic_commit_tail+0x392f/0x3a00 [amdgpu]
[ 2328.171213]  ? amdgpu_dm_atomic_commit_tail+0x2c87/0x3a00 [amdgpu]
[ 2328.171350]  ? __entry_text_end+0x101e86/0x101e89
[ 2328.171359]  commit_tail+0x91/0x130 [drm_kms_helper]
[ 2328.171368]  drm_atomic_helper_commit+0x11a/0x140 [drm_kms_helper]
[ 2328.171376]  drm_atomic_commit+0xa6/0xe0 [drm]
[ 2328.171390]  ? __pfx___drm_printfn_info+0x10/0x10 [drm]
[ 2328.171403]  drm_atomic_helper_set_config+0x74/0xb0 [drm_kms_helper]
[ 2328.171410]  drm_mode_setcrtc+0x46c/0x8a0 [drm]
[ 2328.171426]  ? __pfx_drm_syncobj_wait_ioctl+0x10/0x10 [drm]
[ 2328.171443]  ? __pfx_drm_mode_setcrtc+0x10/0x10 [drm]
[ 2328.171456]  drm_ioctl_kernel+0xad/0x100 [drm]
[ 2328.171473]  drm_ioctl+0x277/0x4f0 [drm]
[ 2328.171488]  ? __pfx_drm_mode_setcrtc+0x10/0x10 [drm]
[ 2328.171503]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[ 2328.171594]  __x64_sys_ioctl+0x91/0xd0
[ 2328.171598]  do_syscall_64+0x82/0x190
[ 2328.171603]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.171606]  ? syscall_exit_to_user_mode+0x172/0x210
[ 2328.171608]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.171611]  ? do_syscall_64+0x8e/0x190
[ 2328.171614]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.171617]  ? syscall_exit_to_user_mode+0x172/0x210
[ 2328.171619]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.171622]  ? do_syscall_64+0x8e/0x190
[ 2328.171625]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.171628]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 2328.171633] RIP: 0033:0x7f5cda9ba37b
[ 2328.171657] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 2328.171660] RSP: 002b:00007ffff74aadd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2328.171663] RAX: ffffffffffffffda RBX: 0000556fbc119ed0 RCX: 00007f5cda9ba37b
[ 2328.171665] RDX: 00007ffff74aae60 RSI: 00000000c06864a2 RDI: 000000000000000f
[ 2328.171667] RBP: 00007ffff74aae60 R08: 0000000000000190 R09: 0000556fbd38efd0
[ 2328.171669] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c06864a2
[ 2328.171670] R13: 000000000000000f R14: 0000556fba7630d0 R15: 0000556fbaa98f10
[ 2328.171674]  </TASK>
[ 2328.171676] ---[ end trace 0000000000000000 ]---
[ 2328.171686] ------------[ cut here ]------------
[ 2328.171688] WARNING: CPU: 5 PID: 1431 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8591 amdgpu_dm_atomic_commit_tail+0x393d/0x3a00 [amdgpu]
[ 2328.171844] Modules linked in: tun netconsole snd_seq_dummy snd_seq_midi snd_seq_midi_event snd_hrtimer snd_seq ccm algif_aead crypto_null des3_ede_x86_64 des_generic libdes md4 qrtr rfcomm cmac algif_hash algif_skcipher af_alg bnep amd_atl intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76 mac80211 snd_sof_amd_rembrandt btusb snd_sof_amd_acp btrtl snd_sof_pci aesni_intel btintel snd_sof_xtensa_dsp gf128mul libarc4 hid_sensor_als btbcm crypto_simd snd_sof snd_hda_codec_realtek hid_sensor_trigger btmtk snd_hda_codec_generic hid_sensor_iio_common cryptd cros_usbpd_charger leds_cros_ec snd_sof_utils bluetooth cfg80211 snd_hda_scodec_component snd_hda_codec_hdmi rapl led_class_multicolor cros_ec_sysfs cros_usbpd_logger industrialio_triggered_buffer cros_kbd_led_backlight pcspkr snd_hda_intel cros_usbpd_notify cros_ec_debugfs cros_ec_hwmon kfifo_buf cros_ec_chardev cros_charge_control
[ 2328.171890]  snd_soc_core binfmt_misc snd_intel_dspcfg wmi_bmof industrialio snd_usb_audio snd_intel_sdw_acpi spd5118 snd_compress snd_pcm_dmaengine snd_hda_codec k10temp sp5100_tco snd_pci_ps watchdog snd_rpl_pci_acp6x snd_pci_acp6x snd_usbmidi_lib snd_pci_acp5x snd_hda_core snd_rawmidi snd_rn_pci_acp3x snd_acp_config snd_seq_device snd_soc_acpi mc snd_pci_acp3x snd_hwdep snd_pcm snd_timer snd soundcore rfkill ucsi_acpi typec_ucsi nls_ascii nls_cp437 typec vfat roles fat amd_pmf ac amdtee ccp amd_sfh tee platform_profile amd_pmc hid_waltop joydev serio_raw evdev pkcs8_key_parser msr parport_pc ppdev lp nvme_fabrics parport configfs nvme_keyring efi_pstore nfnetlink efivarfs ip_tables x_tables autofs4 ext4 mbcache jbd2 crc32c_generic hid_logitech ff_memless hid_logitech_hidpp hid_logitech_dj r8153_ecm cdc_ether usbnet r8152 mii usbhid libphy amdgpu amdxcp drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_display_helper hid_multitouch hid_sensor_hub cec rc_core drm_ttm_helper hid_generic ttm
[ 2328.171959]  i2c_hid_acpi xhci_pci i2c_hid cros_ec_dev xhci_hcd drm_kms_helper hid cros_ec_lpcs nvme cros_ec thunderbolt usbcore nvme_core crc32_pclmul crc32c_intel drm i2c_piix4 i2c_smbus usb_common crc16 nvme_auth video button battery wmi
[ 2328.171986] CPU: 5 UID: 0 PID: 1431 Comm: Xorg Tainted: G        W          6.12.12-amd64 #1  Debian 6.12.12-1
[ 2328.171991] Tainted: [W]=WARN
[ 2328.171993] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP05, BIOS 03.05 03/29/2024
[ 2328.171996] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x393d/0x3a00 [amdgpu]
[ 2328.172175] Code: 49 8d 87 40 31 04 00 c6 85 38 fe ff ff 00 48 89 85 48 fe ff ff e9 f6 cc ff ff 0f 0b 0f 0b e9 64 f3 ff ff 0f 0b e9 28 cd ff ff <0f> 0b e9 75 f3 ff ff 48 c7 85 30 fe ff ff 00 00 00 00 48 c7 85 f8
[ 2328.172180] RSP: 0018:ffffb43f42987848 EFLAGS: 00010082
[ 2328.172183] RAX: 0000000000000001 RBX: 0000000000000246 RCX: ffff9cf3b89a3118
[ 2328.172185] RDX: 0000000000000001 RSI: 0000000000000297 RDI: ffff9cf391c80178
[ 2328.172188] RBP: ffffb43f42987a90 R08: ffffb43f42987734 R09: 0000000000000000
[ 2328.172190] R10: ffffb43f429877a0 R11: ffffb43f429877a4 R12: 0000000000000002
[ 2328.172192] R13: 0000000000000000 R14: ffff9cf5e4f68200 R15: ffff9cf3b89a3000
[ 2328.172196] FS:  00007f5cda61ab00(0000) GS:ffff9cfa5e680000(0000) knlGS:0000000000000000
[ 2328.172198] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2328.172200] CR2: 000012cc05caf000 CR3: 0000000168ca6000 CR4: 0000000000f50ef0
[ 2328.172202] PKRU: 55555554
[ 2328.172204] Call Trace:
[ 2328.172206]  <TASK>
[ 2328.172208]  ? amdgpu_dm_atomic_commit_tail+0x393d/0x3a00 [amdgpu]
[ 2328.172348]  ? __warn.cold+0x93/0xf6
[ 2328.172351]  ? amdgpu_dm_atomic_commit_tail+0x393d/0x3a00 [amdgpu]
[ 2328.172486]  ? report_bug+0xff/0x140
[ 2328.172490]  ? handle_bug+0x58/0x90
[ 2328.172492]  ? exc_invalid_op+0x17/0x70
[ 2328.172495]  ? asm_exc_invalid_op+0x1a/0x20
[ 2328.172499]  ? amdgpu_dm_atomic_commit_tail+0x393d/0x3a00 [amdgpu]
[ 2328.172633]  ? amdgpu_dm_atomic_commit_tail+0x2c87/0x3a00 [amdgpu]
[ 2328.172775]  ? __entry_text_end+0x101e86/0x101e89
[ 2328.172784]  commit_tail+0x91/0x130 [drm_kms_helper]
[ 2328.172793]  drm_atomic_helper_commit+0x11a/0x140 [drm_kms_helper]
[ 2328.172802]  drm_atomic_commit+0xa6/0xe0 [drm]
[ 2328.172825]  ? __pfx___drm_printfn_info+0x10/0x10 [drm]
[ 2328.172838]  drm_atomic_helper_set_config+0x74/0xb0 [drm_kms_helper]
[ 2328.172845]  drm_mode_setcrtc+0x46c/0x8a0 [drm]
[ 2328.172861]  ? __pfx_drm_syncobj_wait_ioctl+0x10/0x10 [drm]
[ 2328.172877]  ? __pfx_drm_mode_setcrtc+0x10/0x10 [drm]
[ 2328.172890]  drm_ioctl_kernel+0xad/0x100 [drm]
[ 2328.172907]  drm_ioctl+0x277/0x4f0 [drm]
[ 2328.172920]  ? __pfx_drm_mode_setcrtc+0x10/0x10 [drm]
[ 2328.172936]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[ 2328.173027]  __x64_sys_ioctl+0x91/0xd0
[ 2328.173030]  do_syscall_64+0x82/0x190
[ 2328.173034]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.173037]  ? syscall_exit_to_user_mode+0x172/0x210
[ 2328.173040]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.173042]  ? do_syscall_64+0x8e/0x190
[ 2328.173045]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.173048]  ? syscall_exit_to_user_mode+0x172/0x210
[ 2328.173050]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.173052]  ? do_syscall_64+0x8e/0x190
[ 2328.173056]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2328.173059]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 2328.173062] RIP: 0033:0x7f5cda9ba37b
[ 2328.173067] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 2328.173070] RSP: 002b:00007ffff74aadd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2328.173073] RAX: ffffffffffffffda RBX: 0000556fbc119ed0 RCX: 00007f5cda9ba37b
[ 2328.173075] RDX: 00007ffff74aae60 RSI: 00000000c06864a2 RDI: 000000000000000f
[ 2328.173077] RBP: 00007ffff74aae60 R08: 0000000000000190 R09: 0000556fbd38efd0
[ 2328.173079] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c06864a2
[ 2328.173081] R13: 000000000000000f R14: 0000556fba7630d0 R15: 0000556fbaa98f10
[ 2328.173085]  </TASK>
[ 2328.173087] ---[ end trace 0000000000000000 ]---
[ 2328.173500] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2328.173555] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2328.173933] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2328.173978] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2328.400808] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data

(...repeats ~11 more times, always amdgpu_dm.c:9205 and amdgpu_dm.c:8591...)
(...followed by successful, manually triggered GPU recovery...)

[ 2899.817762] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2899.817831] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2899.818220] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2899.818277] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2911.846056] amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:79:crtc-0] flip_done timed out
[ 2911.849650] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2911.849719] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2917.785566] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 2917.789556] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2917.789626] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2918.692201] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - mpc2_assert_idle_mpcc line:481
[ 2918.693561] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2918.693630] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2919.257537] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1000us * 30 tries - dcn31_wait_for_det_apply line:118
[ 2919.261562] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2919.261633] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2919.484657] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc314_disable_crtc line:145
[ 2919.485493] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2919.485560] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2919.824403] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 10us * 10020 tries - enc1_stream_encoder_dp_blank line:943
[ 2919.825477] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2919.825530] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2920.734609] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 2920.737784] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2920.738039] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2920.764804] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 2920.765723] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2920.765804] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2920.765833] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[ 2920.765884] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 2920.766239] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2920.766279] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2920.767965] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 2920.769474] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2920.769541] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2920.775559] [drm] DMUB hardware initialized: version=0x08004800
[ 2921.158121] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc314_disable_crtc line:145
[ 2921.161712] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2921.161814] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2921.883855] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 2921.883886] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 2921.883891] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 2921.883895] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 2921.883899] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 2921.883902] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 2921.883905] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 2921.883909] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 2921.883913] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 2921.883916] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 2921.883919] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 2921.883922] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 2921.883924] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 2921.885471] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2921.885510] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2921.885788] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
(...repeats...)
[ 2921.887316] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2921.887433] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2921.887460] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2921.888097] amdgpu 0000:c1:00.0: amdgpu: GPU reset(1) succeeded!
[ 2921.889297] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2921.889324] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".

  • black screen: not very useful; it’s just a bunch of GPU reset attempts (possibly manually triggered) that claim to be successful. No previous GPU-related errors.
[427892.572872] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[427892.580143] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427892.580236] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427892.709016] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[427892.714007] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427892.714385] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427892.740764] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[427892.741539] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427892.741562] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427892.741861] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[427892.741901] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[427892.742503] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[427892.745241] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427892.745300] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427892.745583] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427892.745646] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427892.748911] [drm] DMUB hardware initialized: version=0x08004800
[427893.453246] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[427893.453257] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[427893.453261] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[427893.453263] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[427893.453266] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[427893.453268] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[427893.453271] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[427893.453273] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[427893.453276] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[427893.453278] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[427893.453281] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[427893.453283] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[427893.453286] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[427893.453624] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427893.453654] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
(...repeats...)
[427893.454772] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427893.454808] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427893.454875] amdgpu 0000:c1:00.0: amdgpu: GPU reset(1) succeeded!
[427893.454916] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427893.454954] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
(...repeats...)
[427893.455761] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427893.455787] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427899.600088] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[427899.601972] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427899.602141] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427899.743682] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[427899.745255] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427899.745323] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427899.773784] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[427899.774995] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[427899.775035] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[427899.776919] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[427899.777235] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427899.777303] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427899.777667] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427899.777733] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427899.777946] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427899.777971] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427899.784286] [drm] DMUB hardware initialized: version=0x08004800
[427900.486689] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[427900.486704] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[427900.486708] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[427900.486711] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[427900.486714] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[427900.486717] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[427900.486720] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[427900.486724] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[427900.486727] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[427900.486729] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[427900.486732] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[427900.486735] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[427900.486737] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[427900.489353] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427900.489370] amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) succeeded!
[427900.489439] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427900.489894] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
(...repeats...)
[427900.492130] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427905.421252] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[427905.425424] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427905.425471] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427905.558484] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[427905.561396] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427905.561500] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427905.589580] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[427905.590713] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[427905.590751] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[427905.592130] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[427905.593055] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427905.593098] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427905.593377] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427905.593444] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427905.593694] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[427905.593745] systemd-journald[444]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[427905.599769] [drm] DMUB hardware initialized: version=0x08004800
[427906.303134] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[427906.303150] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[427906.303156] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[427906.303161] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[427906.303180] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[427906.303184] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[427906.303189] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[427906.303193] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[427906.303198] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[427906.303203] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[427906.303207] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[427906.303212] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[427906.303216] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[427906.305070] amdgpu 0000:c1:00.0: amdgpu: GPU reset(3) succeeded!
(...or so it claims; still stuck at black screen...)

  • sudden freeze? Not sure.
[341578.261844] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770)
[341578.261862] amdgpu 0000:c1:00.0: amdgpu:  in process Xorg pid 1426 thread Xorg:cs0 pid 1469)
[341578.261870] amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x0000800101c51000 from client 10 
[341578.261877] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501030
[341578.261883] amdgpu 0000:c1:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[341578.261889] amdgpu 0000:c1:00.0: amdgpu:     MORE_FAULTS: 0x0
[341578.261894] amdgpu 0000:c1:00.0: amdgpu:     WALKER_ERROR: 0x0 
[341578.261899] amdgpu 0000:c1:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[341578.261904] amdgpu 0000:c1:00.0: amdgpu:     MAPPING_ERROR: 0x0
[341578.261909] amdgpu 0000:c1:00.0: amdgpu:     RW: 0x0 
[341578.262691] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[341578.262736] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
(...repeats...)
[341578.264551] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[341578.264569] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[341588.490365] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[341588.493039] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[341588.493452] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
[341588.494519] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[341588.494560] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[341588.494907] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[341588.494931] systemd-journald[460]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".

Hi.

It looks like most of the problems you see are related to the amdgpu driver in Linux.
Framework don’t write that bit of software.
The best hope for fixes to the amdgpu driver is by reporting the problems here:

AMD people read and fix bugs there.

You should probably add the framework-laptop-13-amd tag.

I am sorry to hear you have this experience; I’d be very angry to have a device betray me with unpredictable outages, and likely chasing a replacement board. I would like to believe it’s not the platform, but you’re right to say that there’s a swathe of people experiencing problems like yours, which sucks. My anecdotal experience is different from yours: in the 15 months I’ve been running Debian Trixie and KDE Plasma, the only crashes I’ve had involved USB power vs networking and Samba.

Have you run a memtest to rule out duff memory? What else have you done to pinpoint a diagnosis on the CPU or GPU?

What else do you know about this syndrome of ‘Failed to chase symlinks’? I’m going to infer from the /sys/ device hierarchy that systemd is trying to find the endpoint to send a message to, or read from, a device that’s no longer registered or no longer responding.

[ 2317.933366] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2317.933410] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".
[ 2317.933647] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/class/pci/0000:c1:00.0".
[ 2317.933729] systemd-journald[441]: sd-device: Failed to chase symlinks in "/sys/firmware/pci/0000:c1:00.0".

If the power management is being aggressive, you can disable ‘active state power management’ (ASPM) for PCIe devices with a kernel command-line option: pcie_aspm=off.
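Before changing boot options, it may help to see which ASPM policy the kernel is currently using. Here is a minimal Python sketch, assuming the usual sysfs location /sys/module/pcie_aspm/parameters/policy, where the active policy is shown in square brackets. The function names are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Usual sysfs location of the kernel's PCIe ASPM policy selector
# (assumed to be present; the reader below returns None if it is not).
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text: str) -> Optional[str]:
    """Return the bracketed (active) entry from the policy string."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path: Path = ASPM_POLICY_PATH) -> Optional[str]:
    """Read the active policy from sysfs, or None if it cannot be read."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None
```

Calling read_aspm_policy() before and after a change is a quick way to confirm that the setting actually took effect.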

Again, I am sorry to hear you have this experience and hope that this helps you find some resolution.

K3n.

My first bet would definitely be amdgpu kernel module bug rather than a hardware problem.

@James3 and @Morgwai Thanks; when I have more time I’ll try to search through the linked issue tracker if I can match up some of the crashes with existing issues (and perhaps report the others, if they happen to be those I have more data on).

In the past I think I also saw threads with amdgpu-related log messages that turned out to be kernel issues, and in one of the freezing/crashing threads I remember seeing Mario (one of the devs) mention “this looks like Mesa issue” or something along those lines. So I’m not sure what the real origin of the crashes is (or whether amdgpu is split into userspace and kernel parts or something – i.e. there may be no real distinction between “kernel” and “driver”). But I suppose it’s better to report the issues to the wrong place than not to report them at all.

@Brian_Gregory The post is in Framework Laptop 13 → Linux forum, and the only allowed tags are distribution names. I also wanted to add freezing-crashing, amd, linux tags, but it would not let me.

@Kenny_Lewis

I’d be very angry to have a device betray me with unpredictable outages, and likely chasing a replacement board

Sure, it is pretty annoying, but I suppose I kind of got used to it. I tend to have about 4 workspaces full of open terminals and other windows, which makes recovering from each crash extra painful – but I just ended up writing scripts that open everything at the right place and size. It shouldn’t be necessary, but at least it is bearable now.

I’m not “chasing a replacement board”, because, as I mentioned, I had a very long run without any crashes. Since the crashes now repeat at least once or twice a week, if not more, it seems very unlikely the hardware would just randomly decide to behave several months in a row if it was really faulty. Memtest86+ also shows no errors (though I waited only for one full run; maybe I should also try an overnight test, given how infrequent the crashes can be).

Not sure about the symlinks; neither /sys/class/pci/ nor /sys/firmware/ exists on my system, so maybe systemd just assumes some directory structure that is not valid on Debian. I did not experiment with PCIe power management; I’ll take a look at that (I’m plugged in most of the time so higher consumption would not be an issue).

Anyway, everything seems to be pointing to a SW update that started the recent issues, so trying to get a replacement board would likely just waste not only the Framework support staff’s time, but also mine (it’s my work laptop, so I can’t just unplug all expansion cards, install an officially supported distribution, and wait two weeks to see if it happens to crash).

This one: there is an option in the CMOS / EFI setup called something like “PCIE power save”. Turn it off, and PCIe will not drop to gen1.

Hi,

I have the same issue: the screen freezes, I cannot move my mouse, but I can ssh into the laptop. It started a week or two ago after a big and long-overdue Debian update. For me, the crashes are rather frequent: they happen every time after 2-3 hours of work (and I usually don’t do any intensive work). So far I haven’t tried a manual GPU reset, because I didn’t know how to do it, but I’ll certainly try it next time I encounter the bug!

Like you, I’m trying to pinpoint which component fails, but so far the amdgpu driver looks to be the most likely.

For reference, my dmesg output:

amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:79:crtc-0] flip_done timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:79:crtc-0] commit wait timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* [CONNECTOR:93:eDP-1] commit wait timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* [PLANE:58:plane-3] commit wait timed out
amdgpu 0000:c1:00.0: [drm] *ERROR* flip_done timed out
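To compare logs like these between machines, it can help to condense them first. Here is a minimal Python sketch (not something used in this thread) that drops the systemd-journald “Failed to chase symlinks” noise and counts the remaining amdgpu *ERROR* messages by type. The function name and the normalisation rules are assumptions chosen for illustration.

```python
import re
from collections import Counter

NOISE = "Failed to chase symlinks"                 # journald chatter to skip
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")     # e.g. "[ 2911.846056] "
OBJECT_TAG = re.compile(r"^\[[^\]]+\]\s*")         # e.g. "[CRTC:79:crtc-0] "


def summarize_amdgpu_errors(lines):
    """Count amdgpu *ERROR* messages, ignoring timestamps and object tags."""
    counts = Counter()
    for line in lines:
        if NOISE in line or "amdgpu" not in line:
            continue
        message = TIMESTAMP.sub("", line.strip())
        if "*ERROR*" not in message:
            continue
        # Keep only the text after the *ERROR* marker.
        message = message.split("*ERROR*", 1)[1].strip()
        # Drop a leading "[CRTC:...]" / "[PLANE:...]" tag so variants group together.
        message = OBJECT_TAG.sub("", message)
        counts[message] += 1
    return counts
```

Feeding it the lines of a saved dmesg dump (for example open("dmesg.txt"), a hypothetical file name) gives a Counter whose most_common() output shows which error dominates a given crash.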

EDIT:

I think this issue might be relevant (it contains a workaround and a possible fix, which should be included in 6.14-rc2): https://gitlab.freedesktop.org/drm/amd/-/issues/3925#note_2773473

I tried disabling the power management at runtime, but it did not make any difference – other than keeping the drive quite hot even when idle. I had no idea the PCIe link itself is so power hungry; I thought most of the power would be consumed by the flash and memory controller activity…

While power consumption isn’t an issue for me, I did not want to constantly toast the drive, so I switched power management back on (besides, it is staying at full speed for days after boot, and then suddenly gets permanently stuck at degraded speed, which does not sound like a power saving behavior at all). But thanks for mentioning it; I did not notice there is a system-wide option in the EFI menu.
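To catch the moment the link gets stuck at a degraded speed, the current and maximum speeds can be compared periodically. This is a rough sketch, assuming the kernel exposes `current_link_speed` and `max_link_speed` for the device under `/sys/bus/pci/devices/` and reports them as strings like `8.0 GT/s PCIe`. The helper names are hypothetical, and the PCI address of the NVMe drive has to be looked up first (for example with lspci).

```python
from pathlib import Path


def parse_gts(text):
    """Return the speed in GT/s from a sysfs string such as '8.0 GT/s PCIe'.

    Assumes the first whitespace-separated token is a number; a value such
    as 'Unknown' would raise ValueError.
    """
    return float(text.split()[0])


def link_is_degraded(current, maximum):
    """True if the current link speed is below the maximum supported speed."""
    return parse_gts(current) < parse_gts(maximum)


def read_link_state(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read (current, maximum) link speed strings for a PCI address.

    `bdf` is the full address, e.g. '0000:01:00.0' (a placeholder, not the
    address of any particular drive).
    """
    dev = Path(sysfs_root) / bdf
    current = (dev / "current_link_speed").read_text().strip()
    maximum = (dev / "max_link_speed").read_text().strip()
    return current, maximum
```

Running `link_is_degraded(*read_link_state(...))` from a cron job or timer and logging the result with a timestamp would show whether the drop lines up with other activity on the machine.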

I think the laptop is just trolling me at this point, because since the moment I started complaining, I have not had a single crash (more than 2 weeks now)… I think one of the last things I did was setting the VRAM allocation mode back from “auto” to “game optimized”, which was recommended last year as a workaround for other issues.

As I currently get no crashes, I do not dare to make any updates or changes, so I have not really made progress with the debugging. But I’m starting to think the problems may occur mainly under memory pressure: apart from enabling the “game optimized” mode, I’ve also been trying to keep more RAM free lately – so either of the two changes could be the reason why I have had no crashes in the past two weeks.

Coincidentally, the NVMe PCIe link speed degradation often happened around 4 AM, when my rsync backup is running – and rsync is known to consume a lot of RAM when transferring many small files. So perhaps the SSD issues are related after all…

The “auto” VRAM mode could probably make it worse, since it pre-allocates only 512 MB to the GPU and allocates the rest dynamically – and if the system runs out of RAM, it wouldn’t surprise me if amdgpu could not handle a denied or delayed memory allocation request and crashed as a result…

If full memory really is what “enables” the crashes, maybe a good permanent workaround would be to simply get so much RAM that I never fill it completely, and / or to set vm.overcommit_memory=2 and tune the related limits so that it can’t be completely filled even by accident. But then again, the “white screen” crash happened soon after boot, when most of my RAM was still empty, so that’s one point against the whole theory. So, who knows…
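For anyone wanting to test the memory-pressure theory, one option is to log how much RAM is still available and compare that with the time of each crash. Below is a minimal sketch that parses the text of `/proc/meminfo`. The function names and the 10% threshold are arbitrary assumptions for illustration, not values taken from this thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name:   12345 kB" style lines with a numeric value.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of total RAM the kernel estimates is still available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def under_pressure(meminfo, threshold=0.10):
    """True if less than `threshold` of RAM is available (assumed cutoff)."""
    return available_fraction(meminfo) < threshold
```

Calling `parse_meminfo(open("/proc/meminfo").read())` once a minute and appending the result to a log file would leave a record that survives a freeze, as long as the file gets written to disk in time.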

Ah, that’s a great find – although a log I have from one of the “black screen” freezes shows that I a) was running an unaffected kernel version (6.12.11), and b) had the amdgpu.dcdebugmask=0x10 kernel parameter set as a workaround for another 6.12 kernel issue (dropped frames / cursor jerkiness). So yours may be another type of crash that I have not personally seen yet (your dmesg also looks different from all of mine). Hope it helps you though! Two or three crashes every day sounds really bad.

OK, finally some good news (for me, anyway): I found the cause of the “black screen freeze” and a way to recover from it.

For some unknowable reason I’m not allowed to edit my own older posts, so I cannot update the original as I promised; hopefully people find it here…

TL;DR workaround / recovery from the black screen (connected via SSH):

sudo apt remove xscreensaver
killall xscreensaver
export DISPLAY=:0
xset dpms force off

Details:

The “black screen” freeze has something to do with xscreensaver: I refrained from using it for the past few weeks (as well as limiting the RAM usage as mentioned earlier), but I needed to lock the screen over this weekend – and when I came back, it was stuck on a black screen again.

Since the screen lock was the only difference in my recent activity, I now knew where to look, and soon found out that 1) killing, restarting or removing xscreensaver does not help, and 2) after starting it again and trying to activate it from the command line, it runs into some issue:

xscreensaver-command: blanking
xscreensaver-command: waiting for status change
xscreensaver-command: read status property: 0x3f8: 0, 1743357497, 0, 0
xscreensaver-command: read status property: 0x3f8: 0, 1743357497, 0, 0
xscreensaver-command: read status property: 0x3f8: 0, 1743357497, 0, 0
... (repeats)
xscreensaver-command: Timed out waiting for screen to blank

Clearly xscreensaver has some trouble talking to the GPU or the X server when telling it to blank the screen. It does not take much imagination to conclude that the opposite is also true, and that the “black screen freeze” is simply the screen being stuck in a blanked state. That would explain why the cursor keeps moving and even changes shape (text cursor / resize / hyperlink) as I move it around: the GPU is fine and all the applications are running; everything is just stuck behind some black overlay.

I tried to find out how xscreensaver performs the blanking, in the hope that it could be reverted in some other way. While xset may not be exactly what xscreensaver uses internally, running xset dpms force off seems to be enough to shake off whatever weird state the system was in and get the running session working again.

After recovering from the stuck state, the obvious preventive measure is to use a different screen locking tool – assuming it does not suffer from the same issue (which remains to be seen, I suppose).
