FW13, AMDGPU, Linux: crashes on resume

gasche · October 7, 2024, 7:36am

I experience crashes with my Framework 13 machine (Ryzen 7040, AMD GPU, Fedora 40), maybe once or twice a week. I have not managed to find the root cause of the issue (there is not much in the logs), so I am posting here in case people more familiar with this kind of troubleshooting can help.

I have an up-to-date BIOS (3.05; there is no more recent version). I am using Fedora 40 with Wayland and GNOME.

The crashes happen either directly on resume, or when plugging in an external monitor (shortly after resume). The gnome-session crashes and restarts; I don’t need to halt or reboot the machine.

In the logs

There is nothing particularly relevant in the dmesg output, and I failed to identify a root cause in journalctl -x -n 3000. I can see that all graphical clients are killed due to a loss of display device, for example (there are other such events):

Oct 07 08:54:08 framawork thunderbird[1663793]: Error reading events from display: Broken pipe
Oct 07 08:54:08 framawork WebExtensions[1663927]: Error reading events from display: Broken pipe
Oct 07 08:54:08 framawork twinkle[1692808]: Error reading events from display: Broken pipe
Oct 07 08:54:08 framawork akonadi_maildir[1657715]: Error reading events from display: Broken pipe
Oct 07 08:54:08 framawork evolution-alarm[1656927]: Error reading events from display: Broken pipe
Oct 07 08:54:08 framawork gsd-power[1656906]: Error reading events from display: Broken pipe

Before that, there are other error messages from processes that also suggest a display issue:

Oct 07 08:54:06 framawork kalendarac[1657015]: qt.qpa.wayland: Creating a fake screen in order for Qt not to crash
Oct 07 08:54:06 framawork gnome-shell[1658403]: [Parent 1658403, Main Thread] WARNING: Couldn't map window 0x7f1b94deff40 as subsurface because its parent is not mapped.: 'glib warning', file /builddir/build/BUILD/firefox-129.0.2/toolkit/xre/nsSigHandlers.cpp:187
Oct 07 08:54:06 framawork firefox[1658403]: Couldn't map window 0x7f1b94deff40 as subsurface because its parent is not mapped.
Oct 07 08:54:06 framawork twinkle[1692808]: qt.qpa.wayland: Creating a fake screen in order for Qt not to crash
Oct 07 08:54:06 framawork thunderbird[1663793]: Couldn't map window 0x7f2a537f5ae0 as subsurface because its parent is not mapped.
Oct 07 08:54:06 framawork gnome-shell[1658403]: [Parent 1658403, Main Thread] WARNING: Couldn't map window 0x7f1b94deff40 as subsurface because its parent is not mapped.: 'glib warning', file /builddir/build/BUILD/firefox-129.0.2/toolkit/xre/nsSigHandlers.cpp:187
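
One way to narrow the search is to find the first "Broken pipe" line and keep only what was logged in the few seconds before it. The following is a minimal Python sketch of that idea, not a tested tool: it assumes the default journalctl "short" line format shown in the excerpts above, the helper names `parse_line` and `lines_before_first` are made up for illustration, and the year is an assumption because that format does not carry one.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "Oct 07 08:54:08 host unit[pid]: message".
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\S+) ([^:]+): (.*)$")


def parse_line(line, year=2024):
    """Return (timestamp, source, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # The short format has no year, so an assumed one is prepended.
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(3), m.group(4)


def lines_before_first(lines, marker, window_seconds=5):
    """Return parsed entries logged in the window before the first marker hit."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    first = next((ts for ts, _, msg in parsed if marker in msg), None)
    if first is None:
        return []
    start = first - timedelta(seconds=window_seconds)
    # Keep entries inside [start, first); the marker line itself is excluded.
    return [(ts, src, msg) for ts, src, msg in parsed if start <= ts < first]
```

Saving the journal output to a file and passing its lines with the marker "Broken pipe" would list the candidates from the seconds before the clients lost their display. Lines that do not match the format (for example elided `[...]` lines) are skipped.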

I suspect an AMDGPU issue, but I cannot find any clearly relevant line in the log shortly before the crash, except possibly the following one:

Oct 07 08:54:05 framawork kernel: ucsi_acpi USBC000:00: UCSI_GET_PDOS failed (-70)

Searching online suggests that this is a communication error between the USB subsystem and the power-draw subsystem. perror 70 reports “Communication error on send”. I don’t understand how this could result in a graphical session crash.
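
The errno lookup can also be done from Python, which is convenient when several kernel messages end in a negative code. This is only a sketch: the helper name `decode_kernel_errno` is made up, it assumes the code appears as "(-N)" in the message, and the name and description come from the errno table of the machine running the script, so they are only meaningful on the same kind of Linux system that produced the log.

```python
import errno
import os
import re


def decode_kernel_errno(message):
    """Extract a '(-N)' code from a kernel message and describe it.

    Returns (code, symbolic_name, description), or None if no code is found.
    """
    m = re.search(r"\(-(\d+)\)", message)
    if m is None:
        return None
    code = int(m.group(1))
    # errno.errorcode maps numbers to names on the local platform.
    name = errno.errorcode.get(code, "unknown")
    return code, name, os.strerror(code)


print(decode_kernel_errno("ucsi_acpi USBC000:00: UCSI_GET_PDOS failed (-70)"))
```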

Shortly before that, the GPU reports in the log that it resumed successfully. This includes some warning messages, but I would guess that they are unrelated. That part of the log seems to indicate that, at that point in time, everything was fine:

Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Oct 07 08:54:01 framawork kernel: nvme nvme0: 12/0/0 default/read/poll queues
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:223
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:231
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:239
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:247
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
Oct 07 08:54:01 framawork kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[...]
Oct 07 08:54:01 framawork kernel: [drm] ring gfx_32788.1.1 was added
[...]
Oct 07 08:54:01 framawork kernel: PM: resume devices took 0.678 seconds
Oct 07 08:54:01 framawork kernel: OOM killer enabled.
Oct 07 08:54:01 framawork kernel: Restarting tasks ... done.
Oct 07 08:54:01 framawork kernel: random: crng reseeded on system resumption
Oct 07 08:54:01 framawork systemd-resolved[1567]: Clock change detected. Flushing caches.
Oct 07 08:54:01 framawork kernel: PM: suspend exit
Oct 07 08:54:01 framawork systemd-logind[1629]: Lid opened.
Oct 07 08:54:01 framawork systemd-sleep[1728476]: System returned from sleep operation 'suspend'.

Questions

My impression is that this is a kernel (probably amdgpu) or Wayland issue, but I am surprised to see no clearer indication of failure in the logs. If all graphical clients suddenly lose their display device, this suggests that something crashed (and not just the UCSI-PDOS thing). Why is that something not logging a failure?

Do people here know how to configure the relevant subsystems to provide more information on failure?

James3 · October 7, 2024, 8:02am

You could use the amd_s2idle.py script to test this. It will give you a better idea as to what is wrong.

truffaldino · October 7, 2024, 11:41pm

This is a long shot, as I have a 12th gen FW13 and run Linux Mint, which is nothing like your setup. Bear with me a moment.

For months the filesystems on my SSD would get remounted read-only when I woke the FW13 from idle by whatever means. Aside from completely destabilising the system, the read-only filesystems also prevented useful diagnostic data from being captured in the logs.

Ultimately this was resolved by updating the SSD firmware (I was using a WD SN770). You can read about that journey here. YMMV.

Dino