[RESOLVED] Kernel 6.8-rc: System freezes after resuming from suspend, reproducers wanted

While trying out the release candidates of Linux kernel 6.8, I found a regression in resuming from suspend: on the 2nd (sometimes the 3rd or 4th) suspend-resume cycle, the system freezes, the screen becomes static, no input is processed anymore, and I need to hard-reboot the device.

I’d love it if some of you could try to reproduce this yourselves. There should not be any danger, as the kernel works fine as long as you don’t suspend-resume.

I’ve already reported this to the kernel mailing list, here’s the post: https://lore.kernel.org/regressions/0d3bdb0f-63a7-4c48-b4d4-157b7b7c1689@amd.com/T/#t

When I first wrote this forum post, the mail was not online yet. I’m keeping a copy of the report below for better discoverability.


I’ve found a regression somewhere between 6.7.4 and 6.8.0-rc0 that causes my Framework 7840 AMD laptop to freeze after waking it from a suspend. I can reliably trigger the issue, but unfortunately cannot provide useful logs for now and hope to get some help with doing so.

How to reproduce

The last working kernel release was 6.7.4; the issue appeared in all 6.8 release candidates from rc1 through rc4.

  1. normally boot the system with a 6.8 kernel
  2. suspend and resume the system 2 to 4 times. Usually, the freeze already occurs at the 2nd resume.
    2.1 graphical approach: I’ve reproduced this directly from the “Suspend” button in my SDDM display manager (X11)
    2.2 TTY approach: You can also switch directly to a tty after boot, log in, and then issue systemctl suspend 2 to 4 times
  3. After resume number 2 or later, the system freezes while resuming and cannot be used anymore. The screen is switched on and displays something, but no inputs are processed and the image is static.
    3.1 graphical approach: When suspending and resuming by closing and opening the laptop lid, the screen is black with the cursor displayed. When doing so while keeping the lid open, the display manager’s background image and a cursor are displayed, but only statically frozen. No keyboard or touchpad input is processed.
    3.2 on a TTY: When suspending and resuming from a tty, a few kernel messages still manage to be printed, but after that no new information is displayed. E.g. when keeping a journalctl -f session open in another tmux pane, that journal output does not update anymore. Keyboard inputs are not processed anymore. Nonetheless, the cursor continues to blink regularly.
    I’ve attached two screenshots showing the situation with kernels 6.7.4 (working) and 6.8 (broken).
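
The TTY approach above can be automated so that nobody has to wake the laptop by hand between cycles. The following is only a sketch and not part of the original report: it assumes rtcwake from util-linux is installed, that the RTC alarm is able to wake this machine, and that the script runs as root. The function names and the default timings are made up for illustration.

```python
import subprocess
import time


def rtcwake_command(wake_after_s: int) -> list[str]:
    # Assumption: rtcwake (util-linux) is available and the RTC can wake the machine.
    # "-m mem" suspends through /sys/power/state, "-s" sets the wake-up delay in seconds.
    return ["rtcwake", "-m", "mem", "-s", str(wake_after_s)]


def run_cycles(cycles: int, wake_after_s: int = 20, pause_s: float = 10.0,
               runner=subprocess.run, sleep=time.sleep) -> int:
    """Run suspend/resume cycles and return how many of them completed."""
    completed = 0
    for i in range(1, cycles + 1):
        # Printed before suspending, so the last visible line names the cycle that hung.
        print(f"cycle {i}/{cycles}: suspending", flush=True)
        runner(rtcwake_command(wake_after_s), check=True)
        completed += 1
        # Give the system a moment to settle before the next suspend.
        sleep(pause_s)
    return completed
```

Calling run_cycles(4) from a root Python session on a TTY attempts four cycles. If the machine freezes, the call never returns and the last "cycle N" line on screen shows which resume did not come back.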

Detailed Description

In most cases, the freeze already occurs after the 2nd suspend-resume-cycle.
Unfortunately, this freeze also appears to block IO, as after a forced hard reboot I cannot retrieve relevant information from journalctl -b "-1". The last retrievable log messages are from the successful suspend action.
I welcome any recommendation on how to retrieve valuable information. I have not yet played around with cmdline parameters relevant for debug.
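
One way to check what the previous boot's journal still contains is to count suspend entries that never got a matching exit. This is a rough sketch under assumptions: it relies on the kernel writing "PM: suspend entry" and "PM: suspend exit" lines to its log, and the helper names are hypothetical. As described above, the journal may simply stop after the last successful suspend, so this can only confirm where the log ends, not why.

```python
import subprocess


def previous_boot_kernel_log() -> str:
    # journalctl flags: "-b -1" selects the previous boot, "-k" limits output to kernel messages.
    result = subprocess.run(
        ["journalctl", "-b", "-1", "-k", "--no-pager", "-o", "short-monotonic"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def unmatched_suspend_entries(log_text: str) -> int:
    """Count 'suspend entry' lines that have no later 'suspend exit' line."""
    # Assumption: the kernel logs "PM: suspend entry" and "PM: suspend exit".
    pending = 0
    for line in log_text.splitlines():
        if "PM: suspend entry" in line:
            pending += 1
        elif "PM: suspend exit" in line and pending > 0:
            pending -= 1
    return pending
```

A result of 1 from unmatched_suspend_entries(previous_boot_kernel_log()) would mean the log ends inside a suspend cycle, which matches a hang during suspend or resume.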

System Details

Distro: NixOS 23.11, kernel 6.8 rc1-4 compiled manually
kernel: Linux version 6.8.0-rc4 (nixbld@localhost) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.40) #1-NixOS SMP PREEMPT_DYNAMIC Sun Feb 11 20:18:13 UTC 2024
hardware: Framework 13 laptop, 7840 AMD series, CPU AMD Ryzen 7 7840U x86_64

In the attachments you find:

  • the used kernel config
  • my cmdline params
  • 2 screenshots of the bug occurring when triggered from a tty

Next Steps

I intend to bisect the issue, but this can take a while due to the need for manual testing and a kernel compile cycle requiring >30min for now.
I am mostly reporting this now already to still raise awareness in the RC phase and to have something where I can point other Framework laptop users towards for reproduction of the bug.
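
To get a feeling for how long such a bisect takes: git bisect needs roughly log2(N) test steps for a range of N commits, and each step here costs one kernel build. The sketch below only does that arithmetic. The commit count in the usage note is a made-up illustrative number, not the real size of the 6.7 to 6.8 range, and the time per step ignores the manual suspend testing.

```python
import math


def bisect_estimate(commits: int, minutes_per_step: float = 30.0) -> tuple[int, float]:
    """Return (approximate number of bisect steps, total build time in hours)."""
    # A binary search over N candidates needs about ceil(log2(N)) probes.
    steps = math.ceil(math.log2(commits)) if commits > 1 else 0
    return steps, steps * minutes_per_step / 60.0
```

For a hypothetical range of 15000 commits, bisect_estimate(15000) returns (14, 7.0), i.e. about 14 builds and 7 hours of compile time at 30 minutes per build.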



I can confirm this behavior on 6.8 rc4 on NixOS. Also happened on rc1 to rc3


Am also reproducing this on Guix with rc3 and rc4. I’ve tried a rebased version of [1] because it seemed very similar, but it didn’t help. I’m also running with a couple of patches [2, 3, 4] on top, but will soon try with a clean kernel.

You can reproduce this very easily with amd_s2idle.py, just let it run a couple of suspend/resume cycles and it will hang. As reported above, the system is completely hung, SysRq doesn’t work and nothing useful is kept in logs.

As for bisection, I think I’ll just try to build the kernel manually so that rebuilds are less costly, compared to just using the Nix/Guix approach.

[1] [PATCH v4 1/3] Revert "drm/amd: flush any delayed gfxoff on suspend entry" - Mario Limonciello
[2] Flickering coloured glitchyness on 780M Phoenix iGPU with 6.7.0 kernel on Xorg/plasma (#3097) · Issues · drm / amd · GitLab
[3] [PATCH V13 0/7] amd-pstate preferred core - Meng Li
[4] [PATCH v2 0/4] platform/chrome: cros_ec_lpc: add support for AMD Framework Laptops - Dustin L. Howett

I got CC’ed into your email thread today by Thorsten. I’ll copy what I posted to the email thread there:

There have been a lot of regressions in 6.8-rc both in GPU scheduler, MM
and AMDGPU.

If it’s not already fixed with the stuff that’s going into 6.8-rc5 this
weekend this is going to be relatively difficult to bisect.

Here is what I would suggest:

  1. Test 6.8-rc4 + these patches:

These two patches are headed into 6.8-rc5.

  2. Test 6.8-rc4 + those 2 above patches + these two from drm-misc-fixes

These two patches are headed into 6.8-rc5.

  3. If that doesn’t work, then do a bisect, but you’ll need to apply the
    following for each (applicable) step.

I’ve posted a possible solution to the platform-x86 mailing list.
[PATCH] platform/x86/amd/pmf: Fix a suspend hang on Framework 13 (kernel.org)


I can confirm this fixes the issue.
Let’s hope this one makes it into the 6.8 release.


Great :smiley:. There’s enough time with one more RC, it should make it.


Damn, I just finished bisecting and ended up on those PMF commits. Well, it’s nice we have a fix anyway :slight_smile:


Relying on that “should”, I guess I can mark this as resolved.

Can confirm this is fixed on 6.8-rc6.
But there are still other concerning issues in dmesg:

kernel: ACPI: thermal: [Firmware Bug]: No valid trip points!
kernel: ACPI: thermal: [Firmware Bug]: No valid trip points!
kernel: i2c_hid_acpi i2c-FRMW0004:00: device did not ack reset within 1000 ms
kernel: i2c_hid_acpi i2c-FRMW0005:00: device did not ack reset within 1000 ms

Thank you very much for fixing this! :slight_smile:

I also ran into this bug with the development version of (K)Ubuntu 24.04, which now ships some kind of 6.8 kernel. Unfortunately there is no way to tell which version exactly (Ubuntu already called it “6.8.0-11” when 6.8-rc4 was the newest available version). But this makes me hopeful that Ubuntu 24.04 will get a fixed kernel in the foreseeable future.

Tried 6.8 release and experiencing multiple freezes (even without using suspend). Anyone reproducing it?

Although a custom kernel, I’m on 6.8 and have not experienced anything like that.

I’m running the Ubuntu 24.04 daily build fully up to date and I’m experiencing this too. The kernel is 6.8.0-11-generic #11-Ubuntu SMP PREEMPT_DYNAMIC. My FW16 suspends & resumes perfectly the first time but when I click suspend the 2nd time it always hard locks up the entire machine and I have to hold down the power button until it turns off.

I realize Ubuntu 24.04 isn’t the supported path yet but I’m happy to help debug/troubleshoot it when the final 24.04 is released in a few weeks.

6.8.0-20 has the fix.


Should I be using linux-generic-hwe-24.04 (6.8.0-20) or linux-generic-oem-24.04 (6.8.0-20) ?

It doesn’t matter which right now. But it’s fixed in all the -20 images. -11 isn’t final 6.8.0, -20 is.
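
For anyone unsure which side of that line their installed Ubuntu kernel is on, the ABI number can be read out of the release string. This is a small sketch with hypothetical helper names. It encodes only what this thread states (6.8.0-11 is affected, 6.8.0-20 has the fix) and assumes that later 6.8.0 ABI numbers keep the fix. It deliberately refuses to judge other kernel versions.

```python
import re


def ubuntu_abi(release: str) -> tuple[str, int]:
    """Split e.g. '6.8.0-11-generic' into ('6.8.0', 11)."""
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    return match.group(1), int(match.group(2))


def has_suspend_fix(release: str) -> bool:
    # Per this thread: 6.8.0-11 is affected, 6.8.0-20 carries the fix.
    # Assumption: every later 6.8.0 ABI number keeps the fix.
    base, abi = ubuntu_abi(release)
    if base != "6.8.0":
        raise ValueError(f"only Ubuntu 6.8.0 kernels are covered, got {base}")
    return abi >= 20
```

The release string of the running kernel is what uname -r prints, or platform.release() in Python, so has_suspend_fix(platform.release()) gives the answer on a live system.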


Got it. I installed -hwe 6.8.0-20 and suspend/resume now works every time!