Linux FW16 will not power on after suspend

lbkNhubert · May 25, 2024, 2:51pm

This might help to debug: scripts/amd_s2idle.py · master · drm / amd · GitLab

Jason_Rivard · May 25, 2024, 3:17pm

@lbkNhubert thanks! Summary below. Attaching text files is blocked and the full output is too big to paste here. Is there anything specific I should look for? I don’t know how to read most of it. The only thing that jumps out at me is the tainted kernel message. I’m running the stock OS kernel so I’m not sure what that’s about. Does this help?

Debugging script for s2idle on AMD systems
💻 Framework Laptop 16 (AMD Ryzen 7040 Series) (16in Laptop) running BIOS 3.3 (03.03) released 03/27/2024 and EC unknown
🐧 openSUSE Tumbleweed
🐧 Kernel 6.9.1-1-default
🔋 Battery BAT1 (NVT FRANDBA) is operating at 101.37% of design
Checking prerequisites for s2idle
✅ Logs are provided via systemd
✅ AMD Ryzen 7 7840HS w/ Radeon 780M Graphics (family 19 model 74)
✅ LPS0 _DSM enabled
✅ ACPI FADT supports Low-power S0 idle
✅ HSMP driver `amd_hsmp` not detected (blocked: False)
✅ PMC driver `amd_pmc` loaded (Program 0 Firmware 76.82.0)
✅ USB4 driver `thunderbolt` bound to 0000:c3:00.5
✅ USB4 driver `thunderbolt` bound to 0000:c3:00.6
✅ GPU driver `amdgpu` bound to 0000:c1:00.0
✅ System is configured for s2idle
✅ NVME Sandisk Corp WD Black SN850X NVMe SSD is configured for s2idle in BIOS
✅ GPIO driver `pinctrl_amd` available
❌ Kernel is tainted: 4096
Your system does not meet s2idle prerequisites!
Explanations for your system
🚦 Kernel is tainted
        A tainted kernel may exhibit unpredictable bugs that are difficult for this script to characterize.
        If this is intended behavior run the tool with --force.

For more information on this failure see:
        https://gitlab.freedesktop.org/drm/amd/-/issues/3089
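
For anyone else puzzling over the taint number: it is a bitmask, and the value 4096 above is bit 12 on its own, which the kernel documentation lists as an externally-built (out-of-tree) module having been loaded. A minimal sketch of decoding it, assuming the bit table from the kernel's tainted-kernels documentation (bits 0 to 18); `decode_taint` is a hypothetical helper name, not part of the AMD script. The live value on a running system is in /proc/sys/kernel/tainted.

```python
# Taint bits 0-18 with their letters, following the kernel's
# Documentation/admin-guide/tainted-kernels.rst table.
TAINT_FLAGS = [
    ("P", "proprietary module was loaded"),
    ("F", "module was force loaded"),
    ("S", "kernel running on an out-of-specification system"),
    ("R", "module was force unloaded"),
    ("M", "processor reported a Machine Check Exception"),
    ("B", "bad page referenced or unexpected page flags"),
    ("U", "taint requested by userspace application"),
    ("D", "kernel died recently (OOPS or BUG)"),
    ("A", "ACPI table overridden by user"),
    ("W", "kernel issued a warning"),
    ("C", "staging driver was loaded"),
    ("I", "workaround for bug in platform firmware applied"),
    ("O", "externally-built (out-of-tree) module was loaded"),
    ("E", "unsigned module was loaded"),
    ("L", "soft lockup occurred"),
    ("K", "kernel has been live patched"),
    ("X", "auxiliary taint, defined for and used by distros"),
    ("T", "kernel was built with the struct randomization plugin"),
    ("N", "an in-kernel test has been run"),
]


def decode_taint(value):
    """Return (bit, letter, reason) for every taint bit set in value."""
    return [
        (bit, letter, reason)
        for bit, (letter, reason) in enumerate(TAINT_FLAGS)
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # 4096 is the value reported by the script output in this thread.
    for bit, letter, reason in decode_taint(4096):
        print(f"bit {bit} ({letter}): {reason}")
```

If that reading is right, the thing to look for is a module that did not ship with the distribution kernel.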

lbkNhubert · May 25, 2024, 4:03pm

Just updated the script and ran it on my setup and got the all-clear. I definitely would look at why the script is complaining about the kernel. Maybe first run the script with the --force flag as noted and see what it reports then. Also take a look at the issue noted - looks like in that case it was a kernel module causing problems with the script.

Jason_Rivard · May 26, 2024, 12:09am

Thanks again for the help! I did some digging and figured out that the virtualbox packages include a kernel module, and that is enough to cause the tainted kernel message. So I uninstalled virtualbox and the tainted kernel message no longer appears in the debug script. Also, sorry about before: I didn’t understand that the script actually runs a test and doesn’t just collect a debug log. When I run it now I get all green:

Debugging script for s2idle on AMD systems
💻 Framework Laptop 16 (AMD Ryzen 7040 Series) (16in Laptop) running BIOS 3.3 (03.03) released 03/27/2024 and EC unknown
🐧 openSUSE Tumbleweed
🐧 Kernel 6.9.1-1-default
🔋 Battery BAT1 (NVT FRANDBA) is operating at 101.91% of design
Checking prerequisites for s2idle
✅ Logs are provided via systemd
✅ AMD Ryzen 7 7840HS w/ Radeon 780M Graphics (family 19 model 74)
✅ LPS0 _DSM enabled
✅ ACPI FADT supports Low-power S0 idle
✅ HSMP driver `amd_hsmp` not detected (blocked: False)
✅ PMC driver `amd_pmc` loaded (Program 0 Firmware 76.82.0)
✅ USB4 driver `thunderbolt` bound to 0000:c3:00.5
✅ USB4 driver `thunderbolt` bound to 0000:c3:00.6
✅ GPU driver `amdgpu` bound to 0000:c1:00.0
✅ System is configured for s2idle
✅ NVME Sandisk Corp WD Black SN850X NVMe SSD is configured for s2idle in BIOS
✅ GPIO driver `pinctrl_amd` available
How long should suspend cycles last in seconds (default 10)?

When I continue forward it successfully (I think) enters sleep mode, but like always it won’t resume and must be powered off.

For grins I also tried an Ubuntu 24.04 (as opposed to Kubuntu previously) live CD and got exactly the same results: suspend works, but the machine won’t resume and must be powered off.

Any other debugging ideas? Thanks again!

lbkNhubert · May 26, 2024, 1:04am

No need to apologize! I am in over my head at this point. I have used this page to try to debug hibernate, it has notes for suspend as well: Debugging hibernation and suspend — The Linux Kernel documentation. You might see if there is anything in the system logs, and also might enable sysrq (Keyboard shortcuts - ArchWiki) in order to try to drop to a shell when it is stuck. Not sure if that will work but you might try it.

Mario_Limonciello · May 26, 2024, 11:28am

Maybe try an NVMe firmware upgrade. Most vendors don’t publish to LVFS unfortunately, but you can use fwupdmgr to at least confirm the current version.

Btw if this helps can you please note the version you had before and after? I think we should flag it in the script. You can just capture fwupdmgr get-devices output before and after the nvme firmware upgrade.
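
One way to do the before/after capture is sketched below, under some assumptions: fwupd is installed, `fwupdmgr get-devices` runs without prompting, and the file names are placeholders. The helper names `capture_devices` and `diff_reports` are hypothetical. The sketch makes no assumption about the layout of the fwupdmgr output; it only saves the text and does a plain line diff.

```python
import difflib
import subprocess


def capture_devices(path):
    """Save the output of `fwupdmgr get-devices` to path and return it."""
    result = subprocess.run(
        ["fwupdmgr", "get-devices"],
        capture_output=True,
        text=True,
        check=True,
    )
    with open(path, "w") as handle:
        handle.write(result.stdout)
    return result.stdout


def diff_reports(before, after):
    """Return unified-diff lines between two captured reports."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            "before",
            "after",
            lineterm="",
        )
    )
```

Call `capture_devices("fwupd-before.txt")` before the upgrade and `capture_devices("fwupd-after.txt")` afterwards, then pass the two strings to `diff_reports` and post the changed lines.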

Jason_Rivard · May 27, 2024, 2:03am

Thanks for the idea Mario! I did indeed have a firmware update available for my WD SN850X SSD 4TB. I went from 624331WD to 624361WD (it required booting into Windows). Unfortunately it didn’t fix the issue; I still can’t power on after suspend.

Mario_Limonciello · May 27, 2024, 2:54am

Is it happening in Windows too, even after the NVMe firmware upgrade? Or is the failure limited to Linux?

Mario_Limonciello · May 27, 2024, 2:56am

Also, TPM disabled - how? From BIOS? If so, can you please re-enable it and try again?

Jason_Rivard · May 27, 2024, 9:46am

It works in Windows, both before and after the NVMe firmware upgrade. It fails on Linux, on every distro I’ve tested so far.

Jason_Rivard · May 27, 2024, 9:46am

Correct, TPM disabled via BIOS. It fails with it enabled or disabled.

Mario_Limonciello · May 27, 2024, 12:11pm

Can you please try amd_iommu=off on kernel command line for Linux?

Jason_Rivard · May 27, 2024, 4:38pm

@Mario_Limonciello thanks for the suggestion! I tried amd_iommu=off on kernel grub options and it had no effect. Machine still suspends but does not power on after suspend.
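
When a boot option appears to have no effect, it can be worth confirming that it actually reached the running kernel, since the command line the kernel booted with is exposed in /proc/cmdline. A small sketch of that check; `has_kernel_param` is a hypothetical helper name, and it only does an exact match on whitespace-separated tokens.

```python
def has_kernel_param(cmdline, param):
    """Return True if param appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

On the affected machine, `has_kernel_param(running_cmdline(), "amd_iommu=off")` should return True if the GRUB edit took effect for the current boot.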

Mario_Limonciello · May 27, 2024, 4:49pm

Ok thanks.

Can you go back to your Ubuntu 24.04 live image and try there again since you did your nvme firmware update? If that’s still failing too then I think you should open a support case with Framework.

Sean_Heath · May 28, 2024, 1:57am

I’m getting the same thing. I have the GPU module. Suspend works with no errors but the computer never resumes. Even the amd script fails to resume after the timeout.

I tried Arch, EndeavourOS, and Fedora 40. Same result on all. Tried 3 different NVME SSDs. No change.

Mario_Limonciello · May 28, 2024, 2:22am

Do you guys have the same NVME by chance? And if so can you compare F/W versions to see if you guys both have tried the same versions?

Jason_Rivard · May 29, 2024, 12:18am

This is my NVME info:

Model Number: WD_BLACK SN850X 4000GB
Firmware Version: 624361WD

Andrew_Worsley · June 7, 2024, 1:44am

I am getting flaky behaviour on Debian 12.5 bookworm with both suspend and hibernate.

It will often suspend or hibernate (not always), but 10-30s after resuming it will black screen / freeze. Then it will sometimes reboot on its own, or occasionally I need to hold the power button to force it off.

Not sure if it is related, but the zoom-client snap (the official zoom client has big library dependency issues on bookworm) will cause similar problems, sometimes making it all the way into attaching to a meeting before black screening / rebooting.

It would be nice if there was a way to trap this crash rather than just getting a black screen and an automatic reboot, i.e. get a kernel oops or something more meaningful. I figure modern hardware must be able to trap whatever the issue is and save a log to some persistent store / memory?
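
On the persistent-store question: the kernel does have a pstore filesystem, normally mounted at /sys/fs/pstore, where records saved from a previous crash show up as files after the next boot. Whether anything lands there depends on a pstore backend being available and enabled on this particular machine, which is an assumption here and not something confirmed in this thread. A minimal sketch for checking the directory; `list_pstore_records` is a hypothetical helper name.

```python
import os


def list_pstore_records(directory="/sys/fs/pstore"):
    """Return sorted names of saved records, or [] if the directory is absent or unreadable."""
    try:
        return sorted(os.listdir(directory))
    except OSError:
        return []


if __name__ == "__main__":
    records = list_pstore_records()
    if records:
        for name in records:
            print(name)
    else:
        print("no pstore records found")
```

Reading the directory may need root, and an empty listing does not prove nothing crashed; it may only mean no backend captured it.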

Andrew_Worsley · June 8, 2024, 10:35am

Ok - I’m going slightly paranoid - I am now able to run the zoom client and hibernate without any crashes.

The only thing I did was to enable various debugging features (which are disabled by default on my Debian kernel!)

echo 1 > /proc/sys/kernel/panic_on_oops
echo 1 > /proc/sys/kernel/panic_on_rcu_stall
echo 1 > /proc/sys/kernel/sysrq

I also confirmed that the various watchdogs are enabled

# sysctl -a | grep watchdog
kernel.nmi_watchdog = 1
kernel.soft_watchdog = 1
kernel.watchdog = 1
kernel.watchdog_cpumask = 0-15
kernel.watchdog_thresh = 10
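
Values written to /proc/sys with echo do not survive a reboot, so it may be worth re-checking them after each restart. A small sketch that reads the three settings above back; `sysctl_path` and `read_sysctl` are hypothetical helper names, and the dot-to-slash mapping is the simple case that holds for `kernel.*` names.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as kernel.sysrq to its file under root."""
    return os.path.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        with open(sysctl_path(name, root)) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in (
        "kernel.panic_on_oops",
        "kernel.panic_on_rcu_stall",
        "kernel.sysrq",
    ):
        print(f"{key} = {read_sysctl(key)}")
```

A value of None means the file was missing or unreadable on that kernel, not that the feature is off.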

Andrew_Worsley · June 8, 2024, 10:46am

Ok - I had the power cable plugged in for all of the above.

Pretty much as soon as I unplugged the power cable (as it was fully charged) it then black screened !

No logs or anything I could see - just black screen then reboot…