[TRACKING] AMD Framework 13 suspend/resume regression in 6.9rc1

Was trying out prefcore, which landed in this version but it has broken suspend/resume.

Bug here:

https://bugzilla.kernel.org/show_bug.cgi?id=218641

Can you also reproduce using https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py?

Any errors?

Debugging script for s2idle on AMD systems
💻 Framework Laptop 13 (AMD Ryzen 7040Series) (Laptop) running BIOS 3.3 (03.03) released 10/17/2023 and EC unknown
🐧 Arch Linux
🐧 Kernel 6.9.0-rc1-1-mainline
🔋 Battery BAT1 ( ) is operating at 98.21% of design
Checking prerequisites for s2idle
🚦Logs are provided via dmesg, timestamps may not be accurate over multiple cycles
✅ AMD Ryzen 5 7640U w/ Radeon 760M Graphics (family 19 model 74)
✅ LPS0 _DSM enabled
✅ ACPI FADT supports Low-power S0 idle
✅ HSMP driver `amd_hsmp` not detected (blocked: False)
✅ PMC driver `amd_pmc` loaded (Program 0 Firmware 76.70.0)
✅ USB4 driver `thunderbolt` loaded
✅ GPU driver `amdgpu` available
✅ System is configured for s2idle
✅ NVME Micron/Crucial Technology P2 [Nick P2] / P3 / P3 Plus NVMe PCIe SSD (DRAM-less) is configured for s2idle in BIOS
✅ GPIO driver `pinctrl_amd` available
How long should suspend cycles last in seconds (default 10)? 
How long to wait in between suspend cycles in seconds (default 4)? 
How many suspend cycles to run (default 1)? 
Started at 2024-03-25 22:44:15.452054 (cycle finish expected @ 2024-03-25 22:44:29.452064)
Results from last s2idle cycle
○ Woke up from IRQ 9: ACPI SCI
○ gpe0B increased from 155 to 163
✅ Userspace suspended for 0:00:14.902905
✅ In a hardware sleep state for 0:00:07.079467 (47.50%)
🔋 Battery BAT1 ( ) is operating at 98.21% of design

Seems to be OK, but i did run it a few times and one time it “half” waked up, where the screen was on but the power button was still throbbing.

Can you paste the whole thing to gist or somewhere? I want to see if there are any clocksource messages.

Then also try scheduling a wakeup for 15s and then use touchpad or power button before the around 5-7s. Does it wake from the earlier event? If not then this is how you bisect it.

Ok so here’s 2 attempts, one not touching anything(successful wake up)… and second was trying what you suggested… that was interesting… the machine didn’t wake up when pressing the keys on the keyboard or the power button but when it did on the schedule the screen turned on but the power button still throbbed and the keyboard was typing seemingly random characters until I had to reboot.

Can you install the journald python module so it gets the log from the journal instead of dmesg? It’s missing the juicy stuff and I think it will be in that.

There’s a really similar regression I’m looking at on a non-FW but AMD laptop that I think might be the same root cause.

I’m still in the middle of a bisect, but really close. This is my log so far:

$ git bisect log 
git bisect start '--term-new=broken' '--term-old=fixed'
# broken: [4cece764965020c22cff7665b18a012006359095] Linux 6.9-rc1
git bisect broken 4cece764965020c22cff7665b18a012006359095
# fixed: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
git bisect fixed e8f897f4afef0031fe618a8e94127a0934896aba
# broken: [e5e038b7ae9da96b93974bf072ca1876899a01a3] Merge tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect broken e5e038b7ae9da96b93974bf072ca1876899a01a3
# broken: [1f440397665f4241346e4cc6d93f8b73880815d1] Merge tag 'docs-6.9' of git://git.lwn.net/linux
git bisect broken 1f440397665f4241346e4cc6d93f8b73880815d1
# broken: [508f34f2381eb84b2335abb970b940aefef50a19] Merge tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
git bisect broken 508f34f2381eb84b2335abb970b940aefef50a19
# fixed: [8ede842f669b6f78812349bbef4d1efd0fbdafce] Merge tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux
git bisect fixed 8ede842f669b6f78812349bbef4d1efd0fbdafce
# broken: [bfdb395a7cde12d83a623949ed029b0ab38d765b] Merge tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect broken bfdb395a7cde12d83a623949ed029b0ab38d765b
# fixed: [80a76c60e5f6361c497d464bb6da6ea07e908a0e] Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect fixed 80a76c60e5f6361c497d464bb6da6ea07e908a0e
# fixed: [f0551af021308a2a1163dc63d1f1bba3594208bd] x86/topology: Ignore non-present APIC IDs in a present package
git bisect fixed f0551af021308a2a1163dc63d1f1bba3594208bd
# broken: [a478ffb2ae234ee1ece2b84719762c54d304e2c7] tick: Move individual bit features to debuggable mask accesses
git bisect broken a478ffb2ae234ee1ece2b84719762c54d304e2c7
# fixed: [8e7e247f64a1e0fee430aba28d9108f7598eb237] timers: Introduce add_timer() variants which modify timer flags
git bisect fixed 8e7e247f64a1e0fee430aba28d9108f7598eb237
# broken: [36e40df35d2c1891fe58241640c7c95de4aa739b] timer_migration: Add tracepoints
git bisect broken 36e40df35d2c1891fe58241640c7c95de4aa739b

So it appears to be related to something with changes to the tick if it’s the same.

1 Like

I updated the gist with a version with the systemd python module installed.

Yup this is nearly identical to what I’m seeing on the non-FW laptop:

You:

2024-03-25 23:19:59,551 DEBUG:  clocksource: Long readout interval, skipping watchdog check: cs_nsec: 3626989790 wd_nsec: 3626988345
2024-03-25 23:19:59,551 DEBUG:  ACPI: \_SB_.PEP_: Successfully transitioned to state lps0 ms exit
2024-03-25 23:19:59,551 DEBUG:  ACPI: \_SB_.PEP_: Successfully transitioned to state screen on
2024-03-25 23:19:59,551 DEBUG:  ACPI: EC: interrupt unblocked
2024-03-25 23:19:59,551 DEBUG:  clocksource: Long readout interval, skipping watchdog check: cs_nsec: 9814869193 wd_nsec: 9814866622

Me:

Mar 25 17:22:57 fedora kernel: ACPI: PM: Wakeup after ACPI Notify sync
Mar 25 17:22:57 fedora kernel: PM: resume from suspend-to-idle
Mar 25 17:22:57 fedora kernel: PM: Triggering wakeup from IRQ 0
Mar 25 17:22:57 fedora kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 11142607333 wd_nsec: 11142706334
Mar 25 17:22:57 fedora kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 23999996196 wd_nsec: 24000208822
Mar 25 17:22:57 fedora kernel: ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0.EMBF due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
Mar 25 17:22:57 fedora kernel: ACPI Error: Aborting method \_SB.PEP._DSM due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
Mar 25 17:22:57 fedora kernel: ACPI: \_SB_.PEP_: failed to evaluate _DSM 560de011-64ce-ce47-837b-1f898f9aa461 (0x3021)
Mar 25 17:22:57 fedora kernel: ACPI: \_SB_.PEP_: Failed to transitioned to state lps0 exit
1 Like

Above my paygrade :joy: but if you need anything more let me know.

Should I close the bugzilla bug?

No; it’s a real issue. I’m about out of time to look further today, if you want to pick up the bisect where I left off it’s really close.

1 Like

I might have time to play around tomorrow, it’s bedtime in Ireland. Thanks for the quick response.

Tracking this as well so I can work on a repro this week.

I noticed similar behaviour running 6.9 master kernel last week; but think it was masked by the atkbd issue as I had the ec patches on the same tree.

I finished the bisect.

commit 7ee988770326fca440472200c3eb58935fe712f6
Author: Anna-Maria Behnsen <anna-maria@linutronix.de>                          
Date:   Thu Feb 22 11:37:10 2024 +0100                                                   
                                                                                         
    timers: Implement the hierarchical pull model

Reverting that on 6.9-rc1 helps.

8 Likes

A silly question maybe, but: will prefcore make any difference to the 7840U?
The cores are all the same, there’s no Performance vs Efficiency oriented cores.

Thanks

This series I don’t expect any meaningful differences. There is going to be a part 2 later that ties it to scheduler changes. This will bring more improvements.

2 Likes