There is a very similar kernel bug report for this issue on some other products. AFAIK AMD has never reproduced it, and it has only been seen in the two reports there previously.
There is a debugging patch attached specifically to that bug report. Would any of you who can reproduce this issue mind rebuilding your kernel with that patch? If you reproduce the issue, it will add a lot more context about the situation that led to it, which could help find what is actually wrong in the kernel when this happens.
@Thomas_Weissschuh I also had a bunch of "Unable to read current time from RTC" errors:
Tue 2023-11-14 23:46:33 PST angua kernel: PM: suspend entry (s2idle)
Tue 2023-11-14 23:46:34 PST angua rtkit-daemon[1675]: Successfully made thread 8887 of process 8852 (/usr/bin/gnome-shell) owned by '1000' high priority at nice level 0.
Tue 2023-11-14 23:46:34 PST angua kernel: Filesystems sync: 0.021 seconds
Tue 2023-11-14 23:46:34 PST angua rtkit-daemon[1675]: Successfully made thread 8887 of process 8852 (/usr/bin/gnome-shell) owned by '1000' RT at priority 20.
Tue 2077-09-28 18:41:15 PDT angua kernel: Freezing user space processes
Tue 2077-09-28 18:41:16 PDT angua kernel: Freezing user space processes completed (elapsed 0.001 seconds)
Tue 2077-09-28 18:41:16 PDT angua kernel: OOM killer disabled.
Tue 2077-09-28 18:41:16 PDT angua kernel: Freezing remaining freezable tasks
Tue 2077-09-28 18:41:16 PDT angua kernel: Freezing remaining freezable tasks completed (elapsed 0.058 seconds)
Tue 2077-09-28 18:41:16 PDT angua kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Tue 2077-09-28 18:41:16 PDT angua kernel: queueing ieee80211 work while going to suspend
Tue 2077-09-28 18:41:16 PDT angua kernel: PM: suspend devices took 0.179 seconds
Tue 2077-09-28 18:41:16 PDT angua kernel: ACPI: EC: interrupt blocked
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
but no mach_set_cmos_time in the journal at all.
@Mario_Limonciello I can’t promise I’ll succeed in building a patched kernel; I haven’t done this since literally the last millennium. I think I’ll follow the Fedora guide.
@Loell_Framework something to run by you and/or Kieran (trying to limit tagging to people already in the thread): the kernel bug entry that Mario mentioned indicates that the EC can, in general, have an indirect effect on RTC behavior/use during s2idle. I have a couple of spare rechargeable cells on hand, just in case, for the two 11th-gen machines in the household. Would it hurt, or be worth a try, to install one in this AMD machine’s empty holder to see if it has any effect? Also, any EC thoughts about this clock issue in general?
IIRC the Framework EC is connected over eSPI, through which it’s possible to read RTC time values. Given that all these failures happen around the s2idle sequence, is it plausible that the EC is requesting RTC time values at the same time as Linux?
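To make that hypothesis concrete, here is a small deterministic sketch. It assumes, hypothetically, that the host and the EC reach the RTC through one shared index/data register pair (like the legacy CMOS ports) with no lock shared between them. It is a model of the race only, not a description of how the Framework EC actually accesses the RTC; `SharedRtcPort`, `host_read`, and the register values are made up for illustration.

```python
# Hypothetical model: a CMOS-style RTC reached through one shared
# index/data register pair, with no lock shared by the two agents.
# Offsets follow the classic MC146818 layout; the values are invented.
REG_SECONDS = 0x00
REG_YEAR = 0x09


class SharedRtcPort:
    """One index/data pair shared by the host and the (hypothetical) EC."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.index = 0

    def select(self, reg):
        self.index = reg

    def read(self):
        return self.registers[self.index]


def host_read(port, reg, ec_interleave=None):
    """Host selects a register, then reads it. If ec_interleave is given,
    the EC selects that register between the host's two steps."""
    port.select(reg)
    if ec_interleave is not None:
        port.select(ec_interleave)  # EC access lands between host steps
    return port.read()


port = SharedRtcPort({REG_SECONDS: 0x33, REG_YEAR: 0x23})  # BCD 33 s, year 23
print(hex(host_read(port, REG_SECONDS)))                          # 0x33
print(hex(host_read(port, REG_SECONDS, ec_interleave=REG_YEAR)))  # 0x23
```

Inside Linux, CMOS accesses are serialized by a kernel lock, but a lock in the kernel cannot serialize accesses made by a separate agent such as firmware, which is why this kind of interleaving would be hard to rule out from the OS side alone.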
Yeah, as far as I can tell, the Framework cros_ec_lpc patches that went in around the 6.2 series don’t support the newer EC in the AMD Framework.
Support for it is certainly not in any of the mainline trees, if it exists at all. I’ve asked whether cros_ec_lpc loads with the magic OEM kernel people mention for Ubuntu, but I haven’t found anything in any of the trees I’ve looked through.
There is an ec_tool EFI loadable I’ve tried, and it also doesn’t support the EC on the AMD Framework; it just reports an invalid checksum.
I noticed that I linked the wrong debugging patch (sorry!). I edited the post.
So if anyone has already built a kernel with it, please pick up the corrected patch and rebuild.
The linked patch significantly increases the number of iterations mc146818_avoid_UIP will try, and logs a warning when more than 100 are needed. With this patch in place, if you reproduce the issue you’ll see a warning like this in your logs:
reading the RTC time required %d loop iterations
But hopefully your clock won’t jump forward. Please share logs with the patch in place so we can see how many iterations were required.
I’ve built a new version of my patched kernel with this against the Fedora 6.7-rc2 os-build tree, and removed the RTC kernel flag. I’ll let you know if I encounter any time skipping.
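The retry-and-warn behavior described above can be sketched as a user-space simulation. This is not the kernel’s mc146818_avoid_UIP; `FakeRtc`, `read_avoiding_uip`, and the per-retry delay are hypothetical, and only the over-100 warning and its message text come from the patch description. If each retry waited about 100 µs (an assumption here), 100 iterations would amount to roughly 10 ms.

```python
# Hypothetical constants: the per-retry delay is an assumption; only the
# "warn above 100 iterations" threshold comes from the patch description.
WARN_THRESHOLD = 100
ASSUMED_DELAY_US = 100


class FakeRtc:
    """Simulated RTC whose update-in-progress (UIP) flag stays set for
    `busy_polls` checks before clearing."""

    def __init__(self, busy_polls, seconds=42):
        self.busy_polls = busy_polls
        self.seconds = seconds

    def uip_set(self):
        if self.busy_polls > 0:
            self.busy_polls -= 1
            return True
        return False


def read_avoiding_uip(rtc, max_iterations, log):
    """Poll until UIP clears, then read. Returns (seconds, iterations),
    or (None, max_iterations) if UIP never cleared within the budget."""
    for i in range(1, max_iterations + 1):
        if rtc.uip_set():
            continue  # a real driver would wait ~ASSUMED_DELAY_US here
        if i > WARN_THRESHOLD:
            log.append(f"reading the RTC time required {i} loop iterations")
        return rtc.seconds, i
    return None, max_iterations


log = []
print(read_avoiding_uip(FakeRtc(busy_polls=3), 100, log))    # (42, 4)
print(read_avoiding_uip(FakeRtc(busy_polls=150), 100, log))  # (None, 100)
print(read_avoiding_uip(FakeRtc(busy_polls=150), 1000, log)) # (42, 151)
print(log)  # ['reading the RTC time required 151 loop iterations']
print(WARN_THRESHOLD * ASSUMED_DELAY_US / 1000)  # 10.0 (ms, assumed delay)
```

The second and third calls show the point of the debugging patch: with a small iteration budget the read gives up, while a larger budget succeeds and leaves a log line recording how long UIP actually stayed set.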
2023-11-21 17:49:38,716 DEBUG: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
2023-11-21 17:49:38,717 DEBUG: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
consistently during resume when running the amd_s2idle.py script. Is there an open bug in the AMD GitLab for this? It’s still there with the latest mainline patches and linux-firmware for Phoenix.
There are two sets of patches: one for using ACPI for the RTC alarm, and one for the UIP bit not clearing within 10 ms. Make sure you’ve got both in your test kernel if you’re not using the kernel command-line parameter.
I am still seeing these:
Functionally harmless, right?
Nothing is open in the AMD GitLab for this. FWIW, I believe in this case it’s caused by firmware included in the BIOS, not by Linux.