[RESPONDED] FW13 AMD Fedora 39: System clock advances 50+ years during overnight suspend

Mario_Limonciello · November 16, 2023, 3:07pm

There is a very similar kernel report for this issue on some other products. AFAIK AMD has never reproduced it, and it has only been seen in the two reports there previously.

There is a debugging patch specifically attached to that bug report. Any of you guys that can reproduce this issue, would you mind rebuilding your kernel with that patch? If you can reproduce the issue it will add a lot more context about the situation that led to it, which could be helpful in finding what is actually wrong in the kernel when this happens.

dimitris · November 16, 2023, 5:29pm

@Thomas_Weissschuh I also had a bunch of “Unable to read current time from RTC” messages:

Tue 2023-11-14 23:46:33 PST angua kernel: PM: suspend entry (s2idle)
Tue 2023-11-14 23:46:34 PST angua rtkit-daemon[1675]: Successfully made thread 8887 of process 8852 (/usr/bin/gnome-shell) owned by '1000' high priority at nice level 0.
Tue 2023-11-14 23:46:34 PST angua kernel: Filesystems sync: 0.021 seconds
Tue 2023-11-14 23:46:34 PST angua rtkit-daemon[1675]: Successfully made thread 8887 of process 8852 (/usr/bin/gnome-shell) owned by '1000' RT at priority 20.
Tue 2077-09-28 18:41:15 PDT angua kernel: Freezing user space processes
Tue 2077-09-28 18:41:16 PDT angua kernel: Freezing user space processes completed (elapsed 0.001 seconds)
Tue 2077-09-28 18:41:16 PDT angua kernel: OOM killer disabled.
Tue 2077-09-28 18:41:16 PDT angua kernel: Freezing remaining freezable tasks
Tue 2077-09-28 18:41:16 PDT angua kernel: Freezing remaining freezable tasks completed (elapsed 0.058 seconds)
Tue 2077-09-28 18:41:16 PDT angua kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Tue 2077-09-28 18:41:16 PDT angua kernel: queueing ieee80211 work while going to suspend
Tue 2077-09-28 18:41:16 PDT angua kernel: PM: suspend devices took 0.179 seconds
Tue 2077-09-28 18:41:16 PDT angua kernel: ACPI: EC: interrupt blocked
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC
Tue 2077-09-28 18:41:16 PDT angua kernel: Unable to read current time from RTC

but no mach_set_cmos_time in the journal at all.
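For anyone wanting to check their own journal for the same kind of jump, here is a minimal sketch that scans lines in the timestamp layout shown above and reports any pair of consecutive entries whose timestamps differ by more than a threshold. The function names and the one-year default threshold are my own choices for illustration, not anything from the kernel or systemd. It assumes the journal was exported in the same "weekday date time zone" layout as the excerpt (I believe `journalctl -o short-full` produces it, but treat that as an assumption), and it ignores the time zone abbreviation, so a PST/PDT switch shows up as at most an hour of error.

```python
from datetime import datetime, timedelta


def parse_journal_timestamp(line):
    """Return a naive datetime for a line such as
    'Tue 2023-11-14 23:46:33 PST angua kernel: ...', or None if it does not match.
    The time zone abbreviation is deliberately ignored."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def find_clock_jumps(lines, threshold=timedelta(days=365)):
    """Return (previous, current) timestamp pairs for consecutive journal
    entries that are further apart than `threshold`, in either direction."""
    jumps = []
    previous = None
    for line in lines:
        current = parse_journal_timestamp(line)
        if current is None:
            continue  # skip lines without a recognizable timestamp
        if previous is not None and abs(current - previous) > threshold:
            jumps.append((previous, current))
        previous = current
    return jumps
```

Feeding it the excerpt above should flag the single step from 2023-11-14 23:46:34 to 2077-09-28 18:41:15. The threshold is deliberately large so that an ordinary overnight or weekend suspend is not reported.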

@Mario_Limonciello I can’t promise I’ll succeed in building a patched kernel, haven’t done this since literally the last millennium. I think I’ll follow the Fedora guide.

@Loell_Framework something to run by you and/or Kieran (trying to limit tagging to people already in the thread): The kernel bug entry that Mario mentioned indicates that the EC can, in general, have an indirect effect on RTC behavior/use during s2idle. I have a couple of spare rechargeable cells available on a just-in-case basis for the two 11 gen machines in the household. Would it hurt/be worth a try to install one in this AMD machine’s empty holder to see if it has any effect? Also any EC thoughts about this clock issue in general?

Mario_Limonciello · November 16, 2023, 5:55pm

that the EC can, in general, have an indirect effect on RTC behavior/use during s2idle.

IIRC the Framework EC is connected over eSPI, through which it’s possible to read RTC time values. Given that all these failures are happening around the s2idle sequence, is it plausible that the EC is requesting RTC time values at the same time as Linux is?

Thomas_Weissschuh · November 16, 2023, 6:26pm

Maybe it’s relevant:

The probing of cros_ec_lpc fails.
The ID read via MEC is “0x00 0x00” and via non-MEC it’s “0xff 0xff”.

I applied the provided patch, maybe I can reproduce it.

qemu-system-x86_64 · November 16, 2023, 6:47pm

Johnny ?

jwp · November 16, 2023, 10:16pm

Yeah, as far as I can tell the Framework ec_sros_lpc patches that went in sometime around the 6.2 series don’t support the newer EC in the AMD Framework.

They are certainly not in any of the mainline trees, if they exist at all. I have asked whether ec_cros_lpc loads with the magic OEM kernel people mention for the Ubuntu distro, but I haven’t found anything in any of the trees I’ve looked through.

There is an ec_tool EFI loadable I’ve tried, and it also doesn’t support the EC on the AMD Framework; it spits out “invalid checksum”.

dimitris · November 16, 2023, 10:25pm

Just to check, did you mean cros_ec_lpcs?

$ lsmod |grep cros
cros_ec_lpcs           20480  0
cros_ec                20480  1 cros_ec_lpcs

$ dmesg |grep cros_ec
[   20.641961] cros_ec_lpcs cros_ec_lpcs.0: EC ID not detected
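If you want to run the same check in a script rather than by eye, a rough sketch along these lines would do it. It only parses text you have already captured from `lsmod` and `dmesg`; the helper names are made up for illustration, and the "EC ID not detected" string is simply the message quoted above, not something I have checked against every kernel version.

```python
def loaded_modules(lsmod_output, prefix="cros_ec"):
    """Return the names of modules in `lsmod` output whose name starts with `prefix`."""
    names = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The first column of lsmod output is the module name.
        if fields and fields[0].startswith(prefix):
            names.append(fields[0])
    return names


def ec_id_missing(dmesg_output):
    """Return True if any cros_ec line in `dmesg` output reports that the EC ID
    was not detected (the message text is taken from the output quoted above)."""
    return any(
        "cros_ec" in line and "EC ID not detected" in line
        for line in dmesg_output.splitlines()
    )
```

With the output above, the module is loaded but probing failed, so both checks come back positive: the driver is present, it just does not recognize this EC.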

Mario_Limonciello · November 16, 2023, 10:27pm

The cros-ec support for Framework AMD is this patch series: [PATCH v1 0/4] cros_ec: add support for newer versions of the Framework Laptop (kernel.org)

jwp · November 16, 2023, 10:34pm

Thanks @Mario_Limonciello, my Google-fu is not as good as yours.

This is still out of tree for 6.7 currently yah?

jwp · November 16, 2023, 10:35pm

@dimitris ; yup - dyslexia strikes again

Mario_Limonciello · November 16, 2023, 10:41pm

Well it helps that I was CC’ed on the series

Yes, Dustin didn’t submit a v2 AFAIK to take into account the trivial review feedback.

Mario_Limonciello · November 17, 2023, 12:06am

I noticed that I linked the wrong debugging patch (sorry!). I edited the post.
So if anyone has built a kernel with it, please pick it again and rebuild.

The patch that is linked significantly increases the number of iterations mc146818_avoid_UIP will try, and logs when it’s over 100. With this patch in place, if you have reproduced the issue you’ll see a warning in your logs:

reading the RTC time required %d loop iterations

But hopefully your clock doesn’t jump forward. Please share logs with that patch in place to see how many iterations it required.
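To make sharing the numbers easier, something like the following could pull the iteration counts out of a saved log. This is only a sketch: the regular expression is built from the message format quoted above, the function names are hypothetical, and it assumes the `%d` is rendered as a plain decimal number.

```python
import re

# Message format as quoted from the debugging patch description above;
# the %d placeholder is assumed to appear as a plain decimal integer.
ITERATIONS_RE = re.compile(r"reading the RTC time required (\d+) loop iterations")


def rtc_iteration_counts(log_text):
    """Return every iteration count reported by the debugging patch, in log order."""
    return [int(match.group(1)) for match in ITERATIONS_RE.finditer(log_text)]


def summarize(counts):
    """Return a small summary dict, or None if the warning never appeared."""
    if not counts:
        return None
    return {"warnings": len(counts), "min": min(counts), "max": max(counts)}
```

You would pass it the text of your kernel log (for example, saved `dmesg` or journal output) and post the summary along with the raw lines.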

Matt_Hartley · November 17, 2023, 12:21am

Looks like Mario has begun trucking through this, but I am CCing engineering now.

Mario_Limonciello · November 20, 2023, 2:23pm

I’ve sent this series up to the mailing list for this issue.

https://lore.kernel.org/linux-rtc/20231120141555.458-1-mario.limonciello@amd.com/T/#m5234a9a5cd4c320efa69fc591d626efa89c5bf5d

I have never reproduced the issue though so please let me know if you reproduced it with that patch series applied.

Matt_Hartley · November 20, 2023, 11:32pm

Amazing work, thank you!

jwp · November 21, 2023, 4:45am

Have run up a new build of my patched kernel with this against the fedora 6.7-rc2 os-build tree. And removed the rtc kernel flag - will let you know if I encounter any time skipping.

jwp · November 21, 2023, 4:56am

I am still seeing these:

2023-11-21 17:49:38,716 DEBUG:  [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
2023-11-21 17:49:38,717 DEBUG:  [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait

consistently during resume when running the amd_s2idle.py script; is there an open bug in the AMD GitLab for this? It’s still there with the latest mainline patches and linux-firmware for Phoenix.

Mario_Limonciello · November 21, 2023, 5:54am

And removed the rtc kernel flag - will let you know if I encounter any time skipping.

There are two sets of patches, one for using ACPI for RTC alarm and one for UIP clear not happening in 10ms. Make sure that you’ve got both in your test kernel if you’re not using the kernel command line parameter.

I am still seeing these:

Functionally harmless, right?

consistently during resume when running the amd_s2idle.py script; is there an open bug in the AMD GitLab for this? It’s still there with the latest mainline patches and linux-firmware for Phoenix.

Nothing is open in the AMD GitLab for this. FWIW I believe it’s caused by firmware included in the BIOS, not by Linux, in this case.

jwp · November 21, 2023, 8:54am

Ahh - for some reason patchwork is titling them the same :

https://patchwork-proxy.ozlabs.org/project/rtc-linux/list/?submitter=81779

Mario_Limonciello · November 21, 2023, 1:21pm

Here’s the other one.

https://lore.kernel.org/linux-rtc/20231106162310.85711-1-mario.limonciello@amd.com/