Updated description/steps to reproduce (same as from my support ticket, disregard references to log/console attachments):
Steps to reproduce (all while connected to the dock):
Fresh boot.
Configure htop to show kernel threads (capital K to toggle), and also include the core/CPU ID in the results (use F2 to go into setup for that).
Run htop (with kernel threads shown) and note the absence of kworker/0:n-pm kernel threads, especially on CPU 0, where presumably this class of interrupts is serviced.
Capture the EC console (sudo ectool console). See attachment fresh_boot.txt. Note the relatively few PORT80 entries.
Suspend the machine (power button, followed by lid close) and wait up to a minute.
Open the lid.
Go to htop again, wait a few seconds for it to update. There are now one or more kworker/0:n-pm and similar kernel threads, apparently servicing interrupts, on the first core/CPU (numbered 0 or 1 depending on how you’ve configured htop).
Capture the EC console. See attachment after_resume.txt. Note the much increased frequency of PORT80 entries, and the change in the codes reported.
Power-cycle the dock (just needs to be off for 1-2 seconds).
Wait a few seconds for htop to refresh. If the dock power-cycle was “successful” you won’t see the kworker/0:n-pm threads putting parasitic load onto the CPU any more. If they’re still there, repeat the power-cycle. I’ve seen this take up to 5 or 6 attempts, although usually 1-2 does the trick.
Once the parasitic kernel workers are out of the picture, capture the EC console again. See attachment after_dock_power_cycle.txt. In this case the workaround succeeded after the first power cycle. Note the absence of PORT80 entries after a while.
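To make the "few vs. many PORT80 entries" comparison in the steps above less of an eyeball exercise, here is a minimal Python sketch that tallies PORT80 codes in saved console captures. It assumes each capture is plain text as printed by sudo ectool console, with codes appearing as "PORT80: XXXX" (four hex digits, as in the excerpt further down). The function names are mine, not part of ectool.

```python
import re
from collections import Counter

# Matches lines such as "PORT80: 3F84" (format assumed from the console excerpt).
PORT80_RE = re.compile(r"PORT80:\s*([0-9A-Fa-f]{4})")


def port80_histogram(console_text):
    """Count how often each PORT80 code appears in one EC console capture."""
    return Counter(code.upper() for code in PORT80_RE.findall(console_text))


def compare_captures(before_text, after_text):
    """Return {code: (count_before, count_after)} for codes seen in either capture."""
    before = port80_histogram(before_text)
    after = port80_histogram(after_text)
    # Counter returns 0 for codes that are missing from one side.
    return {code: (before[code], after[code])
            for code in sorted(set(before) | set(after))}


# Example use with the captures named in the steps above:
#   from pathlib import Path
#   diff = compare_captures(Path("fresh_boot.txt").read_text(),
#                           Path("after_resume.txt").read_text())
#   for code, (n_before, n_after) in diff.items():
#       print(code, n_before, n_after)
```

Codes that show up only in the after-resume column are the ones worth asking about when decoding.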
Original description follows. I'm leaving it visible for context, and in case the htop capture is still relevant.
I noticed this starting with kernel 6.7.3 (edit: noticed because the fan would keep running at low speed while idle, which doesn’t happen normally):
This is on an idle system and that “half core” (sometimes “whole core”) usage seems to start only after one (or likely more) suspend/resume cycles.
kacpid is always floating about on this group of “active” workers.
Once that happens, only a reboot seems to clear it.
Any ideas how I could trace the (hardware?) cause that’s keeping these worker threads busy?
Under normal use, it has taken a day or more of (multiple) suspend/resume cycles to surface. Would the amd_s2idle.py script be a good tool to speed up the repro?
Well, I managed to reproduce this without suspending/resuming. Or, at least, it didn’t manifest immediately upon resume, but only later, after the machine had been idle for long enough (15 minutes) that the screen saver activated (screens turned off) and the desktop locked.
Upon unlocking I saw the same type of worker busy, which hadn’t been the case before I walked away from the machine.
Important edit here: While watching htop I unplugged the TB4 dock that provides power and connections to various USB devices (keyboard, scanner), waited a few seconds, then plugged it in. This apparently resolved the issue.
Then some time later, after a resume this time, saw this again with a similar "frequency distribution".
More edit: This time a simple unplug/plug in of the TB4 cable wouldn’t do it. I eventually got the issue to clear without a reboot by unplugging from TB4/power, suspending, resuming (no extra load any more) then plugging in (still no extra load).
During both of these events, grep '' /sys/firmware/acpi/interrupts/* showed gpe10 increasing, but at a rate that doesn’t seem high enough to explain the load, something like 10/sec to 20/sec. powertop, on the other hand, showed an ACPI kworker taking significant time, for whatever that may be worth.
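For anyone wanting to put a number on the gpe10 rate rather than re-running grep by hand, a rough Python sketch follows. It assumes the sysfs file starts with the interrupt count followed by status flags (that is how it looks on my machine; treat the exact layout as an assumption), and the helper names are made up for illustration.

```python
import time
from pathlib import Path

# gpe10 is the GPE seen firing in this thread; adjust for other machines.
GPE_PATH = Path("/sys/firmware/acpi/interrupts/gpe10")


def parse_gpe_count(text):
    """Return the leading counter from a line like '  1234  EN STS enabled  unmasked'."""
    return int(text.split()[0])


def rate_per_second(count_start, count_end, elapsed_s):
    """Average interrupts per second between two counter readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (count_end - count_start) / elapsed_s


def sample_gpe_rate(path=GPE_PATH, interval_s=5.0):
    """Read the counter twice, interval_s apart, and return the average rate."""
    start = parse_gpe_count(path.read_text())
    t0 = time.monotonic()
    time.sleep(interval_s)
    end = parse_gpe_count(path.read_text())
    return rate_per_second(start, end, time.monotonic() - t0)


# Example use:
#   print(f"gpe10: {sample_gpe_rate():.1f} interrupts/sec")
```

Sampling once on a fresh boot and once while the kworkers are busy gives two numbers that can go straight into a support ticket.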
I think I’m set up for debugging: cros_ec kernel patches applied (V2 from the one I highlighted here), and ectool built from the hx20-hx30 branch of FrameworkComputer/EmbeddedController.
Thanks Mario, will do. For completeness since this thread will probably be referred to, here’s the console output with the last part showing the logspam mostly going away when the machine is unplugged from AC.
... (lots more PORT80 spam preceding)
PORT80: 3F84
PORT80: 3F88
PORT80: 3F80
PORT80: 3FA0
PORT80: 3FA4
PORT80: 3F84
PORT80: 3F88
PORT80: 3F80
PORT80: 3FA0
[3030047.208300 HC 0x0115 err 1]
PORT80: 3FA4
PORT80: 3F84
PORT80: 3F88
PORT80: 3F80
PORT80: 3FA0
PORT80: 3F84
PORT80: 3F88
[3030047.690900 update charger!!]
[3030047.700000 AC off]
[3030047.701900 event set 0x0000000000000010]
P:3 SET TYPEC RP=1[3030047.713500 cypd_write_reg8_wait_ack pre 0x84 ]
[3030047.717200 cypd_write_reg8_wait_ack pre 0x84 ]
[3030047.720200 PORT_DISCONNECT]
PORT80: AA83
[3030047.726100 board_set_active_charge_port port -1, prev:3]
[3030047.737200 cypd_write_reg8_wait_ack pre 0x4 ]
[3030047.749200 event set 0x0400000000000000]
[3030047.767200 cypd_write_reg8_wait_ack pre 0x80 ]
[3030047.770000 event set 0x0400000000000000]
[3030047.784000 Battery 81% (Display 81.9 %) / ??h:?? to empty]
[3030047.788600 CL: p-1 s-1 i500 v0]
[3030047.790100 TODO Implement pd_set_new_power_request port 3]
PORT80: AA8F
[3030047.888900 cypd_update_power_status:0=0x8]
[3030047.892600 cypd_update_power_status:1=0x8]
PORT80: 3F44
PORT80: 3F48
PORT80: AA8F
[3030052.341300 Battery 81% (Display 81.8 %) / 11h:21 to empty]
[3030058.377500 HC 0x0002]
[3030058.380100 HC 0x000b]
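Since the PORT80 lines carry no timestamps of their own, one way to estimate how fast they arrive is to count them between the bracketed timestamps on neighbouring entries. The Python sketch below does that. It assumes the bracketed number is seconds since EC boot, which I have not confirmed, and port80_rate is a name I made up.

```python
import re

# "[3030047.208300 HC 0x0115 err 1]" -> 3030047.208300 (assumed to be seconds).
TS_RE = re.compile(r"\[(\d+\.\d+) ")
PORT80_RE = re.compile(r"PORT80:\s*[0-9A-Fa-f]{4}")


def port80_rate(console_text):
    """Estimate PORT80 lines per second between the first and last timestamps.

    Only PORT80 lines that sit between two timestamped entries are counted.
    Returns None when the capture has fewer than two distinct timestamps.
    """
    first_ts = last_ts = None
    pending = 0   # PORT80 lines seen since the most recent timestamp
    counted = 0   # PORT80 lines known to fall inside the timestamped window
    for line in console_text.splitlines():
        match = TS_RE.search(line)
        if match:
            ts = float(match.group(1))
            if first_ts is None:
                first_ts = ts
            last_ts = ts
            counted += pending
            pending = 0
        elif first_ts is not None and PORT80_RE.search(line):
            pending += 1
    if first_ts is None or last_ts <= first_ts:
        return None
    return counted / (last_ts - first_ts)
```

For a meaningful number, feed it a slice of the capture taken while the spam is ongoing. Including the quiet period after the dock is unplugged drags the average down.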
Edit: opened the support ticket, issue summarized there and pointing back to this thread. I expect it’ll be a few days, especially with the Lunar New Year holiday.
I know it’s generally not useful to get “me too” posts, but since nobody else has corroborated this, I’ll say it. I haven’t dug in with ectool, but I have a bunch of busy kworker threads, and gpe10 is firing.
Thank you @dimitris for investigating and opening the bug report.
I’ve opened a support case with FW, where I owe them some logs/info. In the meantime, I’ve characterized this - at least as it happens with my setup - as related to the dock (Kensington 5780T) I’m using for power and USB devices.
The pattern is that, on the majority of resumes from s2idle I’ll notice these parasitic kworkers being busy. It always involves a pm worker - I assume that’s power management - and often the acpi worker is involved too. This never happens on fresh boot.
I can work around this by cycling power on the dock once or a few times. After the first (or second or third) quick-ish power cycle (just a few seconds off), these workers stop eating CPU.
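If watching htop gets tedious, the same check (are the pm/acpi kworkers still eating CPU after a dock power cycle?) can be scripted. This is a minimal Python sketch, assuming the standard /proc/<pid>/stat layout from proc(5), where fields 14 and 15 are utime and stime in clock ticks and field 39 is the last CPU the task ran on. The function names are hypothetical.

```python
from pathlib import Path


def parse_proc_stat(stat_line):
    """Parse one /proc/<pid>/stat line into (comm, cpu_ticks, last_cpu)."""
    # comm is wrapped in parentheses and may contain spaces, so split on
    # the first '(' and the last ')'.
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    comm = stat_line[lpar + 1:rpar]
    rest = stat_line[rpar + 1:].split()      # rest[0] is field 3 (state)
    ticks = int(rest[11]) + int(rest[12])    # fields 14 (utime) + 15 (stime)
    last_cpu = int(rest[36])                 # field 39 (processor)
    return comm, ticks, last_cpu


def snapshot():
    """Return {pid: (comm, cpu_ticks, last_cpu)} for all readable processes."""
    result = {}
    for entry in Path("/proc").iterdir():
        if entry.name.isdigit():
            try:
                result[int(entry.name)] = parse_proc_stat((entry / "stat").read_text())
            except (OSError, ValueError, IndexError):
                continue  # process exited or line was unexpected; skip it
    return result


def busiest_kworkers(before, after, top=5):
    """Return up to `top` (tick_delta, comm, last_cpu) tuples for kworkers that used CPU."""
    deltas = []
    for pid, (comm, ticks, cpu) in after.items():
        if comm.startswith("kworker/") and pid in before:
            delta = ticks - before[pid][1]
            if delta > 0:
                deltas.append((delta, comm, cpu))
    return sorted(deltas, reverse=True)[:top]


# Example use:
#   import time
#   a = snapshot(); time.sleep(5); b = snapshot()
#   for delta, comm, cpu in busiest_kworkers(a, b):
#       print(f"{comm} (cpu {cpu}): {delta} ticks in 5s")
```

On an idle machine without the problem, the list should be empty or show only a handful of ticks.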
The amount of state machinery involved here (PD/USB4/dock controllers + firmware as well as Linux drivers) seems to be making this race condition a little daunting to chase down.
However, one good first step would be decoding these EC console codes above. Any help on that would be appreciated.