[TRACKING] AMD: small group of kworkers keeping CPU 0 busy after suspend/resume cycle(s)

Still happens with 6.7.6-gentoo-x86_64 (vanilla 6.7.6 + genpatches from /~mpagano/genpatches/trunk/6.7)

See if you can rebuild as described here and see if this helps:

I applied V2 of the patch to 6.7.8;

You say: see if this helps. Do you mean the patch is supposed to help with the issue? I was under the impression it only exposed the EC and allowed console output access.

Have you received any useful reply to your support case? I've been staying on 6.6.13 for now since it doesn't exhibit the problem. I'm not anxious to file a support case personally, but I hope Framework is giving this attention.

Sorry, I haven't yet, things got a little busy. I'll try to build a 6.7.8 plus the patches today to get up-to-date console output now that the behavior is more reproducible.

BTW, if power cycling the hub/dock involved is an option, then I wouldn't hold onto 6.6; there are lots of important fixes, both security and FW-AMD specific, in the 6.7 series.

@Matt_Hartley I'll go ahead and send an update on my support thread, unless I hear otherwise that it makes more sense to track here? (edit: I've responded to the ticket with logs/console output)

Just a thought: does reverting "ACPI: EC: Fix acpi_ec_dispatch_gpe()" (torvalds/linux@b5539eb on github.com) help?

Ticket is easier. Thanks. We are at a workshop this week, so my replies here will be very limited.

This is a good idea. I have yet to repro here, which makes it challenging to put a ton of focus into this. If it persists across multiple updates and we can repro, we can file a bug.

Tried the patch (edit: the reverse of the patch, that is) on top of 6.7.8; it didn't seem to change the pattern, unfortunately.

Thanks for trying.

Same problem here, on 6.7.8-arch1-1.

The behaviour I'm experiencing seems to be caused by a specific action. It seems that if I reboot with AC on, and refrain from unplugging the laptop or closing the lid, it delays the issue for a while. It seems that masking gpe10 before the problem starts will prevent it, but so far I haven't been able to validate that completely, as the problem might appear only after 24-48 hours on my laptop. It also looks like masking gpe10 does cause some issues: my laptop does not seem to wake from sleep when I open the lid if I mask it, so I need to use the power button instead.
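
For anyone wanting to try the masking experiment, here is a minimal sketch. It assumes the kernel's sysfs ACPI interrupt files accept the strings mask and unmask (worth checking against the sysfs-firmware-acpi ABI documentation for your kernel version). The helper names gpe_status and gpe_set_mask are made up for illustration, writing to the real file needs root, and as noted above masking may break lid wake.

```shell
# Assumed sysfs path for the GPE discussed in this thread.
GPE_FILE="${GPE_FILE:-/sys/firmware/acpi/interrupts/gpe10}"

# Print the current line of a GPE file (count followed by status flags).
gpe_status() {
    cat "$1"
}

# Write "mask" or "unmask" to a GPE file; anything else is rejected.
# Hypothetical helper: on a real system this must run as root.
gpe_set_mask() {
    case "$2" in
        mask|unmask) printf '%s\n' "$2" > "$1" ;;
        *) echo "usage: gpe_set_mask FILE mask|unmask" >&2; return 2 ;;
    esac
}

# Example (as root), not executed here:
#   gpe_set_mask "$GPE_FILE" mask
#   gpe_status "$GPE_FILE"
```

The mask does not survive a reboot, so it would have to be reapplied before the problem starts on each boot.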

Maybe important update: Turns out I can also reproduce this with 6.6.14. I installed it on Fedora using the fedora-repos-archive repository:

sudo dnf install fedora-repos-archive
sudo dnf --refresh --enablerepo updates-archive install kernel-6.6.14

I'll update the title and initial description accordingly.

Added this to the ticket for repro when I return to my home office Friday.

I followed the instructions to build with the v2 patches, but when I try to reboot into it, I get a message about "bad shim signature." Am I missing a step?

UPDATE: I was able to disable secure boot with sudo mokutil --disable-validation (from Secureboot - Fedora Project Wiki)

Good news, I was able to precisely replicate the problem on demand. I don't have what it takes to track down why it's happening, but I can definitely show how it's happening.

To test my hypothesis, I used this command:

watch 'grep . -r /sys/firmware/acpi/interrupts/ | grep gpe10'

Also, this is IMPORTANT: reboot before you test. Once you get into the problem, it creates an infinite loop somewhere and you can't test the behaviour anymore.

This interrupt fires every time I plug a USB-C device into a USB-C port, whichever port it is. On my machine I use the back left one.

If I use a powerbank (Anker Nano II in my case), the interrupt fires when I unplug it and when I plug it back in. That's it. 4 cycles: 8 interrupts.

Now, in my office setup, I use a docking station, specifically a Kensington SDS700T. As soon as I plug my laptop into it, I get a dozen interrupts, and from then on, even if I permanently disconnect the docking station from the laptop, some sort of internal kernel loop keeps triggering gpe10 forever and accelerates over time.

After ~24 hours of this, past the first unplug, the handler starts to hog the CPU until it takes 100% of the handling core, rendering the machine unresponsive.
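
To put numbers on the acceleration, something like the following could log the counter at fixed intervals. It is a rough sketch, assuming the first field of /sys/firmware/acpi/interrupts/gpe10 is the interrupt count; gpe_count and gpe_sample are hypothetical helper names, not existing tools.

```shell
# Read the interrupt count (assumed to be the first field) from a GPE file.
gpe_count() {
    awk '{ print $1; exit }' "$1"
}

# Sample FILE every INTERVAL seconds, SAMPLES times.
# Prints one line per sample: "epoch_seconds count delta_since_last_sample".
gpe_sample() {
    file=$1
    interval=$2
    samples=$3
    prev=$(gpe_count "$file")
    i=0
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        now=$(gpe_count "$file")
        printf '%s %s %s\n' "$(date +%s)" "$now" "$((now - prev))"
        prev=$now
        i=$((i + 1))
    done
}

# Example, not executed here: one sample per minute for 24 hours.
#   gpe_sample /sys/firmware/acpi/interrupts/gpe10 60 1440 > gpe10.log
```

A delta column that keeps growing after the dock has been unplugged would match the runaway behaviour described above, while a healthy system should show zeros between plug events.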

I did test another docking station, a CableMatters 201308, and it does NOT create the problem. My guess is that it might be related to a protocol error that gets into a weird loop.

Both docking stations had Ethernet plugged in, but nothing else.

To add to the weirdness of the problem: sometimes it pauses itself for a couple of minutes, but it always eventually restarts if I have plugged in the Kensington even once.

I will be changing my docking station in the meantime, but I'm pretty sure this will impact other devices.

Let me know if I can help. I'll keep the problematic docking station on the side for now.

Try upgrading to 3.03b if you haven't already. This brings an updated EC.

I'm running on 3.03b.

I can confirm that this problem existed on 3.03 and still exists on 3.03b.

Update:

We had a stack of USB cables lying around, from Anker to OEM Apple ones, plus other power banks (the legit Framework one, and a PD 90W).

The behaviour becomes even more interesting. Under no circumstances did any cable produce the problem with any of the power bricks we had, including the Framework one.

However, with the Kensington dock, we found only two cables that were able to reproduce the problem: an unbranded one and the one that comes with the station (how unfortunate...). And they did so every single time. The Anker-branded and Apple OEM cables do not cause the problem on the station.

It seems that the problem might be related to power negotiation edge cases with certain cables under certain circumstances, regardless of the power that can be provided by the power brick on the other side (which separates the problem from what 3.03b is supposed to fix)

So I guess an easy fix for now is just getting rid of your cable that causes the problem.

Interesting. I only have the TB4 cable that came with the dock, and an older and very short TB3 Anker cable. I tried the Anker this morning and still saw this problem. Unless I hear back that you've reproduced this with your Anker or Apple cables soonish, I might head over to an Apple store to pick up one of their TB4 cables.

I assume both of those that are symptom-free (Anker and Apple) are active cables? IIRC Apple has both active and passive ones, or something to that effect - the longer one (2m?) being active.

Just to make sure, this is how I get to a "clean state":

The best way to test is to reboot while using the Framework OEM power adapter and set the watch on gpe10. You should see 0 (or no more than 2-3, let's say).
Then check that the count increases by 2 every time you unplug and replug. If it does, you are in a stable state.
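
The "2 per unplug/replug cycle" rule can be checked with a couple of small helpers. This is only a sketch of the check described above, with made-up names (gpe_count, gpe_cycles_ok), and it assumes the first field of the gpe10 file is the counter.

```shell
# Read the interrupt count (assumed to be the first field) from a GPE file.
gpe_count() {
    awk '{ print $1; exit }' "$1"
}

# Succeed only if the counter rose by exactly 2 per plug cycle.
# Arguments: BEFORE AFTER CYCLES (all plain integers).
gpe_cycles_ok() {
    before=$1
    after=$2
    cycles=$3
    [ "$((after - before))" -eq "$((2 * cycles))" ]
}

# Example, not executed here:
#   f=/sys/firmware/acpi/interrupts/gpe10
#   before=$(gpe_count "$f")
#   ... unplug and replug the cable 4 times ...
#   after=$(gpe_count "$f")
#   gpe_cycles_ok "$before" "$after" 4 && echo "stable" || echo "extra interrupts"
```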

What you are saying is quite interesting...

I found at least two cables that have the problem, including the one that came with the docking station, so you might be out of luck with both of your cables for now. An interesting fact, as you are saying: both of these faulty cables are supposed to be Thunderbolt compatible, so maybe the problem comes from this feature. As of now, I'm using a nondescript Anker USB-C cable that has a fair chance of not supporting Thunderbolt, on the faulty dock.

The other dock we have, the CableMatters, has a built-in cable. It seems we don't see the problem with it since it does not support Thunderbolt; it's just DisplayPort over USB-C. My Kensington dock, however, does support Thunderbolt 4.

I would be tempted to say that the problem comes from Thunderbolt 4 negotiation.