Unfortunately, the problem has just recurred. So this rules out blocking cros_ec_debugfs (while leaving cros_ec_lpcs loaded) as a viable workaround.
I guess it’s also not super unexpected, given the problem - cros_ec_lpcs is still doing port IO that can conflict with the port IO done by ACPI (unless I’ve misunderstood something).
Uptime is 20 days. The occurrence was also correlated with plugging in a USB-C hub with DP-Alt mode (and enabling the display), which I expect caused some ACPI activity. I have also had heavy fan and battery polling running (see below) to (probably) increase the chances of hitting a collision. Suspend-resume cleared the problem, as before.
There have been lots of errors/warnings in dmesg (usually several per min), but until now they have seemed to be mostly harmless, in that they haven’t caused the Fn problem. I have, however, seen other problems that I suspect are related, specifically around incorrectly reported battery status:
- the battery reports as absent/invalid
- in one case I saw that the /sys entries for it disappeared completely
- in another case it reported as present, but empty (under 5% of capacity) and discharging (despite being plugged in and fully charged, with the chassis LED showing white).
This last case has actually happened to me on at least 2 previous occasions, and it is especially galling because in the default Linux configuration, when upowerd sees such a situation for > 20 secs, it will initiate a system poweroff, which is incredibly disruptive when the battery isn’t actually about to run out of juice. After the second time I tracked down the problem to upower, and then purged it (and everything that depends on it) from my system. But that’s not possible for people using Gnome, for example, and the upower maintainer isn’t interested in allowing users to opt out of this upower behaviour.
Ordinarily, I have 3 places that poll the fan speed every 5 secs (2 i3status widgets and 1 monitoring script, not synced). They get it via a service that prevents concurrent access and caches the result for 2 secs, though recently this has been disabled. There are also 4 i3status widgets polling the battery status (which I gather involves ACPI) every 30 secs (again, not synced, though the kernel seems to maybe cache for ~1 sec or something).
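For anyone wanting something similar, here is a minimal sketch of that kind of lock-and-cache wrapper. It is not the actual service I run: the function name cached_read, the cache path in the usage line, and the reliance on flock(1) and GNU stat are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical lock-and-cache wrapper (illustrative names, not the real service).
# Usage: cached_read CACHE_FILE MAX_AGE_SECS COMMAND [ARGS...]

cached_read() {
    local cache_file=$1 max_age=$2 now mtime
    shift 2
    (
        # Take an exclusive lock on fd 9 for the whole check-and-refresh step,
        # so concurrent callers never run the underlying command in parallel.
        flock -x 9 || exit 1
        now=$(date +%s)
        # GNU stat assumed; a missing cache file counts as infinitely old.
        mtime=$(stat -c %Y "$cache_file" 2>/dev/null || echo 0)
        if [ ! -s "$cache_file" ] || [ $(( now - mtime )) -ge "$max_age" ]; then
            # Write to a temp file first so a failed command keeps the old value.
            "$@" > "$cache_file.tmp" && mv "$cache_file.tmp" "$cache_file"
        fi
        cat "$cache_file"
    ) 9> "$cache_file.lock"
}
```

With this, each widget would call something like `cached_read /run/fan-rpm.cache 2 /usr/local/sbin/ectool pwmgetfanrpm`, so at most one real EC read happens per 2 secs no matter how many pollers there are. Note that this only serialises the userspace callers with each other; it does nothing about collisions with ACPI.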
In addition to this, for the purposes of trying to forcibly increase the odds of a recurrence, I’ve been running 5 processes that poll the battery every 0.1-0.5 secs, and 5 processes that poll the fan every 0.1-0.5 secs:
trap 'kill $(jobs -p)' INT; for i in {1..5}; do while sleep 0.$i; do paste /sys/class/power_supply/BAT1/hwmon2/*_input; done & done; wait
trap 'kill $(jobs -p)' INT; for i in {1..5}; do while sleep 0.$i; do /usr/local/sbin/ectool pwmgetfanrpm | awk '{printf("%s%s", NR==1?"":" ", $NF)} END{printf("\n")}'; echo; done & done; wait
In terms of an actual solution, am I right in understanding that part of the problem is that ACPI can “call into” the host OS at any time, and that this is, e.g., how it delivers events such as lid switch changes and battery status changes? (I’m a bit out of my depth with all this low-level ACPI/port IO stuff.) And that this in turn is what prevents purely OS-side port IO mutexes from being an effective prevention for the collisions? Does this then mean that the only viable solution for host OS fan management/control is to do that via ACPI instead of raw port IO - and that such a conversion would be a Big Job?