I’m experiencing this error on 11th and 12th gen. I created this topic when I was already on 12th gen, as far as I remember.
I haven’t found a solution either, but over the last few weeks the error has been happening less often than before.
Was suspend involved, and which distro/desktop? To date I have only seen this on 11th gen, and even then I have not reproduced the issue.
Ok, so now I just had the problem again (on the 12th gen). It was not after manipulating the TB dock.
The good news: I discovered that I just had to do a s2ram, and when waking it back up the problem was gone! So, no more need to do a full reboot (and no need to lose the state of the laptop, no more need to re-open all the files, etc.)!
I’m having the same problem with my 12th gen on Ubuntu 23.04.
It only happens to me after waking from sleep, and only every 5th time or so. Putting it back to sleep does not fix it for me, only a reboot fixes it.
The Fn functions stop working; only F1-F12 works, whether or not I hold down the Fn key. Fn lock also doesn’t work. Fn+space bar for the keyboard backlight also stops working; the backlight stays at whatever level it was before going to sleep.
I usually have nothing connected to the laptop except for the charger when putting it to sleep by closing the lid. I have two USB-C modules, one USB-A and one HDMI, if that makes any difference.
I got my Framework in March '23 and it’s been running Ubuntu from the start. I don’t think I noticed this problem in the beginning; it seems to me it only started recently.
Have you tried `s2ram` specifically? There are several flavors of “sleep”, so maybe it depends on which you use…
Reporting back in after 2 weeks, and I’m pretty sure blocking `cros_ec_lpcs` as @Matt_Hartley suggested here ([TRACKING] Fn key stops working on PopOS after a while - #32 by Matt_Hartley) has completely solved the problem for me! Just be sure to run `update-initramfs -u` afterwards to ensure it takes.
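For anyone wondering what “blocking” looks like concretely, here is a minimal sketch of the modprobe blacklist entry (the file name below is arbitrary; any `.conf` under `/etc/modprobe.d/` is read):

```
# /etc/modprobe.d/blacklist-cros-ec.conf  (file name is arbitrary)
blacklist cros_ec_lpcs
```

Then run `sudo update-initramfs -u` (on Debian/Ubuntu-family distros) so the blacklist is also honored in the initramfs at early boot.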
Fantastic! Glad to hear it.
As I understand it, that `cros_ec_lpcs` workaround is expected to fix things only on the 11th gen, is that right? Are there any similar workarounds for the same issue on 12th gen?
On the support ticket that I have open for this issue, support have asked me to try this workaround on my 12th gen, which I’m currently doing. It’s a bit too early to tell for sure, but so far it looks very promising - no re-occurrences yet, no more dmesg errors from the module, and I haven’t noticed any other ill effects. The main downside is that blacklisting the module means no fan reporting/control via ectool (though it might be possible by allowing ectool to do raw port IO?). Overall I think it’s well worth attempting the workaround on 12th gen and seeing how you go with it.
EDIT: I also cheated and did an `rmmod` of the module, without actually rebooting or blacklisting (yet). The `rmmod` apparently worked despite throwing an error. YMMV.
Hi, new owner of a 13th Gen (intel) with Ubuntu 22.04. On wake after a night in sleep mode, the fn key doesn’t work at all. Undetected. fn+esc does not switch state, and all the keys available in fn mode don’t work.
I tried the 11th gen fix, no success. Will reboot and report back.
Update: fixed after reboot. Will follow up if it restarts after a while in sleep mode as this seemed to be the trigger: I had small periods of sleep mode (<30 min) yesterday with no issues. Seems it’s the “long” sleep that triggered this.
I have not had this reoccur. However, my laptop crashed (unrelated) after ~25 days of uptime. But prior to that I had used it several times for numerous hours in situations that were previously highly likely to cause the problem (unplugged and moving around indoors).
So I think it’s reasonably safe to conclude that the problem is indeed caused by something in the `cros_ec_lpcs` module (possibly related to all the dmesg warnings/errors that it also emits).
However, I only consider blacklisting the module to be at most a workaround. Without the module it’s not possible to monitor the fan speed from userspace, or do other ectool-related actions. (Even if giving ectool raw port IO access works, that seems like a risky approach that could potentially cause even more problems.)
The module ought to be able to work, and IIRC there were some Framework-specific patches to it, which I think may be where the problem is. So now that the likely cause has been more-or-less narrowed down, I’d really appreciate it if a Framework engineer could have a closer look into what exactly is going wrong with these EC-host comms.
(Also, my use case for wanting real-time fan-speed monitoring is that I often use the laptop with headphones, which means I can’t hear if the fans spin up loudly, i.e. I don’t notice if something is causing heavy cooling. This can sometimes happen without generating very high system load averages, e.g. there seems to be a minor bug that sometimes causes interrupt storms to/from `snd_hda_intel`. The fan monitoring allows me to visually notice when the fan is working hard, and then, if that’s unexpected, look into the problem.)
[With the appropriate disclaimer that I’m not an engineer at Framework Computer,] we¹ now know what’s going on thanks to all the reports here and the case you’ve built up!
The `cros_ec_lpcs` driver is generic for any laptop that has a ChromeOS EC on the LPC bus. The patches to add support for the Framework Laptop really just add a device identifier and fix port allocation, but don’t themselves cause the issue.
At the end of the day, it’s the same root cause as this equivalent issue filed by Kieran in the (my) CrosEC Windows driver: EC access is sometimes corrupted. · Issue #3 · DHowett/FrameworkWindowsUtils · GitHub; it will likely also reproduce with coolstar’s `crosecbus` driver.
The power and battery state of the machine are managed by ACPI, and the ACPI methods for querying those things call the EC directly³. When it does so, it uses a mutex that can’t be shared with the OS². There are also a couple of ACPI-driven exchanges that occur during wake from sleep. Now, because `cros_ec_lpcs` (Linux) and `CrosEC` (Windows) use the LPC bus directly, an in-flight request from ACPI can collide with an in-flight request from one of these drivers.
Since the `cros_ec_debugfs` driver (not `_lpcs`, mind!) seems to query the EC console repeatedly to surface it via the debugfs interface, it causes a lot of traffic, especially around system startup, that runs a chance of stomping on the one ACPI exchange that clears the preOS bit⁴.
Letting `ectool` do raw port I/O will “fix” it only because it reduces the incidence of host command exchanges. If you run it in a tight loop starting from the moment the machine wakes, you’ll still encounter some corruption of in-flight packets.
I wonder… if you put `cros_ec_debugfs` on the disallowlist instead of `cros_ec_lpcs`, does it do anything for this issue⁵? (@Matt_Hartley, I would love it if you had some cycles spare to help figure out with the community whether `..._debugfs` is an effective workaround; if so, people could still use `ectool`!)
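For anyone who wants to try this, the disallowlist entry would look something like the following (a sketch; the file name is arbitrary, and remember to rebuild the initramfs afterwards the same way as for the `cros_ec_lpcs` workaround):

```
# /etc/modprobe.d/blacklist-cros-ec-debugfs.conf  (file name is arbitrary)
blacklist cros_ec_debugfs
```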
¹ I’m comfortable saying “we” only because I’m the person who caused this issue
² There’s another generic method (`FWMI`) that would allow the OS to communicate with the EC via ACPI instead of using the I/O ports directly, but using it would need a solid chunk of driver work.
³ Beware, this file is huge. DSDT from 11th gen v3.17 > EC0.M001
⁴ The one that you noted earlier and is used to determine whether to ignore/respect Fn
⁵ This might help all users, minus the subset of people who really are using `ectool` during early boot; it would be a more hit-or-miss fix for those folks. It will unequivocally reduce host command traffic!
@DHowett Splendid diagnostic work!!! Really, really appreciated!
Why do you say we cannot use `ectool` when blacklisting `cros_ec_lpcs`? I have it blacklisted but I can still, e.g., set the battery charge limit using `ectool`.
Sorry, that was a bit of imprecision on my part.
Without `cros_ec_lpcs`, you cannot use `ectool` without my patch that adds support for talking to the Microchip EC using raw port I/O. If you’re using a version of `ectool` from the `fw-ectool` repository (or the Arch Linux package with the same name), it has both that patch and the one that adds `fwchargelimit`.
The downside of using `fw-ectool` and raw port I/O is that it requires lockdown to be disabled (some distributions enable it by default), which in turn requires that secure boot be disabled (which, again, some distributions enable by default). It also uses an interface that cannot be permission-managed the way `/dev/cros_ec` can⁶.
⁶ That is, if you were using the `cros_ec_lpcs` driver you could use udev rules to grant a group access to `/dev/cros_ec` instead of having to run `ectool` as root all the time.
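For illustration, such a udev rule might look like the one below. This is only a sketch: the file name and the `plugdev` group are assumptions, so substitute whatever group makes sense on your distro.

```
# /etc/udev/rules.d/99-cros-ec.rules  (hypothetical file name)
# Grant members of the "plugdev" group read/write access to /dev/cros_ec
KERNEL=="cros_ec", GROUP="plugdev", MODE="0660"
```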
Finally home from the doctor today. Appreciate the assist! Yeah, let me see if I can put together a quick markup guide for testing Thurs or Fri.
Haven’t experienced the issue myself despite trying to recreate it, so I like the idea of getting this plugged into a community effort.
We can chat more on Discord. But yes, I’d like for us to throw some things at this, see what results we get.
Excellent suggestion, and overall excellent reply, thanks so much!
I have adjusted my /etc/modprobe.d files accordingly, i.e. changed `blacklist cros_ec_lpcs` to `blacklist cros_ec_debugfs` (and re-run `sudo update-initramfs -u` so there are no surprises at the next reboot), and then done `sudo modprobe cros_ec_lpcs`. `lsmod` now reports that `cros_ec_lpcs` and a variety of other `cros_*` modules are loaded (that previously weren’t), but not `cros_ec_debugfs`. I’ve re-enabled my fan reporting, and will monitor for any recurrence of the original problem, as well as any cros_ec-related output in `dmesg`.
One question I have is: do you know if the kernel drivers handle serializing concurrent requests? I.e., can two `ectool` processes “collide” in the way you describe? Or are collisions only possible between userspace (i.e. `ectool`) and the kernel (i.e. `cros_ec_debugfs`)? I’m not too worried either way - when I set up my fan monitoring, I assumed the worst and went to the slight extra effort of ensuring it doesn’t run `ectool` concurrently (though a general locking wrapper around all calls of `ectool` would likely be better). But if it can happen with just `ectool`, then I’ll probably publish my scripts to help others avoid the problem. (All of this is under the assumption/expectation that blocking `cros_ec_debugfs` will avoid the problem.)
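For what it’s worth, a general locking wrapper of the kind mentioned above could be as simple as a `flock(1)` shim. This is a sketch under assumptions: the `ectool` path and the lock-file location are mine, and `ECTOOL` is overridable purely for illustration/testing.

```shell
# ectool_locked: serialize every ectool invocation on a single lock file,
# so two concurrent callers never issue EC host commands at the same time.
ectool_locked() {
    ECTOOL="${ECTOOL:-/usr/local/sbin/ectool}"
    # flock(1) from util-linux blocks until the lock is free, then runs the command
    flock /tmp/ectool.lock "$ECTOOL" "$@"
}
```

Usage would be e.g. `ectool_locked pwmgetfanrpm`. Note that, per the discussion above, this only serializes userspace callers against each other; it does nothing about collisions with ACPI.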
(Also, I wanted to apologise for the testiness of my previous reply. It felt like the conversation (here and on my support ticket) was being steered towards “ok great just block cros_ec_lpcs to fix kthxbi”, when it started out as more of a diagnostic/investigative test.)
Not a great sign, the dmesg warnings are back already:
[Wed Jun 14 13:28:41 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 13:36:41 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 13:37:06 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 13:37:06 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 13:53:31 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (31609 bytes, expected 100)
[Wed Jun 14 13:53:31 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
At ~13:57 I removed the userspace locking around ectool, and also started aggressively polling the fan speed every 0.1 secs (in addition to the regular fan polling). The errors clearly get much more frequent:
[Wed Jun 14 13:58:56 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 13:59:52 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 13:59:52 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:01:14 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:02:16 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:02:38 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 14:03:53 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:03:53 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:05:28 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (31865 bytes, expected 8)
[Wed Jun 14 14:06:45 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
[Wed Jun 14 14:07:11 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (3535 bytes, expected 8)
[Wed Jun 14 14:08:10 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:08:10 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:08:12 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (38244 bytes, expected 100)
[Wed Jun 14 14:08:12 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:08:41 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 00
[Wed Jun 14 14:09:17 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 14:09:33 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 14:10:03 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:10:03 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:11:50 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:11:50 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:11:54 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:11:54 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:15:05 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:15:05 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:15:12 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:15:12 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:16:44 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
[Wed Jun 14 14:19:04 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:19:04 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:19:11 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (31351 bytes, expected 12)
[Wed Jun 14 14:19:20 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:19:20 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:19:23 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
[Wed Jun 14 14:19:27 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:19:27 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:19:47 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:19:47 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:20:34 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (31607 bytes, expected 8)
[Wed Jun 14 14:22:23 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 14:25:19 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:25:19 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:25:36 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:25:36 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:26:50 2023] ectool[2814316]: segfault at 0 ip 0000000000000000 sp 00007ffdb4ec2738 error 14 in ectool[564368e56000+5000]
[Wed Jun 14 14:26:50 2023] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[Wed Jun 14 14:27:55 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:27:55 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:28:54 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:28:54 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:29:26 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:29:47 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:30:28 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:30:28 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:30:29 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:30:29 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:30:43 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:30:43 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:31:31 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:34:26 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:34:26 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
At ~14:35 I started another process polling every 0.1 secs. The frequency of the errors has roughly doubled:
[Wed Jun 14 14:35:17 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (3535 bytes, expected 100)
[Wed Jun 14 14:35:17 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:36:42 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:36:52 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:36:52 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:37:31 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:37:31 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:37:53 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:37:53 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:37:53 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:38:17 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 00
[Wed Jun 14 14:38:17 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:38:43 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 14:39:07 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
[Wed Jun 14 14:39:43 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:39:43 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:39:44 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:39:44 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:40:17 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:40:17 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:40:55 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (32634 bytes, expected 8)
[Wed Jun 14 14:41:18 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (32634 bytes, expected 8)
[Wed Jun 14 14:41:41 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:41:50 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum f6
[Wed Jun 14 14:41:50 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:42:15 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
[Wed Jun 14 14:42:22 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:42:22 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:42:31 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (264 bytes, expected 8)
[Wed Jun 14 14:42:32 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:42:32 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:43:05 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:43:05 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:43:10 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:43:10 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:43:11 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:43:11 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:43:15 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:43:15 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:43:41 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:43:41 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:43:41 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (30824 bytes, expected 8)
[Wed Jun 14 14:44:01 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:44:01 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:44:28 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:44:28 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:44:50 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (3535 bytes, expected 100)
[Wed Jun 14 14:44:50 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:44:50 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:45:08 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:46:13 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:46:13 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:46:26 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:46:26 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:46:53 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:46:59 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (32003 bytes, expected 12)
[Wed Jun 14 14:47:05 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
[Wed Jun 14 14:47:11 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:47:11 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:47:11 2023] cros_ec_lpcs cros_ec_lpcs.0: packet too long (30824 bytes, expected 8)
[Wed Jun 14 14:47:47 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:47:47 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:48:06 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:48:06 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:48:11 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
[Wed Jun 14 14:48:11 2023] ACPI: battery: [Firmware Bug]: (dis)charge rate invalid.
[Wed Jun 14 14:48:20 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:48:20 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:48:31 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
[Wed Jun 14 14:48:31 2023] Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
[Wed Jun 14 14:49:08 2023] cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
I’ve also noticed that when one of these errors occurs, one of the invocations of ectool reports a fan speed of 0.0 (which makes sense).
The actual Fn issue hasn’t occurred (…yet?).
Interesting! I’d expect it’s a lot more likely to hit with increased `ectool` activity. How often are you polling?
Yes. The kernel driver handles locking around concurrent requests originated both by the kernel and through the userspace `/dev/cros_ec` node. In its default configuration, `ectool` is covered by the same lock that governs kernel EC I/O.
If you’re using raw port I/O (which I guess you’re not, because of the lockdown messages), `ectool` doesn’t share the kernel lock. In-flight userland requests via raw I/O will interact poorly with in-flight kernel requests.
However, `ectool` maintains its own file-based lock in addition to the kernel lock. Two instances of `ectool` should not be able to interfere with each other, even if they are using raw I/O.
The core issue isn’t kernel<->userland locking or the lack thereof, though. It’s a lack of locking between the ACPI bytecode (which talks to the EC directly using port I/O) and the `cros_ec_lpcs` driver or `ectool` using raw port I/O. That is, it’s between [the OS, both kernel and userspace] and [the firmware]⁷.
⁷ Admittedly, ACPI AML is interpreted by the OS and run on the CPU… but still, as a virtual machine, it can only share specific resources with the host OS through specific interfaces. A lock isn’t one of the shareable things.
Unfortunately, the problem has just recurred. So this rules out blocking `cros_ec_debugfs` (while leaving `cros_ec_lpcs` loaded) as a viable workaround.
I guess it’s also not super unexpected, given the problem - `cros_ec_lpcs` is still doing port I/O that can conflict with the port I/O done by ACPI (unless I’ve misunderstood something).
Uptime is 20 days. The occurrence was also correlated with plugging in a USB-C hub with DP-Alt mode (and enabling the display), which I expect caused some ACPI activity. I have also had heavy fan and battery polling running (see below) to (probably) increase the chances of hitting a collision. Suspend-resume cleared the problem, as before.
There have been lots of errors/warnings in dmesg (usually several per min), but until now they have seemed to be mostly harmless, in that they haven’t caused the Fn problem. I have, however, seen other problems that I suspect are related, specifically around incorrectly reported battery status:
This last case has actually happened to me on at least 2 previous occasions, and it’s especially galling because in the default Linux configuration, when upowerd sees such a situation for > 20 secs, it will initiate a system poweroff, which is incredibly disruptive when the battery isn’t actually about to run out of juice. After the second time I tracked the problem down to upower, and then purged it (and everything that depends on it) from my system. But that’s not possible for people using Gnome, for example, and the upower maintainer isn’t interested in allowing users to opt out of this upower behaviour.
Ordinarily, I have 3 places that poll the fan speed every 5 secs (2 i3status widgets and 1 monitoring script, not synced). They get it via a service that prevents concurrent access and caches the result for 2 secs, though recently this has been disabled. There’s also 4 i3status widgets polling the battery status (which I gather involves ACPI) every 30 secs (again, not synced, though the kernel seems to maybe cache for ~1 sec or something).
In addition to this, for the purposes of trying to forcibly increase the odds of a recurrence, I’ve been running 5 processes that poll the battery every 0.1-0.5 secs, and 5 processes that poll the fan every 0.1-0.5 secs:
# Battery polling: 5 background loops reading the battery hwmon values every 0.1-0.5 s
trap 'kill $(jobs -p)' INT; for i in {1..5}; do while sleep 0.$i; do paste /sys/class/power_supply/BAT1/hwmon2/*_input; done & done; wait
# Fan polling: 5 background loops reading the fan RPM via ectool every 0.1-0.5 s
trap 'kill $(jobs -p)' INT; for i in {1..5}; do while sleep 0.$i; do /usr/local/sbin/ectool pwmgetfanrpm | awk '{printf("%s%s", NR==1?"":" ", $NF)} END{printf("\n")}'; echo; done & done; wait
In terms of an actual solution, am I right in understanding that part of the problem is that ACPI can “call into” the host OS at any time, and this is eg. how it delivers events such as lid switch changes and battery status changes? (I’m a bit out of my depth with all this low-level ACPI/port IO stuff.) And that this in turn is what prevents purely OS-side port IO mutexes from being an effective prevention for the collisions? Does this then mean that the only viable solution for host OS fan management/control is to do that via ACPI instead of raw port IO - and that such a conversion would be a Big Job?
Any chance this issue has been fixed yet? It’s super annoying.