Does anyone on this thread use ectool? I have it installed and polling fan speed every ~2 secs, and I’m wondering if it might be the cause.
I just had this happen again, and had definitely checked after resuming from suspend (and everything was fine), and now after leaving my laptop for ~30 mins (on battery power, without it suspending) on returning it’s not working again.
So this rules out the hypothesis where it’s caused by something going wonky during resume from suspend. I also noticed this in my dmesg output:
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
cros_ec_lpcs cros_ec_lpcs.0: packet too long (29810 bytes, expected 8)
ectool[1493750]: segfault at 0 ip 0000000000000000 sp 00007fffec92c4c8 error 14 in ectool[5604a8ae3000+5000]
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
cros_ec_lpcs cros_ec_lpcs.0: packet too long (264 bytes, expected 8)
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
The checksum and “packet too long” errors are common, and I see them at other times also, without noticing this problem. The segfault is less common; I’ve seen it before, but not often, and haven’t tried to correlate it with this keyboard problem. There weren’t any such messages yesterday around the time that I encountered the problem.
In any case, some sort of interaction with ectool seems to me like the most likely candidate, currently. I might try to increase the polling rate on my system, to see if the problem occurs more often…
I had this happen again today. I think it was while I was plugged into power, but I’m not 100% sure, mostly because I don’t regularly check if the keys are working - I just go to use one, and notice that it doesn’t work.
But the interesting thing is that it fixed itself unexpectedly without requiring a suspend-resume cycle!
My normal setup is using the laptop at my desk, with a USB-C dock providing USB PD + DP-AltMode (first external monitor) + a few USB-A peripherals, and the Framework HDMI adapter (second external monitor). Sometimes I need to leave my desk (eg. move to a meeting room), in which case I unplug both plugs, and disable the external monitors in software (with xrandr, I still use Xorg, and not doing this causes the DP/HDMI screens to not re-attach properly on reconnect), but usually not sleep the laptop.
Today the keys were working fine at the start of the day, and I had one session away from my desk. At some point after returning to my desk and plugging back in, I noticed that the keys weren’t working again. I put up with it for a while, re-checking it periodically. Eventually I gave up and decided to do a suspend-resume cycle to fix it. In preparation for this, I unplugged the 2 cables and disabled the external screens in software. The surprising part was when I then checked the keys again, they were completely back to normal!
either via USB-C or the Framework HDMI adapter (which I believe both use DP-Altmode in much the same way)
Over the coming days I will be keeping a much closer eye on the status of the keys before/after each of these events, to try to narrow things down further.
I have this happen too, but it seems to happen after a suspend/resume. I don’t recall it ever happening spontaneously but I’ll keep an eye out for that. I have an 11th gen on NixOS with kernel 6.2.2. One thing I’m testing is whether it is related to deep sleep vs s2idle.
The first case is interesting because it showed that if the problem is related to plugging/unplugging usb-c/hdmi, it can take more than a few seconds to manifest. The timeline was something like:
Resume from suspend, keys were fine.
Plug into USB-C and HDMI. Keys remained fine through the stages of doing this.
Some time later (1-2 hrs?), unplugged HDMI and USB-C. Keys were fine when I checked a few seconds after each cable was unplugged and after the external displays were disabled in xrandr.
~5 mins later, the keys were not working.
The second case was interesting because it completely rules out plugging/unplugging USB-C/HDMI. The timeline was something like:
Resume from suspend, keys were fine.
Don’t plug in anything.
~ 20 mins after resuming, keys were not working anymore. I was checking every few-5 mins, and the keys had been fine, until they weren’t.
During this time I didn’t enable/disable any hardware, plug/unplug anything, or move the laptop. It was just typing and touchpad with the laptop on a desk, connected to wifi. (The wifi is flakey, and has a habit of roaming between APs on the same ESSID every few mins. But surely this shouldn’t affect the EC? Also the earlier case above was on a different wifi network with a single BSSID.
So, based on this I agree that the cause is more likely somehow related to resume from suspend (but with some sort of delay), or some completely different and unrelated trigger.
I’ve had a look at the EC source code, in particular for why the arrow keys behave as if Fn isn’t pressed, but the F1-F12 keys behave as if Fn has been pressed.
The relevant function seems to be keyboard_scancode_callback in keyboard_customization.c. In particular, it calls:
hotkey_F1_F12 which is responsible for converting F1-F12 scancodes into media key scancodes when Fn isn’t pressed (unless Fnlock is enabled).
If keyboard_scancode_callback doesn’t make it as far as these three functions (hotkey_F1_F12, hotkey_special_key, functional_hotkey), then that would be consistent with the observed behaviour. Above the calls to these three functions are a few places where the function early-returns.
Another is if the device is in “factory state”, which I assume is either the shipping mode or a factory test mode. factory_status() just returns the value of factory_enable, which is static and only set in factory_setting(). That function is only called by factory_mode in host_command_customization.c. If this EC host command (0x3E02) is called with a non-zero arg (regardless of if the arg is RESET_FOR_SHIP or not), then that could explain these symptoms.
Another is checking for “preOS” state. pos_get_state just returns bit 0 of *host_get_customer_memmap(0x00). This appears to be the first byte of a (shared?) comms buffer between the EC and host. The only place that bit 0 gets set is in the “custom hello” ec host command, where it gets cleared. Since the scancode function is testing !pos_get_state(), if the bit is cleared, then the keyboard routine will not do Fn key handling.
So, based on this, my suspicions are returning to ectool shenanigans, and/or the dmesg errors I’m seeing from the cros_ec_lpcs kernel module…
FWIW, the memory region in question (ER1) is also manipulated directly by ACPI on the AP side.
Excellent investigation! I also believe the pos bit to be implicated in this issue.
There is a small window during which in-flight EC exchanges can get corrupted; however, Custom Hello is sent before any code outside the firmware runs. I do not believe the same to be true of ACPI writes to this part of ER1.
I think the flow is that the firmware sends Custom Hello, and then the OS eventually triggers an ACPI method that sets that bit. That should happen early on in ACPI init, though, before any userland EC exchanges.
(Edited for further speculation, and to flip the sense on whether or not I believe ACPI writes to ER1 happen before the firmware exits.)
For me it never happened after suspended/resume since I have this features fully disabled.
Sometimes it happens during typing/working, suddenly Fn doesn’t work anymore. And after a random amount of time, it works again. I only little investigated dmesg and systemd logs, but couldn’t see any suspicious log message.
Interesting, so this could (at least theoretically) be caused by stray ACPI-related writes?
This is kindof what I’m thinking. I’m regularly running ectool pwmgetfanrpm, which runs EC_CMD_GET_FEATURES (0x000D), and often see dmesg errors complaining about ec command checksums and unexpected length mismatches (presumably of the responses). So I think this corruption is actually happening reasonably often. If the command request id gets overwritten to be 0x3E02 or 0x3E07, then that could explain the symptoms.
Another possibility could be a buffer overflow accidentally overwriting the stuff in this region. Do you know what normally sits immediately below it?
I don’t understand this, I thought these EC commands like EC_CMD_CUSTOM_HELLO were commands that could be sent to the EC (and run on the EC) by the host?
Do you know if there’s any way to inspect the values in these regions from inside Linux? It would be very useful to be able to check these bits the next time the issue happens.
Similarly, do you know if there might be any workarounds or mitigations? eg. any way to forcibly set the values from inside Linux? Or otherwise perhaps “reset the EC on the fly”? (Or would that be a bad idea…? It seems like a bad idea )
After not happening much last week, this week it is back to happening several times a day. I’m starting to wonder if it might somehow be environmental?! It seems more likely to occur when running on batteries, and/or when in a noisier RF environment, and/or when moving the laptop from one place to another (without closing it, eg. between rooms). It’s also still much more likely to spontaneously go into the bad mode (no Fn key), than back to working fine (though that has still happened a few times). I really wish there was a way to programmatically introspect the state of the EC.
I’m pretty confident everything was working fine at my desk (plugged into HDMI/power/USB keyboard) on the weekend. I closed it there to suspend it, moved it, and opened it now on battery power with nothing attached and the FN key has failed again.
I’m not sure if there’s anything super relevant, but the dmesg output covering the suspend/wake up is here:
We will take a look at this issue and see what is going on.
Keyboard events go through a separate interface to the ACPI interface.
However the EC will disable all function keys if the OS is not in ACPI mode so if all function keys stop working, including the top row, this may be the cause.
The EC needs a host command from the bios on boot to enable function keys.
This is done in the host_command_customization.c
sci_enable(void)
The keyboard events are going normally through the 8042 interface, it’s just that the scancodes (and behaviour) are wrong.
I’m pretty sure ACPI is enabled, since this happens in the middle of using an otherwise completely normally functioning laptop (including features that I believe require ACPI, like suspend-resume). And so “Fn key stops working” does literally mean “while using the laptop”, and not “after some boots/resumes”.
Is there a way to check/confirm that Linux is still in ACPI mode?
I’ve looked at sci_enable, the problem appears to be that bit 0 of byte 0 of the “customer_memmap” region. The call to pos_get_state() in keyboard_scancode_callback() means that, when the Fn key is working, we can infer that the bit must be set. Which I think suggests that ACPI has been enabled and probably sci_enable has properly run (gone through its main path)?
But if that bit has somehow been cleared, then that could explain the symptoms. But also if factory_enable has somehow been set, then that could also explain it. My outsiders perspective detailed analysis is back on this earlier comment, in case it helps you get started…
Previously, I tried increasing the frequency of how often ectool pwmgetfanrpm is run, to see if it affected the issue (it didn’t, or not noticably). Today I tried the opposite test - competely disabling those calls. I spent 90 minutes in a physical environment / situation where the problem has been readily reproducing for me, and everything was fine. I then re-enabled the ectool pwmgetfanrpm calls (which run every ~0.5 secs), without changing anything else (including the environment / situation), and within 10 minutes the problem had re-occurred and the Fn key was no longer working.
Based on this I’m now pretty convinced that, at least in my case, the ectool usage is causally related. At least one commenter previously said that they weren’t using ectool, but perhaps some other EC comms (eg. by one of the cros_ec* kernel modules?) can cause the problem (perhaps with a lower frequency / incidence).
I think this also correlates with all the dmesg errors I’m sometimes seeing about EC message checksum/packet length errors and ectool segfaults (though I don’t understand why I only sometimes get dmesg complaints about raw port io being locked down…). I’m pretty much back to my original hypotheses: an ectool command request getting mangled such that the EC ends up running 0x3E02 (EC_CMD_FACTORY_MODE) with non-zero arg, or 0x3E07 (EC_CMD_CUSTOM_HELLO), or, something else is causing *host_get_customer_memmap(0x00) & BIT(0) to be cleared or factory_enable to be set (eg. buffer overrun somewhere in the EC?).
I’m using the ectool from DHowett/framework-ec@d5b5b50. Is there a better/official version that I should use instead? For now I’m leaving the ectool fan reporting disabled, and will see if the problem still happens but less frequently…
Ok, I’ve confirmed that it can happen even without using ectool, it just takes a lot longer to occur. Since it happens without ectool, but more frequently with it, it seems unlikely to be something that the cros_ec_lpcs (or whatever) module is doing, and is more likely to be something on the EC side?
This has been happening to my 11th Gen (Gen 1, batch 1) for months. I’m a fairly low-speed, high-drag user running pop!OS, and I see it only in between sleeps. The only way I clear the error is with a reboot. I have not been bothered enough to do any real troubleshooting, but the more days between reboots, the more likely I will see it.
I’ve had the issue yesterday, maybe I should’ve posted here instead in the other thread. Now I’ve rebooted, so it’s gone away, but the dmesg should still be of interest:
I’ve also been seeing this issue recently, on an 11th gen running Ubuntu 22.10.
I’m pretty sure this never happened when I first got the laptop a year ago. I believe I’ve first seen this happen about a month ago, maybe a bit more. Looking back, it could have been a BIOS update (I updated to 3.17 in january, though I think I first saw the issue one or two months after that), of maybe an Ubuntu upgrade (I upgraded from 21.something to 22.10 one or two months ago).
As for ectool - I have a version compiled and installed that I have been using to manually check the ec console every now and then, but I’m not using ectool periodically for anything, so that cannot be the cause for me.
I’m also seeing the packet too long errors (and saw bad checksum once) in dmesg.
As for when this happens - I have the feeling it is related to suspending or usb-c docking/undocking, but I also realize there is a bias there - when I’m docked, I use a different keyboard, so it might very well be broken already when I dock, but I won’t notice until I undock, suspend my laptop and then notice when I unsuspend again when I’m on the road.
There is definitely something going on regarding uptime. I recently rebooted after ~70 days. Before doing that, I reinstated my ectool polling, and the problem easily re-occurred in under 10 mins. However, after rebooting, with the same setup, I’m unable to make the problem re-occur, even after running the same polling for several hours.
The main difference that I’m aware of (apart from how long since the system had been rebooted) is a slightly more recent Ubuntu 22.04 kernel: I’m now on 6.0.0-1014-oem, whereas before it was 6.0.0-1010-oem. I guess another difference is that the system is generally more lightly loaded than before (less processes, cpu, and memory currently in use). Over time I’ll see if the problem eventually reoccurs on this slightly more recent kernel version.
This also started happened to me on Ubuntu 22.10 shortly after updating firmware from 0.0.3.4 to 0.0.3.17. I’ve downgraded to 0.0.3.10 and will report back if this seems to resolve the issue.