[TRACKING] Fn key stops working on PopOS after a while

Does anyone on this thread use ectool? I have it installed and polling fan speed every ~2 secs, and I’m wondering if it might be the cause.

I just had this happen again, and had definitely checked after resuming from suspend (and everything was fine), and now after leaving my laptop for ~30 mins (on battery power, without it suspending) on returning it’s not working again.

So this rules out the hypothesis where it’s caused by something going wonky during resume from suspend. I also noticed this in my dmesg output:

cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 01
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
cros_ec_lpcs cros_ec_lpcs.0: packet too long (29810 bytes, expected 8)
ectool[1493750]: segfault at 0 ip 0000000000000000 sp 00007fffec92c4c8 error 14 in ectool[5604a8ae3000+5000]
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum 95
Lockdown: ectool: raw io port access is restricted; see man kernel_lockdown.7
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7
cros_ec_lpcs cros_ec_lpcs.0: packet too long (264 bytes, expected 8)
cros_ec_lpcs cros_ec_lpcs.0: bad packet checksum e7

The checksum and “packet too long” errors are common, and I see them at other times also, without noticing this problem. The segfault is less common; I’ve seen it before, but not often, and haven’t tried to correlate it with this keyboard problem. There weren’t any such messages yesterday around the time that I encountered the problem.

In any case, some sort of interaction with ectool seems to me like the most likely candidate, currently. I might try to increase the polling rate on my system, to see if the problem occurs more often…

1 Like

Happens to me, and I’m not using ectool, or doing anything similar.

Haven’t noticed anything notable in my very brief glances at dmesg so far, but I’ll take a proper look at it next time I run into this issue.

1 Like

Ok, that’s good to know, thanks.

I had this happen again today. I think it was while I was plugged into power, but I’m not 100% sure, mostly because I don’t regularly check if the keys are working - I just go to use one, and notice that it doesn’t work.

But the interesting thing is that it fixed itself unexpectedly without requiring a suspend-resume cycle!

My normal setup is using the laptop at my desk, with a USB-C dock providing USB PD + DP-AltMode (first external monitor) + a few USB-A peripherals, and the Framework HDMI adapter (second external monitor). Sometimes I need to leave my desk (eg. move to a meeting room), in which case I unplug both plugs, and disable the external monitors in software (with xrandr, I still use Xorg, and not doing this causes the DP/HDMI screens to not re-attach properly on reconnect), but usually not sleep the laptop.

Today the keys were working fine at the start of the day, and I had one session away from my desk. At some point after returning to my desk and plugging back in, I noticed that the keys weren’t working again. I put up with it for a while, re-checking it periodically. Eventually I gave up and decided to do a suspend-resume cycle to fix it. In preparation for this, I unplugged the 2 cables and disabled the external screens in software. The surprising part was when I then checked the keys again, they were completely back to normal!

So this seems to somehow be related to one of:

  • detaching/attaching USB peripherals/hubs
  • detaching/attaching USB-PD power
  • enabling/disabling/attaching/detaching external monitors
    • either via USB-C or the Framework HDMI adapter (which I believe both use DP-Altmode in much the same way)

Over the coming days I will be keeping a much closer eye on the status of the keys before/after each of these events, to try to narrow things down further.

1 Like

I have this happen too, but it seems to happen after a suspend/resume. I don’t recall it ever happening spontaneously but I’ll keep an eye out for that. I have an 11th gen on NixOS with kernel 6.2.2. One thing I’m testing is whether it is related to deep sleep vs s2idle.

This has happened twice more now.

The first case is interesting because it showed that if the problem is related to plugging/unplugging usb-c/hdmi, it can take more than a few seconds to manifest. The timeline was something like:

  • Resume from suspend, keys were fine.
  • Plug into USB-C and HDMI. Keys remained fine through the stages of doing this.
  • Some time later (1-2 hrs?), unplugged HDMI and USB-C. Keys were fine when I checked a few seconds after each cable was unplugged and after the external displays were disabled in xrandr.
  • ~5 mins later, the keys were not working.

The second case was interesting because it completely rules out plugging/unplugging USB-C/HDMI. The timeline was something like:

  • Resume from suspend, keys were fine.
  • Don’t plug in anything.
  • ~ 20 mins after resuming, keys were not working anymore. I was checking every few-5 mins, and the keys had been fine, until they weren’t.

During this time I didn’t enable/disable any hardware, plug/unplug anything, or move the laptop. It was just typing and touchpad with the laptop on a desk, connected to wifi. (The wifi is flakey, and has a habit of roaming between APs on the same ESSID every few mins. But surely this shouldn’t affect the EC? Also the earlier case above was on a different wifi network with a single BSSID.

So, based on this I agree that the cause is more likely somehow related to resume from suspend (but with some sort of delay), or some completely different and unrelated trigger.

1 Like

I’ve had a look at the EC source code, in particular for why the arrow keys behave as if Fn isn’t pressed, but the F1-F12 keys behave as if Fn has been pressed.

The relevant function seems to be keyboard_scancode_callback in keyboard_customization.c. In particular, it calls:

  • hotkey_F1_F12 which is responsible for converting F1-F12 scancodes into media key scancodes when Fn isn’t pressed (unless Fnlock is enabled).
  • hotkey_special_key and functional_hotkey which are responsible for handling arrow keys (and various other keys) when Fn is pressed.

If keyboard_scancode_callback doesn’t make it as far as these three functions (hotkey_F1_F12, hotkey_special_key, functional_hotkey), then that would be consistent with the observed behaviour. Above the calls to these three functions are a few places where the function early-returns.

So, based on this, my suspicions are returning to ectool shenanigans, and/or the dmesg errors I’m seeing from the cros_ec_lpcs kernel module…

3 Likes

FWIW, the memory region in question (ER1) is also manipulated directly by ACPI on the AP side.

Excellent investigation! I also believe the pos bit to be implicated in this issue.

There is a small window during which in-flight EC exchanges can get corrupted; however, Custom Hello is sent before any code outside the firmware runs. I do not believe the same to be true of ACPI writes to this part of ER1.

I think the flow is that the firmware sends Custom Hello, and then the OS eventually triggers an ACPI method that sets that bit. That should happen early on in ACPI init, though, before any userland EC exchanges. :thinking:

(Edited for further speculation, and to flip the sense on whether or not I believe ACPI writes to ER1 happen before the firmware exits.)

2 Likes

For me it never happened after suspended/resume since I have this features fully disabled.

Sometimes it happens during typing/working, suddenly Fn doesn’t work anymore. And after a random amount of time, it works again. I only little investigated dmesg and systemd logs, but couldn’t see any suspicious log message.

Interesting, so this could (at least theoretically) be caused by stray ACPI-related writes?

This is kindof what I’m thinking. I’m regularly running ectool pwmgetfanrpm, which runs EC_CMD_GET_FEATURES (0x000D), and often see dmesg errors complaining about ec command checksums and unexpected length mismatches (presumably of the responses). So I think this corruption is actually happening reasonably often. If the command request id gets overwritten to be 0x3E02 or 0x3E07, then that could explain the symptoms.

Another possibility could be a buffer overflow accidentally overwriting the stuff in this region. Do you know what normally sits immediately below it?

I don’t understand this, I thought these EC commands like EC_CMD_CUSTOM_HELLO were commands that could be sent to the EC (and run on the EC) by the host?

Do you know if there’s any way to inspect the values in these regions from inside Linux? It would be very useful to be able to check these bits the next time the issue happens.

Similarly, do you know if there might be any workarounds or mitigations? eg. any way to forcibly set the values from inside Linux? Or otherwise perhaps “reset the EC on the fly”? (Or would that be a bad idea…? It seems like a bad idea :slight_smile:)

After not happening much last week, this week it is back to happening several times a day. I’m starting to wonder if it might somehow be environmental?! It seems more likely to occur when running on batteries, and/or when in a noisier RF environment, and/or when moving the laptop from one place to another (without closing it, eg. between rooms). It’s also still much more likely to spontaneously go into the bad mode (no Fn key), than back to working fine (though that has still happened a few times). I really wish there was a way to programmatically introspect the state of the EC.

1 Like

I’m pretty confident everything was working fine at my desk (plugged into HDMI/power/USB keyboard) on the weekend. I closed it there to suspend it, moved it, and opened it now on battery power with nothing attached and the FN key has failed again.

I’m not sure if there’s anything super relevant, but the dmesg output covering the suspend/wake up is here:

Dmesg output
[1051591.675758] audit: type=1326 audit(1679750645.047:14631): auid=4294967295 uid=1000 gid=1000 ses=4294967295 subj=snap.firefox.firefox pid=1628087 comm="CanvasRenderer" exe="/snap/firefox/2391/usr/lib/firefox/firefox" sig=0 arch=c000003e syscall=312 compat=0 ip=0x7f4d7a5b673d code=0x50000
[1051726.247909] kauditd_printk_skb: 2 callbacks suppressed
[1051726.247914] audit: type=1107 audit(1679750779.618:14634): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_method_call"  bus="system" path="/org/freedesktop/NetworkManager" interface="org.freedesktop.NetworkManager" member="GetDevices" mask="send" name="org.freedesktop.NetworkManager" pid=1628087 label="snap.firefox.firefox" peer_pid=1177 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1051857.796719] audit: type=1326 audit(1679750911.166:14635): auid=4294967295 uid=1000 gid=1000 ses=4294967295 subj=snap.firefox.firefox pid=1628087 comm="CanvasRenderer" exe="/snap/firefox/2391/usr/lib/firefox/firefox" sig=0 arch=c000003e syscall=312 compat=0 ip=0x7f4d7a5b673d code=0x50000
[1060031.148530] i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe B (start=141038 end=141039) time 184 us, min 1073, max 1079, scanline start 1070, end 1082
[1062527.137928] PDLOG 2023/03/25 16:26:21.422 P0 Disconnected
[1067871.686300] usb 3-9: reset full-speed USB device number 3 using xhci_hcd
[1067890.690160] usb 3-4: USB disconnect, device number 72
[1067902.877054] audit: type=1107 audit(1679766956.120:14636): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/login1" interface="org.freedesktop.login1.Manager" member="PrepareForSleep" name=":1.2" mask="receive" pid=1628087 label="snap.firefox.firefox" peer_pid=1094 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1067902.881925] audit: type=1107 audit(1679766956.124:14637): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/NetworkManager" interface="org.freedesktop.NetworkManager" member="StateChanged" name=":1.13" mask="receive" pid=67531 label="snap.cups.cups-browsed" peer_pid=1177 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1067902.934560] wlp166s0: deauthenticating from ##:##:##:##:##:## by local choice (Reason: 3=DEAUTH_LEAVING)
[1067903.435310] PM: suspend entry (s2idle)
[1067903.450293] Filesystems sync: 0.014 seconds
[1067903.493150] Freezing user space processes ... (elapsed 0.012 seconds) done.
[1067903.506034] OOM killer disabled.
[1067903.506038] Freezing remaining freezable tasks ... (elapsed 0.003 seconds) done.
[1067903.509126] printk: Suspending console(s) (use no_console_suspend to debug)
[1067903.753279] ACPI: EC: interrupt blocked
[1215055.759536] ACPI: EC: interrupt unblocked
[1215056.083865] i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
[1215056.083869] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[1215056.099672] i915 0000:00:02.0: [drm] HuC authenticated
[1215056.100335] i915 0000:00:02.0: [drm] GuC submission enabled
[1215056.100336] i915 0000:00:02.0: [drm] GuC SLPC enabled
[1215056.100940] i915 0000:00:02.0: [drm] GuC RC: enabled
[1215056.228328] OOM killer enabled.
[1215056.228332] Restarting tasks ... 
[1215056.228825] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-#####: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[1215056.229632] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-#####: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[1215056.233746] done.
[1215056.265412] PM: suspend exit
[1215056.333990] audit: type=1107 audit(1679914107.575:14638): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/login1" interface="org.freedesktop.login1.Manager" member="PrepareForSleep" name=":1.2" mask="receive" pid=1628087 label="snap.firefox.firefox" peer_pid=1094 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1215056.651256] audit: type=1107 audit(1679914107.891:14639): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/NetworkManager" interface="org.freedesktop.NetworkManager" member="StateChanged" name=":1.13" mask="receive" pid=67531 label="snap.cups.cups-browsed" peer_pid=1177 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1215056.713611] audit: type=1107 audit(1679914107.955:14640): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/NetworkManager" interface="org.freedesktop.NetworkManager" member="DeviceRemoved" name=":1.13" mask="receive" pid=67531 label="snap.cups.cups-browsed" peer_pid=1177 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1215056.715557] audit: type=1107 audit(1679914107.955:14641): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/NetworkManager" interface="org.freedesktop.NetworkManager" member="DeviceAdded" name=":1.13" mask="receive" pid=67531 label="snap.cups.cups-browsed" peer_pid=1177 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1215060.394095] audit: type=1107 audit(1679914111.635:14642): pid=1054 uid=103 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/NetworkManager" interface="org.freedesktop.NetworkManager" member="StateChanged" name=":1.13" mask="receive" pid=67531 label="snap.cups.cups-browsed" peer_pid=1177 peer_label="unconfined"
                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
[1215060.477853] wlp166s0: authenticate with ##:##:##:##:##:##
[1215060.477878] wlp166s0: Invalid HE elem, Disable HE
[1215060.492436] wlp166s0: send auth to ##:##:##:##:##:## (try 1/3)
[1215060.555683] wlp166s0: authenticated
[1215060.557415] wlp166s0: associate with ##:##:##:##:##:## (try 1/3)
[1215060.591031] wlp166s0: RX AssocResp from ##:##:##:##:##:## (capab=0x1811 status=0 aid=4)
[1215060.612744] wlp166s0: associated
[1215060.662736] wlp166s0: Limiting TX power to 20 (20 - 0) dBm as advertised by ##:##:##:##:##:##
[1215060.832612] IPv6: ADDRCONF(NETDEV_CHANGE): wlp166s0: link becomes ready

We will take a look at this issue and see what is going on.
Keyboard events go through a separate interface to the ACPI interface.
However the EC will disable all function keys if the OS is not in ACPI mode so if all function keys stop working, including the top row, this may be the cause.
The EC needs a host command from the bios on boot to enable function keys.

This is done in the host_command_customization.c
sci_enable(void)

1 Like

Thanks for taking a look!

The keyboard events are going normally through the 8042 interface, it’s just that the scancodes (and behaviour) are wrong.

I’m pretty sure ACPI is enabled, since this happens in the middle of using an otherwise completely normally functioning laptop (including features that I believe require ACPI, like suspend-resume). And so “Fn key stops working” does literally mean “while using the laptop”, and not “after some boots/resumes”.

Is there a way to check/confirm that Linux is still in ACPI mode?

I’ve looked at sci_enable, the problem appears to be that bit 0 of byte 0 of the “customer_memmap” region. The call to pos_get_state() in keyboard_scancode_callback() means that, when the Fn key is working, we can infer that the bit must be set. Which I think suggests that ACPI has been enabled and probably sci_enable has properly run (gone through its main path)?

But if that bit has somehow been cleared, then that could explain the symptoms. But also if factory_enable has somehow been set, then that could also explain it. My outsiders perspective detailed analysis is back on this earlier comment, in case it helps you get started…

1 Like

Previously, I tried increasing the frequency of how often ectool pwmgetfanrpm is run, to see if it affected the issue (it didn’t, or not noticably). Today I tried the opposite test - competely disabling those calls. I spent 90 minutes in a physical environment / situation where the problem has been readily reproducing for me, and everything was fine. I then re-enabled the ectool pwmgetfanrpm calls (which run every ~0.5 secs), without changing anything else (including the environment / situation), and within 10 minutes the problem had re-occurred and the Fn key was no longer working.

Based on this I’m now pretty convinced that, at least in my case, the ectool usage is causally related. At least one commenter previously said that they weren’t using ectool, but perhaps some other EC comms (eg. by one of the cros_ec* kernel modules?) can cause the problem (perhaps with a lower frequency / incidence).

I think this also correlates with all the dmesg errors I’m sometimes seeing about EC message checksum/packet length errors and ectool segfaults (though I don’t understand why I only sometimes get dmesg complaints about raw port io being locked down…). I’m pretty much back to my original hypotheses: an ectool command request getting mangled such that the EC ends up running 0x3E02 (EC_CMD_FACTORY_MODE) with non-zero arg, or 0x3E07 (EC_CMD_CUSTOM_HELLO), or, something else is causing *host_get_customer_memmap(0x00) & BIT(0) to be cleared or factory_enable to be set (eg. buffer overrun somewhere in the EC?).

I’m using the ectool from DHowett/framework-ec@d5b5b50. Is there a better/official version that I should use instead? For now I’m leaving the ectool fan reporting disabled, and will see if the problem still happens but less frequently…

Ok, I’ve confirmed that it can happen even without using ectool, it just takes a lot longer to occur. Since it happens without ectool, but more frequently with it, it seems unlikely to be something that the cros_ec_lpcs (or whatever) module is doing, and is more likely to be something on the EC side?

This has been happening to my 11th Gen (Gen 1, batch 1) for months. I’m a fairly low-speed, high-drag user running pop!OS, and I see it only in between sleeps. The only way I clear the error is with a reboot. I have not been bothered enough to do any real troubleshooting, but the more days between reboots, the more likely I will see it.

FWIW…

I’ve had the issue yesterday, maybe I should’ve posted here instead in the other thread. Now I’ve rebooted, so it’s gone away, but the dmesg should still be of interest:

https://community.frame.work/t/tracking-fn-lock-makes-both-fn-f-and-f-trigger-the-media-control-keys/26282/19?u=mapleleaf

I’ve also been seeing this issue recently, on an 11th gen running Ubuntu 22.10.

I’m pretty sure this never happened when I first got the laptop a year ago. I believe I’ve first seen this happen about a month ago, maybe a bit more. Looking back, it could have been a BIOS update (I updated to 3.17 in january, though I think I first saw the issue one or two months after that), of maybe an Ubuntu upgrade (I upgraded from 21.something to 22.10 one or two months ago).

As for ectool - I have a version compiled and installed that I have been using to manually check the ec console every now and then, but I’m not using ectool periodically for anything, so that cannot be the cause for me.

I’m also seeing the packet too long errors (and saw bad checksum once) in dmesg.

As for when this happens - I have the feeling it is related to suspending or usb-c docking/undocking, but I also realize there is a bias there - when I’m docked, I use a different keyboard, so it might very well be broken already when I dock, but I won’t notice until I undock, suspend my laptop and then notice when I unsuspend again when I’m on the road.

There is definitely something going on regarding uptime. I recently rebooted after ~70 days. Before doing that, I reinstated my ectool polling, and the problem easily re-occurred in under 10 mins. However, after rebooting, with the same setup, I’m unable to make the problem re-occur, even after running the same polling for several hours.

The main difference that I’m aware of (apart from how long since the system had been rebooted) is a slightly more recent Ubuntu 22.04 kernel: I’m now on 6.0.0-1014-oem, whereas before it was 6.0.0-1010-oem. I guess another difference is that the system is generally more lightly loaded than before (less processes, cpu, and memory currently in use). Over time I’ll see if the problem eventually reoccurs on this slightly more recent kernel version.

1 Like

This also started happened to me on Ubuntu 22.10 shortly after updating firmware from 0.0.3.4 to 0.0.3.17. I’ve downgraded to 0.0.3.10 and will report back if this seems to resolve the issue.