[RESPONDED] Suspend to memory broken on linux-6.4.11 (debian trixie) due to over-current on xHCI bus

[UPDATE: see my most recent comment for what i believe is the actual root cause, with detailed kernel dyndbg logs and concrete suggestion from a linux kernel contact – [RESPONDED] Suspend to memory broken on linux-6.4.11 (debian trixie) due to over-current on xHCI bus - #6 by khimaros]

this seems to be a regression with recent kernels or perhsps with recent dynamically loaded firmware.

i’ve removed all USB modules from the laptop, nothing is plugged in. bluetooth is disabled.

i’m running the latest stable Insyde firmware and EC:

# dmidecode | head
BIOS Information
	Vendor: INSYDE Corp.
	Version: 03.17
	Release Date: 10/27/2022

my hardware is the i7-1185G7 model:

# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 140
model name	: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

i’ve also tried with both deep and s2idle suspend, the same result.

the key seems to be this line in dmesg:

[37767.143371] usb usb3: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
[37767.143388] usb usb3: PM: failed to suspend async: error -16

i notice that the fingerprint reader and builtin bluetooth are on bus 003:

Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 003: ID 27c6:609c Shenzhen Goodix Technology Co.,Ltd. Goodix USB2.0 MISC
Bus 003 Device 004: ID 8087:0032 Intel Corp. AX210 Bluetooth
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

more context is here

[37764.783763] PM: suspend entry (deep)
[37764.795359] Filesystems sync: 0.011 seconds
[37764.795518] (NULL device *): firmware: direct-loading firmware iwlwifi-ty-a0-gf-a0.pnvm
[37764.795547] (NULL device *): firmware: direct-loading firmware regulatory.db
[37764.795551] (NULL device *): firmware: direct-loading firmware regulatory.db.p7s
[37764.795561] (NULL device *): firmware: direct-loading firmware i915/tgl_dmc_ver2_12.bin
[37764.795570] (NULL device *): firmware: direct-loading firmware intel/ibt-0041-0041.ddc
[37764.795695] (NULL device *): firmware: direct-loading firmware intel/ibt-0041-0041.sfi
[37764.795801] (NULL device *): firmware: direct-loading firmware iwlwifi-ty-a0-gf-a0-74.ucode
[37764.946340] Freezing user space processes
[37764.948284] Freezing user space processes completed (elapsed 0.001 seconds)
[37764.948287] OOM killer disabled.
[37764.948288] Freezing remaining freezable tasks
[37764.949497] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[37764.949537] printk: Suspending console(s) (use no_console_suspend to debug)
[37764.971991] usb usb3: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
[37764.972026] usb usb3: PM: failed to suspend async: error -16
[37765.154149] PM: Some devices failed to suspend, or early wake event detected
[37765.166198] iwlwifi 0000:aa:00.0: WRT: Invalid buffer destination
[37765.331405] iwlwifi 0000:aa:00.0: WFPM_UMAC_PD_NOTIFICATION: 0x1f
[37765.331426] iwlwifi 0000:aa:00.0: WFPM_LMAC2_PD_NOTIFICATION: 0x1f
[37765.331437] iwlwifi 0000:aa:00.0: WFPM_AUTH_KEY_0: 0x80
[37765.331449] iwlwifi 0000:aa:00.0: CNVI_SCU_SEQ_DATA_DW9: 0x0
[37766.881362] OOM killer enabled.
[37766.881364] Restarting tasks ... done.
[37766.884326] random: crng reseeded on system resumption
[37766.892032] PM: suspend exit
[37766.892079] PM: suspend entry (s2idle)
[37766.896925] Filesystems sync: 0.004 seconds
[37766.938387] Freezing user space processes
[37766.940544] Freezing user space processes completed (elapsed 0.002 seconds)
[37766.940547] OOM killer disabled.
[37766.940548] Freezing remaining freezable tasks
[37767.010260] Freezing remaining freezable tasks completed (elapsed 0.069 seconds)
[37767.010264] printk: Suspending console(s) (use no_console_suspend to debug)
[37767.143371] usb usb3: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
[37767.143388] usb usb3: PM: failed to suspend async: error -16
[37767.310168] PM: Some devices failed to suspend, or early wake event detected
[37767.617317] OOM killer enabled.
[37767.617318] Restarting tasks ... done.
[37767.619760] random: crng reseeded on system resumption
[37767.626971] PM: suspend exit

any thoughts on how to further troubleshoot?

i also tried disabling external ports, WiFi, bluetooth, and fingerprint in the BIOS and the same issue occurs but this time the wake comes from usb1 instead of usb3:

# lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

# dmesg
[   46.058500] PM: suspend entry (deep)
[   46.081038] Filesystems sync: 0.022 seconds
[   46.081356] (NULL device *): firmware: direct-loading firmware i915/tgl_dmc_ver2_12.bin
[   46.117034] Freezing user space processes
[   46.119284] Freezing user space processes completed (elapsed 0.002 seconds)
[   46.119290] OOM killer disabled.
[   46.119290] Freezing remaining freezable tasks
[   46.120195] Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
[   46.120234] printk: Suspending console(s) (use no_console_suspend to debug)
[   46.143610] usb usb1: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
[   46.143630] usb usb1: PM: failed to suspend async: error -16
[   46.168896] PM: Some devices failed to suspend, or early wake event detected
[   46.282869] OOM killer enabled.
[   46.282878] Restarting tasks ... done.
[   46.287071] random: crng reseeded on system resumption
[   46.290529] PM: suspend exit
[   46.290570] PM: suspend entry (s2idle)
[   46.302710] Filesystems sync: 0.012 seconds
[   46.324693] Freezing user space processes
[   46.326374] Freezing user space processes completed (elapsed 0.001 seconds)
[   46.326377] OOM killer disabled.
[   46.326378] Freezing remaining freezable tasks
[   46.327474] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   46.327476] printk: Suspending console(s) (use no_console_suspend to debug)
[   46.491427] usb usb1: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
[   46.491469] usb usb1: PM: failed to suspend async: error -16
[   46.516852] PM: Some devices failed to suspend, or early wake event detected
[   46.630211] OOM killer enabled.
[   46.630215] Restarting tasks ... done.
[   46.636278] random: crng reseeded on system resumption
[   46.638935] PM: suspend exit

powertop confirms that none of the USB have wakeup enabled.

# cat /sys/bus/usb/devices/*/power/wakeup
disabled
disabled
disabled
disabled
disabled

however, XHCI wakeup does seem to be enabled:

# grep enabled /proc/acpi/wakeup 
PEG0	  S4	*enabled   pci:0000:00:06.0
XHCI	  S4	*enabled   pci:0000:00:14.0
RP10	  S4	*enabled   pci:0000:00:1d.0
TXHC	  S4	*enabled   pci:0000:00:0d.0
TDM0	  S4	*enabled   pci:0000:00:0d.2
TDM1	  S4	*enabled   pci:0000:00:0d.3
TRP0	  S4	*enabled   pci:0000:00:07.0
TRP1	  S4	*enabled   pci:0000:00:07.1
TRP2	  S4	*enabled   pci:0000:00:07.2
TRP3	  S4	*enabled   pci:0000:00:07.3
PWRB	  S4	*enabled   platform:PNP0C0C:00

i’ve captured a full cold boot and suspend attempt while all optional peripherals were disabled in the BIOS (same as my previous comment).

these are the only messages related to the failing USB bus:

# grep usb1 boot-result.txt 
Aug 28 05:20:54 <ELIDED> kernel: usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.04
Aug 28 05:20:54 <ELIDED> kernel: usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Aug 28 05:20:54 <ELIDED> kernel: usb usb1: Product: xHCI Host Controller
Aug 28 05:20:54 <ELIDED> kernel: usb usb1: Manufacturer: Linux 6.4.0-3-amd64 xhci-hcd
Aug 28 05:20:54 <ELIDED> kernel: usb usb1: SerialNumber: 0000:00:14.0
Aug 28 05:20:54 <ELIDED> kernel: usb usb1-port1: over-current condition
Aug 28 05:20:54 <ELIDED> kernel: usb usb1-port3: over-current condition
Aug 28 05:21:04 <ELIDED> kernel: usb usb1: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
Aug 28 05:21:04 <ELIDED> kernel: usb usb1: PM: failed to suspend async: error -16
Aug 28 05:21:04 <ELIDED> kernel: usb usb1: PM: dpm_run_callback(): usb_dev_suspend+0x0/0x20 [usbcore] returns -16
Aug 28 05:21:04 <ELIDED> kernel: usb usb1: PM: failed to suspend async: error -16

full dmesg output can be found here: Debian Pastezone

potentially related, as i’ve seen the kworker pm at 100% in some cases after attempting to suspend:

a slightly more interesting dmesg log:

Aug 28 06:55:49 <ELIDED> kernel: usb: port power management may be unreliable
Aug 28 06:55:49 <ELIDED> kernel: usb usb3-port1: over-current condition
Aug 28 06:55:49 <ELIDED> kernel: usb usb3-port3: over-current condition

disconnecting AC power and letting the battery drain down to 50% didn’t make a difference as with some of the users in the previously mentioned threads.

unloading the xhci_pci kernel module (modprobe -r xhci_pci) prior to suspend reliably works around the issue.

however, i’m providing more detail in the hopes that the deeper problems with the insyde firmware can be resolved.

@Matt_Hartley: tagging you since you seem to be the resident power management guru. tl;dr, it seems like something is triggering an over-current condition in the xHCI hub.

i believe this is the line in the kernel that is preventing suspend due to over-current condition: linux/drivers/usb/host/xhci-hub.c at 2dde18cd1d8fac735875f2e4987f11817cc0bc2c · torvalds/linux · GitHub

however, speaking with someone familiar with the internals of ACPI in the linux kernel, insyde is a very common source of non-compliant power issues. they believe that the issue lies somewhere deeper:

it sounds like either 1) the ACPI tables correctly mark the ports as unconnected but kernel doesn’t use that info or 2) ACPI incorrectly says both “unconnected” and “active” and confuses the kernel, hence it saying “may be unreliable”

i believe we have a concrete bug report for @Matt_Hartley now. with dyndbg enabled for usb/core/port.c, we see the following:

# echo "file drivers/usb/core/port.c +p" | sudo dd of=/sys/kernel/debug/dynamic_debug/control

# modprobe -r xhci_pci

# modprobe xhci_pci

# dmesg | grep peer
[ 1606.201010] usb usb4-port1: peered to usb3-port2
[ 1606.201140] usb: failed to peer usb4-port2 and usb3-port2 by location (usb4-port2:none) (usb3-port2:usb4-port1)
[ 1606.201143] usb usb4-port2: failed to peer to usb3-port2 (-16)
[ 1606.201257] usb: failed to peer usb4-port3 and usb3-port2 by location (usb4-port3:none) (usb3-port2:usb4-port1)
[ 1606.201260] usb usb4-port3: failed to peer to usb3-port2 (-16)
[ 1606.201378] usb: failed to peer usb4-port4 and usb3-port2 by location (usb4-port4:none) (usb3-port2:usb4-port1)
[ 1606.201380] usb usb4-port4: failed to peer to usb3-port2 (-16)

based on this result, my linux kernel contact suggested:

Next step is to report this as a bug to framework and suggest they double-check the ACPI tables, and refer them to that commit message about this that clued us in (usb: Quiet down false peer failure messages - kernel/git/torvalds/linux.git - Linux kernel source tree), and also explain my theory about the over-current actual cause of a floating ADC output (Analog to Digital Converter)

if these unconnected ports were correctly configured as such in the firmware tables then the kernel would not try to link them from the XHCI to EHCI controller as it is doing

here’s the lsusb output:

Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 0483:a2ca STMicroelectronics Solo 3.1.2
Bus 003 Device 003: ID 8087:0032 Intel Corp. AX210 Bluetooth
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

I’m the kernel/ACPI hacker looking into this. There is a very informative commit message around the port.c::link_peers() code that explains the core issue:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/core/port.c?id=6406eeb3f5bb376c7d9674e61f8da34ce7f05e8d

Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the XHCI
hub has 8 ports. Most of those ports are of connect type USB_PORT_NOT_USED
(including port 1 of the EHCI hub).
Further the unused ports have location data initialized to 0x80000000.
Now each unused port on the xhci hub walks the port list and finds a matching
peer with port1 of the EHCI hub because the zero’d out group id bits falsely match.
After port1 of the XHCI hub, each following matching peer will generate the
above warning

Combined with the over-current warnings I believe what is happening is this: from an electrical point of view, over-current means current was detected. How? obviously, a current detection circuit that has an ADC so the host can read the value. If we have unconnected ports that the kernel thinks are active the ADC may well report a floating value since the port maybe isn’t tied to ground, which could cause this kind of issue.

2 Likes

Thanks @khimaros, looks like @iam_tj is looking into it from the kernel side of things.

We’re working releasing a BIOS update and we’d want to test against that.

@Matt_Hartley Do you happen to have access to the source of the SSDTs, specifically those dealing with the USB host controller/port configuration? It looks like:
ACPI: SSDT 0x0000000075BE7000 0011C3 (v02 INSYDE UsbCTabl 00001000 INTL 20160422)

That would help if the source has additional commentary compared to decompiling the DSDT/SSDTs off @khimaros PC (I do not have a Framework device; I hack on Debian and kernel ‘stuff’ - been working with/on ACPI since around 2008).

1 Like

that’s right. @iam_tj is the person i was working with to reach this point. many thanks to them for their patience and tenacity!

i’ve sent the binary ACPI tables to them to help with digging further into the root cause but i ultimately the fix will likely need some BIOS work.

@Matt_Hartley i may have misspoken about the BIOS version. i’m running 3.17. is there a newer version that i’m glossing over in that thread?

@Matt_Hartley To be more specific, the ACPI Source Language (ASL) for SSDT7 (Intel “xh_tudd4”) and SSDT8 (Insyde “UsbCTabl”) - these contain the _PLD and _UPC methods that determine the placement and capabilities of each USB port.

The source should be using the ToPLD() macro to make the bitmapped value in the buffer (struct) easy to understand - reverse-engineering each one from the bitmaps is time-consuming and prone to errors.

For an example see sections 6.1 (_PLD) and 9.2 (_UPC) of the ACPI specification v6.5.

Source-code of the ToPLD() macro is at https://github.com/acpica/acpica/blob/master/source/compiler/aslpld.c

I am a Debian testing user on a 12th Gen. Seeing users, kernel hackers, and Framework peeps working together like this blows me away. Don’t mind me; I’m here cheering you all on from the side!

while not ideal, here’s a workaround for this issue:

# cat /usr/lib/systemd/system-sleep/xhci-workaround
#!/bin/sh
set -e

[ "$1" = "pre" ]  && exec /usr/sbin/modprobe -r xhci_pci
[ "$1" = "post" ] && exec /usr/sbin/modprobe xhci_pci

exit 0

N.B. it will prevent you from waking up your laptop using USB devices such as a keyboard or mouse.

1 Like

I do not, I’ll tag in @Kieran_Levin to see if he has insight.

For those using Debian, this appears to be a workaround from @khimaros