[TRACKING] Kernel regression: WiFi-related hangs, fails to suspend with kernel 6.6.5 - fixed in 6.6.6

Update: This is addressed in kernel 6.6.6, the number of the fix. (Maiden exposure at a tender age, sorry.)

Earlier update, summary so far: with the Mediatek WiFi/BT card enabled in the BIOS as normal, the machine won’t suspend. This goes away if the card is disabled in the BIOS, or with the latest rawhide 6.7-rc4 kernel (not sure yet whether a fix is incoming for the 6.6 series).

Also, in the bodhi thread at least one ThinkPad 7840U user with a non-Mediatek card confirms no issues.

kernel update in bodhi

bugzilla entry

Dec 08 23:14:01 kernel: PM: suspend entry (s2idle)
Dec 08 23:14:01 kernel: Filesystems sync: 0.028 seconds
Dec 08 23:14:22 kernel: Freezing user space processes
Dec 08 23:14:22 kernel: Freezing user space processes failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
Dec 08 23:14:22 kernel: task:wpa_supplicant  state:D stack:0     pid:2331  ppid:1      flags:0x00004006
Dec 08 23:14:22 kernel: Call Trace:
Dec 08 23:14:22 kernel:  <TASK>
Dec 08 23:14:22 kernel:  __schedule+0x3ed/0x14c0
Dec 08 23:14:22 kernel:  ? sysvec_apic_timer_interrupt+0xe/0x90
Dec 08 23:14:22 kernel:  schedule+0x5e/0xd0
Dec 08 23:14:22 kernel:  schedule_preempt_disabled+0x15/0x30
Dec 08 23:14:22 kernel:  __mutex_lock.constprop.0+0x39a/0x6a0
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? __nla_reserve+0x3c/0x50
Dec 08 23:14:22 kernel:  nl80211_send_iface+0x25b/0x980 [cfg80211]
Dec 08 23:14:22 kernel:  ? kmalloc_reserve+0x62/0xf0
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? __alloc_skb+0xde/0x1a0
Dec 08 23:14:22 kernel:  nl80211_get_interface+0x4f/0xa0 [cfg80211]
Dec 08 23:14:22 kernel:  genl_family_rcv_msg_doit+0xef/0x150
Dec 08 23:14:22 kernel:  genl_rcv_msg+0x1b3/0x2c0
Dec 08 23:14:22 kernel:  ? __pfx_nl80211_pre_doit+0x10/0x10 [cfg80211]
Dec 08 23:14:22 kernel:  ? __pfx_nl80211_get_interface+0x10/0x10 [cfg80211]
Dec 08 23:14:22 kernel:  ? __pfx_nl80211_post_doit+0x10/0x10 [cfg80211]
Dec 08 23:14:22 kernel:  ? __pfx_genl_rcv_msg+0x10/0x10
Dec 08 23:14:22 kernel:  netlink_rcv_skb+0x58/0x110
Dec 08 23:14:22 kernel:  genl_rcv+0x28/0x40
Dec 08 23:14:22 kernel:  netlink_unicast+0x1a3/0x290
Dec 08 23:14:22 kernel:  netlink_sendmsg+0x254/0x4d0
Dec 08 23:14:22 kernel:  ____sys_sendmsg+0x396/0x3d0
Dec 08 23:14:22 kernel:  ? copy_msghdr_from_user+0x7d/0xc0
Dec 08 23:14:22 kernel:  ___sys_sendmsg+0x9a/0xe0
Dec 08 23:14:22 kernel:  __sys_sendmsg+0x7a/0xd0
Dec 08 23:14:22 kernel:  do_syscall_64+0x5d/0x90
Dec 08 23:14:22 kernel:  ? __sys_setsockopt+0xf2/0x1d0
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? do_syscall_64+0x6c/0x90
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? do_syscall_64+0x6c/0x90
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? do_syscall_64+0x6c/0x90
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
Dec 08 23:14:22 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Dec 08 23:14:22 kernel:  ? do_syscall_64+0x6c/0x90
Dec 08 23:14:22 kernel:  ? do_syscall_64+0x6c/0x90
Dec 08 23:14:22 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Dec 08 23:14:22 kernel: RIP: 0033:0x7fb08d535a24
Dec 08 23:14:22 kernel: RSP: 002b:00007ffeb2be7d58 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
Dec 08 23:14:22 kernel: RAX: ffffffffffffffda RBX: 000056283e26f3c0 RCX: 00007fb08d535a24
Dec 08 23:14:22 kernel: RDX: 0000000000000000 RSI: 00007ffeb2be7d90 RDI: 0000000000000006
Dec 08 23:14:22 kernel: RBP: 00007ffeb2be7d80 R08: 0000000000000004 R09: 0000000000000001
Dec 08 23:14:22 kernel: R10: 00007ffeb2be7e90 R11: 0000000000000202 R12: 000056283e300020
Dec 08 23:14:22 kernel: R13: 000056283e271de0 R14: 00007ffeb2be7d90 R15: 000056283cd90050
Dec 08 23:14:22 kernel:  </TASK>
Dec 08 23:14:22 kernel: OOM killer enabled.
Dec 08 23:14:22 kernel: Restarting tasks ... done.
Dec 08 23:14:22 kernel: random: crng reseeded on system resumption
Dec 08 23:14:22 kernel: PM: suspend exit
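
For anyone wanting to check their own logs for the same failure, here is a minimal Python sketch that scans kernel log text for the "refusing to freeze" message and reports the first task listed after it. The function name `find_freeze_failures` and both regular expressions are my own, written against the log format shown above; other kernels may word these lines differently.

```python
import re

# Patterns written against the log lines quoted above (an assumption, not a stable format).
FREEZE_FAILED = re.compile(r"Freezing user space processes failed after ([\d.]+) seconds")
TASK_LINE = re.compile(r"task:(\S+)\s+state:(\S)\s.*?pid:(\d+)")

def find_freeze_failures(log_text):
    """Return (seconds, task_name, state, pid) for each failed freeze.

    Only the first task line after each failure message is recorded.
    """
    results = []
    pending = None  # timeout from the latest failure line, waiting for its task line
    for line in log_text.splitlines():
        failed = FREEZE_FAILED.search(line)
        if failed:
            pending = float(failed.group(1))
            continue
        task = TASK_LINE.search(line)
        if task and pending is not None:
            results.append((pending, task.group(1), task.group(2), int(task.group(3))))
            pending = None
    return results

sample = """\
Dec 08 23:14:22 kernel: Freezing user space processes failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
Dec 08 23:14:22 kernel: task:wpa_supplicant  state:D stack:0     pid:2331  ppid:1      flags:0x00004006
"""
print(find_freeze_failures(sample))
```

To run it on real data, save the kernel messages to a file first (for example the output of `journalctl -k`) and pass the file contents to the function.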

The picture so far (see the latest comments in the bodhi and bugzilla threads) points to the Mediatek adapter/driver. The current rawhide kernel does not show the issue, FWIW.

Yeah, I just wasted 4 hours trying to understand why my laptop was suddenly not working anymore before realizing my kernel had been updated.

Could you tell me what the procedure is to get the rawhide kernel on F39?

https://fedoraproject.org/wiki/RawhideKernelNodebug

You can also just roll back to the previous kernel in stable by doing:

sudo dnf downgrade kernel*

Or, as an alternative to rawhide, you can use the sentry kernel-fsync copr:

sentry/kernel-fsync Copr

Unless you plan on testing patches, I would probably advise against rawhide.
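
Before downgrading, it may be worth confirming that the running kernel is actually one of the versions reported as affected. A minimal Python sketch, assuming only the versions named in this thread (6.6.5 and 6.1.66); the `AFFECTED` set and the helper names are illustrative, not an authoritative list:

```python
import platform
import re

# Versions reported as affected in this thread; not an authoritative list.
AFFECTED = {(6, 6, 5), (6, 1, 66)}

def parse_release(release):
    """Extract (major, minor, patch) from a release string such as '6.6.5-gentoo-x86_64'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if not match:
        return None  # e.g. a release string without a patch level
    return tuple(int(part) for part in match.groups())

def is_affected(release):
    """True if the release string parses to one of the versions listed in AFFECTED."""
    return parse_release(release) in AFFECTED

if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    status = "reported as affected" if is_affected(running) else "not in the reported list"
    print(f"{running}: {status}")
```

A result of "not in the reported list" only means the version was not mentioned here; it says nothing about other regressions.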

Thanks @jwp for the links. It helps a lot !

@jwp I’ve been fumbling in the dark given my complete lack of any recent kernel building experience (especially with the distro workflow), so please take this with a generous dose of salt.

Given that:

  • regression is between 6.6.4 and 6.6.5
  • it’s possibly a deadlock? (I see mutex acquisition calls in the stack)
  • specific to the mediatek wireless/BT adapter based on all reports I’ve seen so far
  • is apparently “fixed” in 6.7-rc4

I looked at the changes between 6.6.4 and 6.6.5. There’s nothing mediatek-specific, but there’s this one that touches some common wireless code.

6.7-rc4 seems to incorporate more extensive changes in the wireless stack and drivers. Among them there’s this one which seems to be adding some more mutex acquisition to that same file and, if my git-fu isn’t failing me, is not part of 6.6.5.

Just putting all this here in case it’s useful.

Same thing happened to me on Arch with 6.6.5 and a Mediatek card; I had to downgrade to 6.6.4. I saw problems with suspend, WiFi disconnecting after a while, the desktop freezing after a while, and reboots taking a long time.

This is getting traction in the kernel devs mailing list (link in the bugzilla referenced in OP if you’re curious). So hopefully we’ll see an official fix soon.

Me too, same problem with arch and kernel 6.6.5

The thread title needs to be changed, as this is not specific to AMD mainboards. I have the same bug on an 11th Gen Intel i5 + MT7921K (RZ608 clone). Like others, it started when 6.6.4 was upgraded to 6.6.5.

Kernel dev discussion thread here: Re: [PATCH 6.6 074/134] wifi: cfg80211: fix CQM for non-range use - Sven Joachim

Bug being tracked here: 218247 – brcmfmac wifi stopped working with kernel 6.6.5

Thanks a lot for the find, this was driving me crazy on a sunday evening :slight_smile:

I’ve also seen this, in arch it’s for both the current mainline and the current LTS, interestingly. I guess the patch was seen as low risk?

Oooh.
Linux 6.6.6
The Kernel of the Beast. :fearful:

Thanks for sharing the link, I think this has been the source of my pains.

A few days ago I had an update to the LTS kernel from 6.1.57 to 6.1.66 and suddenly my laptop started hanging. I traced the issue to a bunch of these:

INFO: task wpa_supplicant:4349 blocked for more than 120 seconds.
      Tainted: G           O       6.6.5-gentoo-x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:wpa_supplicant  state:D stack:0     pid:4349  ppid:1      flags:0x00000002
Call Trace:
 <TASK>
 __schedule+0x3a5/0xab0
 schedule+0x61/0xe0
 schedule_preempt_disabled+0x14/0x30
 __mutex_lock.constprop.0+0x392/0x6c0
 ? srso_alias_return_thunk+0x5/0x7f
 nl80211_authenticate+0x350/0x3d0 [cfg80211]
 genl_family_rcv_msg_doit+0x108/0x180
 genl_rcv_msg+0x1d6/0x2e0
 ? __pfx_nl80211_pre_doit+0x10/0x10 [cfg80211]
 ? __pfx_nl80211_authenticate+0x10/0x10 [cfg80211]
 ? __pfx_nl80211_post_doit+0x10/0x10 [cfg80211]
 ? __pfx_genl_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 genl_rcv+0x24/0x40
 netlink_unicast+0x19f/0x290
 netlink_sendmsg+0x260/0x500
 ____sys_sendmsg+0x3c0/0x400
 ? copy_msghdr_from_user+0x8b/0xd0
 ___sys_sendmsg+0xa5/0x100
 __sys_sendmsg+0x9e/0x110
 ? _copy_from_user+0x2b/0x70
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fdc5d931a30
RSP: 002b:00007ffe3bcbe398 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000055b23372e120 RCX: 00007fdc5d931a30
RDX: 0000000000000000 RSI: 00007ffe3bcbe3d0 RDI: 0000000000000005
RBP: 000055b2337c0d60 R08: 0000000000000004 R09: 000000000000000d
R10: 00007ffe3bcbe4b4 R11: 0000000000000202 R12: 000055b23372e030
R13: 00007ffe3bcbe3d0 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

This seemed to suggest issues related to the network adapter modules. The kernel discussion thread appears to confirm it, and identifies the problem in 6.1.66 as well. It also explains why trying out the 6.6.x series didn’t resolve the issue.
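
In case it helps others read traces like this one: lines starting with "?" are addresses found on the stack that are not part of the reliable call chain, so the interesting entries are the unmarked ones, and the first unmarked frame tagged with a module name shows where the blocked task entered module code. A rough Python sketch of that reading; the regular expression and the function names are my own and assume the trace format shown above:

```python
import re

# A frame looks like " nl80211_authenticate+0x350/0x3d0 [cfg80211]".
# Group 1: optional "?" marker, group 2: function, group 3: optional module name.
FRAME = re.compile(r"^\s*(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$")

def reliable_frames(trace_text):
    """Return (function, module_or_None) for every frame not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        # Drop a journal prefix such as "Dec 08 23:14:22 kernel: " if present.
        body = line.split("kernel: ", 1)[-1]
        match = FRAME.match(body)
        if match and not match.group(1):
            frames.append((match.group(2), match.group(3)))
    return frames

def first_module_frame(trace_text):
    """Return the innermost reliable frame that belongs to a loadable module, if any."""
    for function, module in reliable_frames(trace_text):
        if module:
            return function, module
    return None

sample = """\
Call Trace:
 <TASK>
 __schedule+0x3a5/0xab0
 schedule+0x61/0xe0
 schedule_preempt_disabled+0x14/0x30
 __mutex_lock.constprop.0+0x392/0x6c0
 ? srso_alias_return_thunk+0x5/0x7f
 nl80211_authenticate+0x350/0x3d0 [cfg80211]
 genl_family_rcv_msg_doit+0x108/0x180
"""
print(first_module_frame(sample))
```

On the trace above this points at `nl80211_authenticate` in `cfg80211`, sitting directly above the mutex lock frames, which fits the picture of wpa_supplicant waiting on a lock in the common wireless code rather than in a vendor driver.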

Kernel of the Beast indeed! I’m never upgrading…from 6.6.6.

Manjaro updated to 6.6.5 yesterday and I had all sorts of problems with hangs and crashes after suspending. Dropped back to 6.5.13 for the moment.

@Matt_Hartley AMD with the default wifi card. Manjaro has since updated to 6.6.6 and seems to be working properly.

Appreciate this everyone - can I get a list of wireless cards in use as you see this, so I can replicate and file a bug report, track, etc.

Distro: (Ideally something we test against like Fedora)

Wi-Fi card brand/model: Mediatek or Intel, if Intel, which?

Laptop model: Framework 13 11th, 12th, 13th or AMD?

It seems to have affected any distro that picked up 6.6.5 (which never made it to stable on Fedora).

It’s been fixed as of kernel 6.6.6, but if needed for further tracking:

Fedora 39
Mediatek 7922 (aka AMD RZ616)
Framework 13 AMD

It has also been seen elsewhere on different hardware (the kernel bugzilla issue is on a Pinebook with Broadcom WiFi). The affected code wasn’t hardware-specific, but it seems some hardware’s behavior triggered the bug.

Very helpful, thank you. This gives us a jumping off point.