[TRACKING] Kernel regression: WiFi-related hangs, fails to suspend with kernel 6.6.5 - fixed in 6.6.6

Yeah, I just wasted 4 hours trying to understand why my laptop was suddenly not working anymore before realizing my kernel had been updated.

Could you tell me the procedure for getting the rawhide kernel on F39?

https://fedoraproject.org/wiki/RawhideKernelNodebug

Although you can also just roll back to the previous kernel in stable by doing:

sudo dnf downgrade 'kernel*'

and you can also use the sentry-fsync copr:

sentry/kernel-fsync Copr

as an alternative to rawhide. Unless you plan on testing patches, I would probably advise against rawhide.

Thanks @jwp for the links. They help a lot!

@jwp I’ve been fumbling in the dark, given my complete lack of any recent kernel-building experience (the distro workflow especially), so please take this with a generous dose of salt.

Given that:

  • regression is between 6.6.4 and 6.6.5
  • it’s possibly a deadlock? (I see mutex acquisition calls in the stack)
  • specific to the mediatek wireless/BT adapter based on all reports I’ve seen so far
  • is apparently “fixed” in 6.7-rc4

I looked at the changes between 6.6.4 and 6.6.5. There’s nothing mediatek-specific, but there’s this one that touches some common wireless code.

6.7-rc4 seems to incorporate more extensive changes in the wireless stack and drivers. Among them there’s this one which seems to be adding some more mutex acquisition to that same file and, if my git-fu isn’t failing me, is not part of 6.6.5.

Just putting all this here in case it’s useful.

Same thing happened to me on Arch with 6.6.5 and a Mediatek card; I had to downgrade to 6.6.4. I saw problems with suspend, WiFi disconnecting after a while, the desktop freezing after a while, and reboots taking a long time.

This is getting traction in the kernel devs mailing list (link in the bugzilla referenced in OP if you’re curious). So hopefully we’ll see an official fix soon.

Me too, same problem with Arch and kernel 6.6.5.

The thread title needs to be changed, as this is not specific to AMD mainboards. I have the same bug on an 11th Gen Intel i5 + MT7921K (RZ608 clone). As for others, it started when 6.6.4 was upgraded to 6.6.5.

Kernel dev discussion thread here: Re: [PATCH 6.6 074/134] wifi: cfg80211: fix CQM for non-range use - Sven Joachim

Bug being tracked here: 218247 – brcmfmac wifi stopped working with kernel 6.6.5

Thanks a lot for the find, this was driving me crazy on a Sunday evening :slight_smile:

I’ve also seen this. On Arch it affects both the current mainline and the current LTS kernels, interestingly. I guess the patch was seen as low risk?

Oooh.
Linux 6.6.6
The Kernel of the Beast. :fearful:

Thanks for sharing the link, I think this has been the source of my pains.

A few days ago my LTS kernel was updated from 6.1.57 to 6.1.66 and suddenly my laptop started hanging. I traced the issue to a bunch of these:

INFO: task wpa_supplicant:4349 blocked for more than 120 seconds.
      Tainted: G           O       6.6.5-gentoo-x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:wpa_supplicant  state:D stack:0     pid:4349  ppid:1      flags:0x00000002
Call Trace:
 <TASK>
 __schedule+0x3a5/0xab0
 schedule+0x61/0xe0
 schedule_preempt_disabled+0x14/0x30
 __mutex_lock.constprop.0+0x392/0x6c0
 ? srso_alias_return_thunk+0x5/0x7f
 nl80211_authenticate+0x350/0x3d0 [cfg80211]
 genl_family_rcv_msg_doit+0x108/0x180
 genl_rcv_msg+0x1d6/0x2e0
 ? __pfx_nl80211_pre_doit+0x10/0x10 [cfg80211]
 ? __pfx_nl80211_authenticate+0x10/0x10 [cfg80211]
 ? __pfx_nl80211_post_doit+0x10/0x10 [cfg80211]
 ? __pfx_genl_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 genl_rcv+0x24/0x40
 netlink_unicast+0x19f/0x290
 netlink_sendmsg+0x260/0x500
 ____sys_sendmsg+0x3c0/0x400
 ? copy_msghdr_from_user+0x8b/0xd0
 ___sys_sendmsg+0xa5/0x100
 __sys_sendmsg+0x9e/0x110
 ? _copy_from_user+0x2b/0x70
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fdc5d931a30
RSP: 002b:00007ffe3bcbe398 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000055b23372e120 RCX: 00007fdc5d931a30
RDX: 0000000000000000 RSI: 00007ffe3bcbe3d0 RDI: 0000000000000005
RBP: 000055b2337c0d60 R08: 0000000000000004 R09: 000000000000000d
R10: 00007ffe3bcbe4b4 R11: 0000000000000202 R12: 000055b23372e030
R13: 00007ffe3bcbe3d0 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

That seemed to point to the network adapter modules. The kernel discussion thread also appears to confirm that the regression was identified in 6.1.66 as well, which explains why trying the 6.6.x series didn’t resolve the issue.
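
If anyone wants to scan a saved log for reports like the one above, here is a minimal Python sketch. It assumes the lines look exactly as pasted here (no timestamp or journal prefix in front of each line); `summarize_hung_tasks` and the two patterns are names made up for this example, not part of any tool. It pulls out the blocked task and the first reliable stack frame that belongs to a loadable module, which is how the trace above points at cfg80211.

```python
import re

# Header line of a hung-task report, e.g.
# "INFO: task wpa_supplicant:4349 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# A stack frame tagged with a module, e.g.
# " nl80211_authenticate+0x350/0x3d0 [cfg80211]".
# Frames prefixed with "?" are unreliable guesses and do not match.
MODULE_FRAME = re.compile(
    r"^\s*(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)


def summarize_hung_tasks(log_text):
    """Return one dict per hung-task report: task name, pid, timeout, and the
    first reliable stack frame that belongs to a loadable module (if any)."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_TASK.search(line)
        if header:
            current = {
                "task": header["name"],
                "pid": int(header["pid"]),
                "seconds": int(header["secs"]),
                "module_frame": None,
            }
            reports.append(current)
            continue
        if current is not None and current["module_frame"] is None:
            frame = MODULE_FRAME.match(line)
            if frame:
                current["module_frame"] = (frame["func"], frame["module"])
    return reports
```

Fed the trace above, this should report wpa_supplicant (pid 4349) with `nl80211_authenticate` in `cfg80211` as the module frame. Logs copied from `journalctl` or `dmesg` usually carry a prefix on every line, so the frame pattern would need loosening for those.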

Kernel of the Beast indeed! I’m never upgrading…from 6.6.6.

Manjaro updated to 6.6.5 yesterday and I had all sorts of problems with hangs and crashes after suspending. Dropped back to 6.5.13 for the moment.

@Matt_Hartley AMD with the default wifi card. Manjaro has since updated to 6.6.6 and seems to be working properly.

Appreciate this everyone - can I get a list of wireless cards in use as you see this, so I can replicate and file a bug report, track, etc.

Distro: (Ideally something we test against like Fedora)

Wi-Fi card brand/model: Mediatek or Intel, if Intel, which?

Laptop model: Framework 13 11th, 12th, 13th or AMD?

It seems to have affected any distro that picked up 6.6.5 (which never made it to stable on Fedora).

It’s been fixed as of kernel 6.6.6, but if needed for further tracking:

Fedora 39
Mediatek 7922 (aka AMD RZ616)
Framework 13 AMD

It has also been seen elsewhere on different hardware (the kernel bugzilla issue is on a Pinebook with Broadcom WiFi). The affected code wasn’t hardware-specific, but it seems some hardware’s behavior triggered the bug.

Very helpful, thank you. This gives us a jumping off point.

Can you confirm that you’re running 6.6.6 or later? uname -r

I ask because F39 seems to have an (installation-time?) bug whereby the configuration is not bootstrapped to turn on “when installing a kernel version, make it the default”.

See thread right about here:

If you’re hitting that issue, after fixing/creating the config file referenced there, you can either reinstall 6.6.6

sudo dnf reinstall kernel{,-core,-modules,-modules-core,-modules-extra}-6.6.6

or go for 6.6.7 which is already looking good in updates-testing

sudo dnf --enablerepo updates-testing --refresh --setopt=fastestmirror=true up 'kernel*'

To trigger the make-default logic.
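
For the "6.6.6 or later" check, a rough Python sketch that does the comparison on the running kernel might look like this. `kernel_version` and `is_at_least` are names invented for this example, and the sketch only compares the leading version numbers, so it says nothing about whether an older series such as 6.1.x carries a backported fix.

```python
import platform
import re


def kernel_version(release):
    """Parse the leading 'major.minor.patch' of a release string such as
    '6.6.5-gentoo-x86_64' into a tuple of ints; a missing patch counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_at_least(release, minimum=(6, 6, 6)):
    """True if the release is at or past `minimum` (6.6.6 by default,
    the version this thread reports as fixed)."""
    return kernel_version(release) >= minimum


if __name__ == "__main__":
    running = platform.release()  # the same string that `uname -r` prints
    verdict = "6.6.6 or later" if is_at_least(running) else "older than 6.6.6"
    print(running, "->", verdict)
```

Tuples compare element by element, so 6.10.x sorts after 6.6.x, which a plain string comparison would get wrong.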

Just to be clear, what is the symptom of the hang, other than the logs?

NetworkManager-dispatcher and packagekit are also exiting successfully here. As far as nm-d goes, that’s how I always remember it; it’s used on network changes, IIRC. I never paid much attention to packagekit in the logs. It’s normal for some systemd units to run just once, or periodically, or on some other trigger; they’re not all “classic” daemons.

$ systemctl status NetworkManager-dispatcher.service 
○ NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service
     Loaded: loaded (/usr/lib/systemd/system/NetworkManager-dispatcher.service; enabled; preset: enabled)
...
Dec 17 19:03:49 angua systemd[1]: Started NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service.
Dec 17 19:04:03 angua systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
$ systemctl status packagekit.service 
○ packagekit.service - PackageKit Daemon
     Loaded: loaded (/usr/lib/systemd/system/packagekit.service; static)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: inactive (dead)

Dec 17 13:21:05 angua systemd[1]: packagekit.service: Deactivated successfully.
Dec 17 19:04:09 angua systemd[1]: Starting packagekit.service - PackageKit Daemon...
Dec 17 19:04:09 angua PackageKit[22717]: daemon start
Dec 17 19:04:09 angua systemd[1]: Started packagekit.service - PackageKit Daemon.
Dec 17 19:04:10 angua PackageKit[22717]: get-updates transaction /5438_ddbabdbc from uid 1000 finished with success after 641ms
Dec 17 19:04:10 angua PackageKit[22717]: get-updates transaction /5439_daaccacb from uid 1000 finished with success after 77ms
Dec 17 19:04:39 angua PackageKit[22717]: get-updates transaction /5440_caaebcbe from uid 1000 finished with success after 78ms
Dec 17 19:04:39 angua PackageKit[22717]: get-updates transaction /5441_abdbdebb from uid 1000 finished with success after 77ms
Dec 17 19:09:44 angua PackageKit[22717]: daemon quit
Dec 17 19:09:44 angua systemd[1]: packagekit.service: Deactivated successfully.

NetworkManager.service, on the other hand, is a long-running daemon:

$ systemctl status NetworkManager.service 
● NetworkManager.service - Network Manager
     Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: active (running) since Sun 2023-12-17 09:51:46 PST; 9h ago
...
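
To tell the two cases apart without reading the status output by eye, something like the Python sketch below could work. It assumes the `Key=Value` output of `systemctl show` with the `ActiveState`, `SubState` and `Result` properties; the function names and the three labels are made up for this example, and the classification is deliberately coarse.

```python
import subprocess


def parse_properties(show_output):
    """Turn 'Key=Value' lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def describe_unit(props):
    """Coarse classification of a unit from its state properties."""
    active = props.get("ActiveState")
    if active == "active" and props.get("SubState") == "running":
        return "running"
    # An inactive unit with Result=success either exited cleanly or has not
    # been started yet; neither case indicates a failure.
    if active == "inactive" and props.get("Result") == "success":
        return "not running, no failure recorded"
    return "needs a look"


def query_unit(unit):
    """Ask systemd for the three properties of `unit` and classify them."""
    out = subprocess.run(
        ["systemctl", "show", "--property=ActiveState,SubState,Result", unit],
        capture_output=True, text=True, check=True,
    ).stdout
    return describe_unit(parse_properties(out))
```

On the machine above, `query_unit("packagekit.service")` would be expected to land in the "not running, no failure recorded" bucket and `query_unit("NetworkManager.service")` in "running".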