[RESPONDED] Strange WLAN problems with kernel branch > 6.2

Hi,

I bought a Framework 13 laptop with an AMD Ryzen 7840U and installed Debian Bookworm. The out-of-the-box kernel 6.1.64 worked quite well, and I recompiled the vanilla 6.1.67 kernel with all the options that should be set for a Framework laptop (collected from different forum threads). With this kernel everything works fine (WLAN, standby, camera, etc.).

Then I tried to compile a newer kernel. Up to 6.2.16 WLAN still works as expected, but all newer kernels I tried (6.3.13, 6.4.16, 6.5.13, 6.6.6) have a strange connection problem.

I did a very simple test: I pinged a local PC from my Framework laptop and, in the other direction, pinged the laptop from the local PC. The ping from my laptop to my PC always works as expected, with nearly constant round-trip times. But the ping from the local PC to my laptop gets worse with kernels >= 6.3: I see lost packets and RTTs that sometimes exceed 3000 ms.

The strange thing is that the problem only occurs when the ping starts from the local PC; from the laptop it always works. I tried another local PC and the result is the same.

I saved the ping RTT results on Pastebin:
Ping from Laptop to PC (ok)
Ping from PC to Laptop (problems)

Maybe I have to change another kernel compile option to get it working with newer kernels, but at the moment I use the same base config with Framework-specific settings for all kernels.

What I can see is that directly after boot, once WLAN is connected, the ping from my PC is quite good. But it only takes a minute or so before it starts getting worse.
I can’t use ssh, for example, to connect to my laptop because of the high RTT and packet loss.

At the moment I am staying on kernel 6.1.67, but I read that many people use newer kernels and it seems to work for them. So I wonder what I could try in order to find the culprit.

Ciao,
Rainer

Hi @RW1 ,

Welcome to the forums. Can you try to compare the signal strength between the two kernels with wavemon?

Also, there might be errors in dmesg when the signal drops on a 6.3+ kernel; try checking that as well.

Hi @Loell_Framework,

thanks for the fast response.

I repeated the same tests, pinging the laptop from my PC with kernels 6.1.67 and 6.6.6, and had wavemon and dmesg open. The following screenshots were taken while the laptop was being pinged.

Kernel 6.1.67

Kernel 6.6.6

Same behavior as before: with kernel 6.6.6, pings get lost and the RTTs are sporadically very high.

But I see no errors in dmesg that relate to the WLAN module.

Ciao,
Rainer

There seem to be some known issues with certain AP combinations with the mt76 stack. Disabling power save features seems to resolve the dropouts and latency, so much so that it’s included in several ‘performance’-oriented builds. Here is one such patch:

mt76:-mt7921:-Disable-powersave-features-by-default.patch

From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Jan200101 <sentrycraft123@gmail.com>
Date: Mon, 27 Nov 2023 15:25:48 +0100
Subject: [PATCH] mt76: mt7921: Disable powersave features by default

This brings WiFi latency down considerably and makes latency consistent by
disabling runtime PM and typical powersave features by default. The actual
power consumption difference is inconsequential on desktops and laptops,
while the performance difference is monumental. Latencies of 20+ ms are no
longer observed after this change, and the connection is much more stable.

Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
 drivers/net/wireless/mediatek/mt76/mt7921/init.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/init.c b/drivers/net/wireless/mediatek/mt76/mt7921/init.c
index ff63f37f67d9..840b4c606c83 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7921/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7921/init.c
@@ -220,12 +220,6 @@ int mt7921_register_device(struct mt792x_dev *dev)
        dev->pm.idle_timeout = MT792x_PM_TIMEOUT;
        dev->pm.stats.last_wake_event = jiffies;
        dev->pm.stats.last_doze_event = jiffies;
-       if (!mt76_is_usb(&dev->mt76)) {
-               dev->pm.enable_user = true;
-               dev->pm.enable = true;
-               dev->pm.ds_enable_user = true;
-               dev->pm.ds_enable = true;
-       }
 
        if (!mt76_is_mmio(&dev->mt76))
                hw->extra_tx_headroom += MT_SDIO_TXD_SIZE + MT_SDIO_HDR_SIZE;
@@ -240,6 +234,8 @@ int mt7921_register_device(struct mt792x_dev *dev)
        if (ret)
                return ret;
 
+       hw->wiphy->flags &= ~WIPHY_FLAG_PS_ON_BY_DEFAULT;
+
        hw->wiphy->reg_notifier = mt7921_regd_notifier;
        dev->mphy.sband_2g.sband.ht_cap.cap |=
                        IEEE80211_HT_CAP_LDPC_CODING |

Thanks @jwp for this information.

I applied the patch to the 6.6.6 kernel source and now the ping problems from PC to laptop are gone.

263 packets transmitted, 263 received, 0% packet loss, time 262368ms
rtt min/avg/max/mdev = 1.153/5.467/126.676/11.153 ms
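
For comparing kernels it can help to pull the numbers out of the ping summary automatically instead of reading them by eye. The following is a minimal sketch, assuming the summary format printed by the Linux `ping` shown above; the function name `parse_ping_summary` and the dictionary keys are made up for illustration.

```python
import re


def parse_ping_summary(text):
    """Extract packet counts, loss and RTT statistics from a ping summary."""
    counts = re.search(
        r"(\d+) packets transmitted, (\d+) received.*?([\d.]+)% packet loss",
        text,
    )
    if counts is None:
        raise ValueError("no ping summary found")

    stats = {
        "transmitted": int(counts.group(1)),
        "received": int(counts.group(2)),
        "loss_percent": float(counts.group(3)),
    }

    # The rtt line is missing when no reply was received at all.
    rtt = re.search(
        r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", text
    )
    if rtt is not None:
        for name, value in zip(("min", "avg", "max", "mdev"), rtt.groups()):
            stats[f"rtt_{name}_ms"] = float(value)
    return stats


# The summary posted above (patched 6.6.6 kernel, ping from PC to laptop).
SUMMARY = """263 packets transmitted, 263 received, 0% packet loss, time 262368ms
rtt min/avg/max/mdev = 1.153/5.467/126.676/11.153 ms"""

stats = parse_ping_summary(SUMMARY)
print(stats["loss_percent"], stats["rtt_avg_ms"], stats["rtt_max_ms"])
```

Running the same parser over the output from an unpatched kernel gives directly comparable loss and RTT figures.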

I guess there is no kernel compile-time or runtime option to do this without modifying the driver code?

So I have to remember to modify the source code whenever I compile a newer kernel.

Ciao,
Rainer

Has anyone raised that for discussion on a kernel mailing list? If so can you please link it?

If you tell me what I have to do, I will do it.
Or maybe @jwp can do this because he seems to know more about this problem.

I would like to see the patch as a compile option or runtime parameter in newer kernels.

Ciao,
Rainer

Well I’m not sure where that patch actually came from, maybe it was already on a mailing list. It’s just news to me, so I was wanting to see what Mediatek kernel developers have had to say about it.

There is an obvious power consumption trade-off with a patch like that, so I think it’s very important to quantify how much worse it makes things in some predictable workloads (idle, transferring content, etc.) to decide whether it’s a good idea in general.
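
One rough way to put numbers on that trade-off is to sample the battery discharge power with and without the patch while running the same workload. The sketch below is only an illustration under assumptions: it reads the standard `power_supply` sysfs attributes (`power_now` in microwatts, or `current_now` in microamps and `voltage_now` in microvolts), and the directory name `BAT1` in the usage comment is a guess that may differ per machine. The function names are made up for this example.

```python
from pathlib import Path


def watts_from_microunits(current_ua, voltage_uv):
    """Convert microamps and microvolts to watts (sign of the current ignored)."""
    return abs(current_ua) * voltage_uv / 1e12


def read_int(path):
    """Read a sysfs attribute that holds a single integer."""
    return int(Path(path).read_text().strip())


def battery_power_watts(supply_dir):
    """Return the instantaneous battery power in watts for one power supply."""
    supply = Path(supply_dir)
    power_now = supply / "power_now"
    if power_now.exists():
        # Some drivers report power directly, in microwatts.
        return abs(read_int(power_now)) / 1e6
    # Others only report current and voltage.
    return watts_from_microunits(
        read_int(supply / "current_now"), read_int(supply / "voltage_now")
    )


# Example use on battery power (path is an assumption, check /sys/class/power_supply):
#   print(battery_power_watts("/sys/class/power_supply/BAT1"))
```

Averaging many samples over a few minutes per workload, on battery and with the screen brightness fixed, would make the comparison between the two kernels more meaningful than a single reading.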

If it’s just from a distro or user somewhere, then I think it needs to be raised still.
I suggest using ./scripts/get_maintainer.pl to find the right people to discuss it with and then sending it up for their feedback.

Maybe there is a happy medium to change the power save policy to not be so aggressive for this card?

Just glancing at the code (I’m not familiar with it), maybe MT792x_PM_TIMEOUT is too aggressive and should be increased instead.

When I google this problem I find several sites with similar patches. They are not exactly the same as the one @jwp posted, but they appear to do the same thing.
Site1
Site2

When I run ./scripts/get_maintainer.pl -f drivers/net/wireless/mediatek/mt76/mt7921/init.c I get a few responsible people and linux-wireless@vger.kernel.org as an open mailing list, listed under MEDIATEK MT76 WIRELESS LAN DRIVER.

Yeah, those all look effectively the same. I guess let me ask this: is this enough to fix it?

	wiphy->flags &= ~WIPHY_FLAG_PS_ON_BY_DEFAULT;

If so, there’s a strong case to be made, given that several other drivers don’t do it by default.

I can test whether this line alone fixes the problem.

I looked at the code more and I think it needs both hunks. I’ll submit something for discussion.

Here you go:
[PATCH 1/2] wifi: mt76: mt7921: Disable powersaving by default (kernel.org)


Thanks - that patch came from the sentry-fsync patches, which are used for the Nobara kernel.

I read in the mailing list the following statement:

So although it’s not pretty to look at, bad ping times to the AP aren’t representative of the full user experience.

That’s true, but it’s not only the ping that is bad.

Without the patch I can’t ssh into my laptop because the high round-trip time makes ssh unusably slow. Sometimes ssh freezes completely and no communication is possible.

Can you respond to the mailing list post with your observations? It’s not as severe for me at home so you may be able to help make the case for upstream changes.


@RW1 If you’re too busy to write to the mailing list, I might be able to do some tests this weekend and could try to send a mail myself. I will link the thread here so people can follow along until it is fixed.

@Mario_Limonciello Any idea whether this fix will be backported? Do you happen to know under which conditions a fix gets backported to older kernel releases?

Very unlikely to backport. This is a big policy change. Right now we need to first land on the right decision upstream.

Can you try the suggestions from the mailing list to change the policy manually in userspace, to see if it improves things for you?
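
The mailing-list suggestions themselves are not quoted in this thread, so the following is only a sketch of one userspace approach: toggling 802.11 power save on the interface with `iw dev <interface> set power_save off`. The interface name `wlan0` in the usage comment is an assumption, the helper names are made up, the command needs root, and the setting does not persist across reboots. It may also not cover everything the driver patch changes, such as the driver's own runtime PM.

```python
import subprocess


def power_save_command(interface, enabled):
    """Build the iw command line that switches power save on or off."""
    state = "on" if enabled else "off"
    return ["iw", "dev", interface, "set", "power_save", state]


def power_save_query_command(interface):
    """Build the iw command line that prints the current power save state."""
    return ["iw", "dev", interface, "get", "power_save"]


def set_power_save(interface, enabled):
    """Run iw to change the power save state (requires root privileges)."""
    subprocess.run(power_save_command(interface, enabled), check=True)


# Example use (interface name is an assumption, check `iw dev` for yours):
#   set_power_save("wlan0", False)
#   subprocess.run(power_save_query_command("wlan0"), check=True)
```

Repeating the PC-to-laptop ping test on an unpatched kernel after switching power save off would show whether the userspace setting alone is enough.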