FW16 freeze + reboot on Ubuntu 24.04 and latest kernel

Since I updated to kernel 6.8.0-36-generic last week, my FW 16, after resuming from sleep, will freeze after a couple of minutes and reboot by itself.

It does not happen on kernel 6.8.0-31-generic.

Anybody observing the same problem?

Right before it reboots I get this in dmesg and the kernel log:

==> kern.log <==
2024-06-30T16:27:40.003817+01:00 steiner-fw16 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
2024-06-30T16:27:40.003849+01:00 steiner-fw16 kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
2024-06-30T16:27:40.003850+01:00 steiner-fw16 kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

Try Framework support and report it upstream. This is definitely either a power-saving bug in the NVMe controller or a bug in the kernel.
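If you want to test the workaround the kernel itself suggests in that log, the two parameters have to end up on the kernel command line. On Ubuntu that usually means adding them to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then running sudo update-grub and rebooting. Here is a rough Python sketch of the edit. The helper name add_kernel_params is made up for illustration, and it only transforms text, it does not touch the real file:

```python
import re

# Parameters quoted from the kernel's own hint in the log above.
NVME_WORKAROUND = ("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off")


def add_kernel_params(grub_text, params=NVME_WORKAROUND):
    """Return grub_text with params appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper: parameters that are already present are not
    added a second time, so running it twice is harmless.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def patch(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, count = pattern.subn(patch, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text


# Sample content, assumed to resemble a stock /etc/default/grub.
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_params(sample))
```

After rebooting, cat /proc/cmdline shows whether the parameters were actually picked up.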

Oh, btw, which NVMe drive do you use? It might have something to do with the controller on it.
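In case it helps with answering that, here is a small Python sketch that reads the model and firmware revision of each NVMe controller from sysfs. It assumes the usual /sys/class/nvme layout on Linux; the function name list_nvme is just for illustration:

```python
from pathlib import Path


def list_nvme(sysfs_root="/sys/class/nvme"):
    """Return {controller: {"model": ..., "firmware_rev": ...}} read from sysfs."""
    drives = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No NVMe class directory (or not Linux): nothing to report.
        return drives
    for ctrl in sorted(root.iterdir()):
        info = {}
        for attr in ("model", "firmware_rev"):
            attr_file = ctrl / attr
            info[attr] = (
                attr_file.read_text().strip() if attr_file.is_file() else "unknown"
            )
        drives[ctrl.name] = info
    return drives


for name, info in list_nvme().items():
    print(name, info["model"], info["firmware_rev"])
```

The controller names it prints (nvme0, nvme1, ...) are the same ones that appear in the kernel log, so you can tell which drive the "controller is down" message refers to.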

Thank you, I’ve reported it to Ubuntu’s bug tracker here:

I have in fact 2 drives:

  • WD SN850X - this holds Ubuntu
  • Corsair MP600

Controllers are very different between these two SSDs:

  • WD in-house 20-82-20035-B1 for the SN850X 1TB
  • Phison PS5018-E18 for the MP600 Pro 1TB

So it’s not related to the controller of the drives, maybe more to your motherboard or hardware drivers.

Well, as I mentioned above, this started with the jump from kernel 6.8.0-31 to 6.8.0-35 (not actually 36, as I found when I investigated further for the Launchpad report). For now I’ve apt-held 6.8.0-31 and I’m booting from it, otherwise I just can’t get any work done.
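For anyone wanting to do the same, here is a sketch of what the hold could look like. The exact package list is an assumption (the usual Ubuntu naming for a generic kernel, not necessarily everything that is installed), and hold_command is a hypothetical helper that only builds the command instead of running it:

```python
import shlex


def hold_command(kernel_release):
    """Build an apt-mark command that holds the packages of one kernel release."""
    # Assumed package names, following Ubuntu's usual per-release naming.
    packages = [
        f"linux-image-{kernel_release}",
        f"linux-modules-{kernel_release}",
        f"linux-modules-extra-{kernel_release}",
        f"linux-headers-{kernel_release}",
    ]
    return ["sudo", "apt-mark", "hold", *packages]


# Print the command for the known-good kernel so it can be reviewed first.
print(shlex.join(hold_command("6.8.0-31-generic")))
```

apt-mark showhold lists what is currently held. A hold only keeps those packages from being upgraded or removed; the old kernel still has to be picked at boot.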

The changelog is rather large; it looks like a few upstream 6.8 point releases were synced into the Ubuntu kernel. Nothing obvious (to me) stands out as a potential culprit:

https://bugs.launchpad.net/ubuntu/+source/linux/6.8.0-35.35

There are a few NVMe-specific mentions. A more trained eye might see something obvious that I’ve surely missed.

I just had a similar problem with Ubuntu 24.04.3 LTS. The computer was left idle and (presumably) went to sleep, but then I noticed the power light had gone off. I turned it back on and looked at /var/log/kern.log, but no kernel panic was reported there. The last error message before it died was:

kernel: audit: type=1107 audit(1769531342.607:4569): pid=1113 uid=101 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal" bus="system" path="/org/freedesktop/login1" interface="org.freedesktop.login1.Manager" member="PrepareForShutdown" name=":1.12" mask="receive" pid=8990 label="snap.firefox.firefox" peer_pid=1143 peer_label="unconfined"

Does that mean apparmor shut the system down abruptly because it thought firefox had loaded some malicious code, or is it unrelated? But hopefully this won’t happen again …

AppArmor doesn’t do that. It merely blocks and logs the events. Whatever led to the laptop shutting down must have been something different.
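To make the record easier to read, here is a rough Python sketch that pulls out the timestamp and the quoted key/value pairs. The function name parse_audit is made up, and the sample is a shortened copy of the line quoted above:

```python
import re
from datetime import datetime, timezone


def parse_audit(line):
    """Return (UTC datetime, dict of quoted fields) for one audit record."""
    stamp = re.search(r"audit\((\d+)\.(\d+):(\d+)\)", line)
    if stamp is None:
        raise ValueError("no audit(...) timestamp found")
    when = datetime.fromtimestamp(int(stamp.group(1)), tz=timezone.utc)
    # Accept straight and curly quotes, since forum software often
    # converts one into the other when a log line is pasted.
    fields = dict(re.findall(r'(\w+)=["“]([^"”]*)["”]', line))
    return when, fields


# Shortened copy of the record from the post above.
sample = (
    "kernel: audit: type=1107 audit(1769531342.607:4569): pid=1113 "
    "msg='apparmor=\"DENIED\" operation=\"dbus_signal\" "
    "member=\"PrepareForShutdown\" mask=\"receive\" "
    "label=\"snap.firefox.firefox\""
)
when, fields = parse_audit(sample)
print(when.isoformat())
print(fields["apparmor"], fields["operation"], fields["mask"], fields["label"])
```

Read that way, the record says the Firefox snap was denied permission to receive the PrepareForShutdown D-Bus signal from login1. That signal announces a shutdown that is already under way, so the entry is a side effect of the shutdown rather than its cause.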

If you’re lucky there may still be some EC logs in the buffer. Running ectool console may be more useful for debugging such an event. The buffer is quite small, though.

Seeing as how the computer seemed to shut itself down, I doubt there’s anything still in RAM. Installed coreboot-utils but the ectool therein doesn’t have a console mode? So I hope this isn’t a harbinger of a hardware failure …

It’s not in RAM. EC stands for Embedded Controller; it’s a low-power “microcomputer” that works alongside the main processor and manages the system in cooperation with the operating system. Whatever shut your system down might have been logged there.

If the EC also rebooted, that may indicate a deeper problem (a reboot of the EC also shuts the main system down).

Still, this is a long shot.

The commands “sudo ectool -i” and “sudo ectool -d” just print “You have to be root”; what do I have to do, boot into single user mode to run it? But if you say it’s a long shot, I’m not going to bother.