Nvme0: controller is down; will reset

I’ve had my Framework for a couple of weeks and I’m generally pleased, except for random crashes: sometimes three or four a day, sometimes none. There is nothing in syslog after a crash; however, if I tail the syslog, I see this message from the kernel at the point where the system freezes and then reboots (I guess it can write to the screen but not to disk at that point):

nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
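(For reference, since nothing survives in syslog after the reboot, I was watching the kernel log live with something along these lines:)

```
# Follow kernel messages as they arrive; the nvme error shows up
# here right before the freeze
sudo dmesg -w

# or, equivalently on systemd-based distros:
journalctl -kf
```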

I’ve installed a SMART monitoring tool and it reports that the SSD is fine, so I’m wondering whether this could be a controller or motherboard issue?
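For anyone wanting to run the same checks, this is roughly what I used (smartmontools and nvme-cli; /dev/nvme0 is an assumption, adjust to your device):

```
# SMART/health summary for the NVMe drive (device name assumed)
sudo smartctl -a /dev/nvme0

# NVMe-native health log as a cross-check
sudo nvme smart-log /dev/nvme0
```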

BIOS is 3.07; the SSD is a WD SN850 500 GB.


There was a firmware update for the drive, in case you hadn’t applied it. It requires Windows, FYI. Hopefully you are able to get to the bottom of the issue. I am using a 2 TB SN850, running Manjaro (previously Pop!_OS), and fortunately have not run into problems. Please do let the community know if you manage to resolve it.

The drive firmware fix is related to the drive’s power-saving/sleep states. It appears that the drive goes into a power-saving mode and then doesn’t wake up in time for Linux to read from or write to it; Linux concludes there’s a problem, marks the drive as failed, and refuses to work with it anymore, even after the drive is “up” again. A power cycle fixes this, but it’s only a matter of time before it happens again.
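If you want to check whether your drive is actually using these autonomous power states (APST), something along these lines should show it; nvme-cli assumed, and /dev/nvme0 is a placeholder:

```
# Dump the Autonomous Power State Transition feature (feature ID 0x0c),
# human-readable; shows whether APST is enabled and its transition table
sudo nvme get-feature /dev/nvme0 -f 0x0c -H

# The power states the drive advertises, including entry/exit latencies
sudo nvme id-ctrl /dev/nvme0 | grep "^ps"
```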

On some drives this is simply broken and cannot be fixed; there are reports of the Crucial P5 just not working at all. Then there are reports where new firmware fixes it, as with the WD SN850.

This is supposition on my part, but there’s definitely a pattern emerging. The fact that this can be “cured” with a firmware update suggests that the power management/sleep states can be too aggressive, and that the timing and responsiveness after the drive goes to sleep need adjusting. The drive manufacturer has to be willing to release new firmware, and if it’s needed only for the Framework laptop, the manufacturer may be unwilling to do so.


I have had a similar problem with some Intel 660p SSDs (not in a Framework; I’m in Batch 2 for the 12th gen), and I’ve had to relegate them to non-boot drives as a result.

Is there perhaps a way to tune the kernel to have a much longer timeout on the drive?
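For what it’s worth, the kernel does appear to expose knobs along these lines, though I haven’t tested whether they actually help here:

```
# Current NVMe I/O timeout in seconds (the default is 30)
cat /sys/module/nvme_core/parameters/io_timeout

# Candidate boot parameters (added to the kernel command line):
#   nvme_core.io_timeout=255                a much longer I/O timeout
#   nvme_core.default_ps_max_latency_us=0   effectively disables APST
```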

Thanks for the suggestions. I will see if I can work out a way to get the firmware updated on the drive.

I’m not sure - that would be good, wouldn’t it?

How the OS behaves when a drive goes “bad” is set in /etc/fstab. I wonder if a time delay can be set there, but I think fstab only dictates how the drive is handled at mount time.
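Something like this is the only timeout-ish knob I know of in fstab, and it only affects mounting (the UUID and mount point here are made up):

```
# /etc/fstab: wait up to 30s for the device at mount time, and don't
# hang the boot forever if it never appears (UUID is a placeholder)
UUID=0123abcd-...  /data  ext4  defaults,nofail,x-systemd.device-timeout=30s  0  2
```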

I was under the impression that systemd is Borg’ing the fstab. :smile:

Semi-jokes aside, I was also under the impression that fstab only deals with mounts, not the block devices they come from. In my case it’s the actual block device that disappears.

I wouldn’t doubt that systemd can/will/does override fstab. I haven’t seen it yet, though; when I make changes to fstab, they do take effect.

You’re right, though: it only uses fstab for mounting, so only at boot or whenever the drive is mounted. And yes, not on block devices, only on partitions/UUIDs.

So it’ll have to be something else. Hopefully someone else chimes in; I’m not enough of a Linux expert to take this further, sorry.

Just to add that I managed to update the drive firmware after booting Windows from another drive, but the issue is still recurring, same as before.

I am also seeing this now on my machine. I rolled my installation back to where it was about a week ago and, interestingly, it seems stable so far. I plan on leaving the system as-is for a few days to confirm, then I’ll re-apply updates and see what happens.

I have been running into this error ever since upgrading my Dell XPS 9560 with a 2 TB WD_BLACK SN770 SSD, but somehow the issue seems to have become far more frequent over the past few weeks.

I have updated the firmware through Windows To Go, but the issue still occurs. The current firmware revision is 731100WD. This issue did not occur with the stock SSD that came with the laptop.
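(In case anyone wants to check their own revision, this is the kind of thing I used; nvme-cli assumed:)

```
# Lists NVMe drives with model, serial and firmware revision columns
sudo nvme list

# Or just the firmware revision field ("fr")
sudo nvme id-ctrl /dev/nvme0 | grep "^fr "
```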

I have the exact same model of SSD in my desktop (ASUS X570) and the error does not surface there. I had the SSD tested at my local computer store and they could not find anything wrong with it. They are likely using Windows test benches, on which this issue may simply not surface.

Apologies if this reply is not appropriate, since I do not have a Framework laptop to test on, but this seems to be a common issue with Linux and certain combinations of motherboard and SSD model. I have tried the pcie_aspm=off and nvme_core=... kernel settings, but the problem is not solved. I have even tried downgrading my BIOS. I am hoping for a fix.

Running NixOS unstable with Linux 6.2.11
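For reference, the kernel command line additions I tried looked roughly like this; the nvme_core knob most often suggested is the APST latency one, and I’m quoting it from memory, so treat the exact parameter as an assumption:

```
# Added to the kernel command line (via boot.kernelParams on NixOS,
# or GRUB_CMDLINE_LINUX in /etc/default/grub elsewhere); the nvme_core
# parameter name is from memory:
#   pcie_aspm=off nvme_core.default_ps_max_latency_us=0

# Confirm they took effect after a reboot:
cat /proc/cmdline
```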


@Lyndon_Sanche In my experience this is caused by a bad NVMe drive; RMA it if still possible.

@Anachron honestly that’s what I was hoping for.

I took it to my local computer store, as they have an in-store RMA/swap service that I paid for when I bought the drive. They ran a bunch of tests and couldn’t find anything wrong with it.

It’s outside of the normal exchange window unfortunately.

Sure; depending on what kind of tests they ran, they wouldn’t necessarily be able to trigger it themselves. Did you ask them whether they were able to reproduce the issue on Linux?

What I can believe is that Windows and macOS behave differently when the controller shuts down or becomes unresponsive. Maybe Windows/macOS wait longer for it to respond, and that is actually what needs to happen, because the NVMe controller needs more time to reset and respond.

Either way, this shouldn’t happen with a working, functional NVMe drive.

My advice: get a new one and check whether the same thing happens. Get one from a different vendor and model too, to make sure you’re not being hit by a bad production batch.

Also, why would you live with this issue for so long that you are out of warranty while it is still being triggered? I would have requested a new drive much earlier; this issue renders the drive unusable.

Might try that. I also have the exact same model in my desktop and it isn’t exhibiting these problems. I could try swapping them.

The thing is, when I first got the drive, my laptop would intermittently freeze while booting, though not very often. So I just thought “oh, that’s weird” and moved on.

It seems to have gotten a lot more frequent as of late, to the point where it is actually hindering my use of the device.