Nvme0: controller is down; will reset

I’ve had my Framework for a couple of weeks and I’m generally pleased, except for random crashes: sometimes three or four a day, sometimes none. There is nothing in syslog after a crash; however, if I tail the syslog, I see this message from the kernel at the point where the system freezes and then reboots (I guess it can write to the screen but not to disk at that point):

nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
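(For reference, since nothing survives in syslog after the reboot, I was watching the kernel log live with something along these lines:)

```
# Follow kernel messages as they arrive; the nvme error shows up
# here right before the freeze
sudo dmesg -w

# or, equivalently on systemd-based distros:
journalctl -kf
```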

I’ve installed a SMART monitoring tool and it reports that the SSD is fine, so I’m wondering whether this could be a controller or motherboard issue?
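For anyone wanting to run the same checks, this is roughly what I used (smartmontools and nvme-cli; /dev/nvme0 is an assumption, adjust to your device):

```
# SMART/health summary for the NVMe drive (device name assumed)
sudo smartctl -a /dev/nvme0

# NVMe-native health log as a cross-check
sudo nvme smart-log /dev/nvme0
```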

BIOS is 3.07; the SSD is a WD SN850 500 GB.


There was a firmware update for the drive, in case you hadn’t applied it. It requires Windows, FYI. Hopefully you are able to get to the bottom of the issue. I am using a 2 TB SN850, running Manjaro (previously Pop!_OS), and fortunately have not run into problems. Please do let the community know if you manage to resolve it.

The drive firmware fix is related to the drive’s power-saving/sleep states. It appears that the drive goes into a power-saving mode and then doesn’t wake up in time for Linux to read from or write to it; Linux concludes there’s a problem, marks the drive as failed, and refuses to work with it anymore, even after the drive is “up” again. A power cycle fixes this, but it’s only a matter of time before it happens again.
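If you want to check whether your drive is actually using these autonomous power states (APST), something along these lines should show it; nvme-cli assumed, and /dev/nvme0 is a placeholder:

```
# Dump the Autonomous Power State Transition feature (feature ID 0x0c),
# human-readable; shows whether APST is enabled and its transition table
sudo nvme get-feature /dev/nvme0 -f 0x0c -H

# The power states the drive advertises, including entry/exit latencies
sudo nvme id-ctrl /dev/nvme0 | grep "^ps"
```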

On some drives this is simply broken and cannot be fixed; there are reports of the Crucial P5 just not working at all. Then there are reports where new firmware fixes it, as with the WD SN850.

This is supposition on my part, but there’s definitely a pattern emerging. The fact that this can be “cured” with a firmware update suggests that the power management/sleep states can be too aggressive, and that the timing and responsiveness after the drive goes to sleep need adjusting. The drive manufacturer has to be willing to release new firmware, and if it’s needed only for the Framework laptop, the manufacturer may be unwilling to do so.


I have had a similar problem with some Intel 660p SSDs (not in a Framework; I’m in Batch 2 for the 12th gen), and I’ve had to relegate them to non-boot drives as a result.

Is there perhaps a way to tune the kernel to have a much longer timeout on the drive?
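For what it’s worth, the kernel does appear to expose knobs along these lines, though I haven’t tested whether they actually help here:

```
# Current NVMe I/O timeout in seconds (the default is 30)
cat /sys/module/nvme_core/parameters/io_timeout

# Candidate boot parameters (added to the kernel command line):
#   nvme_core.io_timeout=255                a much longer I/O timeout
#   nvme_core.default_ps_max_latency_us=0   effectively disables APST
```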

Thanks for the suggestions. I will see if I can work out a way to get the firmware updated on the drive.

I’m not sure - that would be good, wouldn’t it?

How the OS behaves when a drive goes “bad” is set in /etc/fstab. I wonder if a time delay can be set there, but I think fstab only dictates how the drive is handled at mount time.
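Something like this is the only timeout-ish knob I know of in fstab, and it only affects mounting (the UUID and mount point here are made up):

```
# /etc/fstab: wait up to 30s for the device at mount time, and don't
# hang the boot forever if it never appears (UUID is a placeholder)
UUID=0123abcd-...  /data  ext4  defaults,nofail,x-systemd.device-timeout=30s  0  2
```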

I was under the impression that systemd is Borg’ing the fstab. :smile:

Semi-jokes aside, I was also under the impression that fstab only deals with mounts, not the block devices they come from. In my case it’s the actual block device that disappears.

I wouldn’t doubt that systemd can/will/does override fstab. I haven’t seen it yet, though; when I make changes to fstab, they do take effect.

You’re right, though: it only uses fstab for mounting, so only at boot or whenever the drive is mounted. And yes, not on block devices, only on partitions/UUIDs.

So it’ll have to be something else. Hopefully someone else chimes in; I’m not enough of a Linux expert to take this further, sorry.

Just to add that I managed to update the drive firmware after booting Windows from another drive, but the issue is still recurring, same as before.

I am also seeing this now on my machine. I rolled my installation back to where it was about a week ago and, interestingly, it seems stable so far. I plan on leaving the system as-is for a few days to confirm, then I’ll re-apply updates and see what happens.

I have been running into this error ever since upgrading my Dell XPS 9560 with a 2 TB WD_BLACK SN770 SSD, but somehow the issue seems to have become far more frequent over the past few weeks.

I have updated the firmware through Windows To Go, but the issue still occurs. The current firmware revision is 731100WD. This issue did not occur with the stock SSD that came with the laptop.
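(In case anyone wants to check their own revision, this is the kind of thing I used; nvme-cli assumed:)

```
# Lists NVMe drives with model, serial and firmware revision columns
sudo nvme list

# Or just the firmware revision field ("fr")
sudo nvme id-ctrl /dev/nvme0 | grep "^fr "
```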

I have the exact same model of SSD in my desktop (ASUS X570) and the error does not surface there. I had the SSD tested at my local computer store and they could not find anything wrong with it. They are likely using Windows test benches, on which this issue may simply not surface.

Apologies if this reply is not appropriate, since I do not have a Framework laptop to test on, but this seems to be a common issue with Linux and certain combinations of motherboard and SSD model. I have tried the pcie_aspm=off and nvme_core=... kernel settings, but the problem is not solved. I have even tried downgrading my BIOS. I am hoping for a fix.

Running NixOS unstable with Linux 6.2.11
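For reference, the kernel command line additions I tried looked roughly like this; the nvme_core knob most often suggested is the APST latency one, and I’m quoting it from memory, so treat the exact parameter as an assumption:

```
# Added to the kernel command line (via boot.kernelParams on NixOS,
# or GRUB_CMDLINE_LINUX in /etc/default/grub elsewhere); the nvme_core
# parameter name is from memory:
#   pcie_aspm=off nvme_core.default_ps_max_latency_us=0

# Confirm they took effect after a reboot:
cat /proc/cmdline
```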


@Lyndon_Sanche In my experience this is caused by a bad NVMe drive; RMA it if still possible.

@Anachron honestly that’s what I was hoping for.

I took it to my local computer store, as they have an in-store RMA/swap service that I paid for when I bought the drive. They ran a bunch of tests and couldn’t find anything wrong with it.

It’s outside of the normal exchange window unfortunately.

Sure; depending on what kind of tests they ran, they wouldn’t necessarily be able to trigger it themselves. Did you ask them whether they were able to reproduce the issue on Linux?

What I can believe is that Windows and macOS behave differently when the controller shuts down or becomes unresponsive. Maybe Windows/macOS wait longer for it to respond, and that is actually what needs to happen, because the NVMe controller needs more time to reset and respond.

Either way, this shouldn’t happen with a working, functional NVMe drive.

My advice: get a new one and check whether the same thing happens. Get one from a different vendor and model too, to make sure you’re not being hit by a bad production batch.

Also, why would you live with this issue for so long that you are out of warranty while it is still being triggered? I would have requested a new drive much earlier; this issue renders the drive unusable.

Might try that. I also have the exact same model in my desktop and it isn’t exhibiting these problems. I could try swapping them.

The thing is, when I first got the drive, my laptop would intermittently freeze while booting, though not very often. So I just thought “oh, that’s weird” and moved on.

It seems to have gotten a lot more frequent as of late, to the point where it is actually hindering my use of the device.