[TRACKING] DIY edition Ubuntu - Filesystem in Readonly mode

Anachron · February 26, 2023, 9:04pm

Well it’s right in the messages:

nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

The NVME controller should not shut down and be unresponsive to kernel actions.
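
For anyone who wants to check their own logs for the same message, here is a minimal sketch. It assumes the kernel log is readable via `journalctl -k`; the function names `extract_csts` and `classify_csts` are made up for this example. A CSTS value of 0xffffffff usually means the register read itself came back as all ones, which is what a PCIe read tends to look like when the device no longer answers.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of nvme-cli or the kernel.

# Read kernel log lines on stdin and print the CSTS value from each
# "controller is down" message.
extract_csts() {
    sed -n 's/.*controller is down; will reset: CSTS=\(0x[0-9a-fA-F]*\).*/\1/p'
}

# Classify a CSTS value. All ones suggests the read never reached the
# controller (assumption: the device dropped off the bus or is in a
# power state it did not wake from).
classify_csts() {
    case "$1" in
        0x[fF][fF][fF][fF][fF][fF][fF][fF]) echo "unreachable" ;;
        *)                                  echo "responding" ;;
    esac
}

# Usage (may need root or membership in a log-reading group):
#   journalctl -k --no-pager | extract_csts | while read -r v; do
#       classify_csts "$v"
#   done
```

If every hit is classified as "unreachable", that matches what is described in this thread: the controller is not reporting an error, it is simply not answering.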

Your hack is probably keeping some timers on the NVME low, which prevents the controller from going into its idle state when there has been no read for X seconds.

Edit: Also, probably because I had the same issue (same error), exchanged my drive for a new one, and did not have this error again on the exact same setup. (I cloned my old drive to the new one.)

Matt_Hartley · February 28, 2023, 12:14am

Agreed, yes, this is a bad nvme drive. As others have pointed out, the controller reset error is telling.

If you purchased the nvme drive from Framework, please contact support for help there.
Linking to this post will help speed up the process as they’ll be able to see this isn’t a Linux issue, rather, it’s a bad drive.

Frederic_Van_Espen · March 10, 2023, 7:39pm

I don’t have a Framework device, but I did just recently (yesterday) purchase 2 WD Black SN770 SSDs. I’m seeing the same problem with both of them. I actually have these in a ZFS mirrored pool and they only work … briefly …. As soon as I start reading too much data they stop working and I get the same errors as @Ronan_McHugh.

Mind you, it’s not impossible that both of these are already dead after one day, but that feels like a long shot. I’m still troubleshooting them though. I happen to have an M.2 Thunderbolt enclosure, so I can test them on the different kinds of hardware I have at my disposal.

nadb · March 10, 2023, 8:13pm

What is the temperature of the drives when this happens? They are even more likely to overheat in a ZFS pool. Also, is the firmware the latest, and were they purchased at the same time from the same vendor? If the answer to that last question is yes, it increases the likelihood that both are in fact broken, as they may be from the same lot.
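
A rough way to answer the temperature question without extra tools is sketched below. It assumes a kernel that exposes NVMe sensors through hwmon (sensor name `nvme`, value in `temp1_input`); the function name `nvme_temps` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<hwmon entry> <temperature in °C>" for each
# NVMe sensor found under the hwmon root.
# $1: hwmon root, defaults to /sys/class/hwmon
nvme_temps() {
    local root="${1:-/sys/class/hwmon}" dir milli
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] || continue
        # Only look at sensors registered by the NVMe driver
        [ "$(cat "$dir/name")" = "nvme" ] || continue
        [ -r "$dir/temp1_input" ] || continue
        milli=$(cat "$dir/temp1_input")
        # temp1_input is reported in millidegrees Celsius
        echo "$(basename "$dir") $((milli / 1000))"
    done
}

# Usage: run it while the pool is under load, e.g. once per second:
#   while sleep 1; do nvme_temps; done
```

Logging this during a heavy read would show whether the drives are anywhere near throttling temperatures when the controller drops out.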

Regardless, I have seen enough issues with these drives, both here and elsewhere, to dissuade me from even thinking about buying them.

Fraoch · March 10, 2023, 10:27pm

When I was building my ZFS server (FreeNAS) which was 10 years ago now (wow), there were reports of drives dropping out. (Mechanical drives back then).

ZFS expects a drive in a pool to respond quickly, otherwise it marks it as unresponsive and drops it from the pool. Drives with power saving features that sleep when idle and don’t wake up quickly enough were especially problematic.

More evidence that these drives have overly aggressive sleep settings and don’t wake up quickly enough. This can be fixed in firmware but needs the manufacturer’s support.

That said, the two WD SN850s I have, including one in my Framework laptop, haven’t had any issues. But the latest/last firmware update for the SN850 seemed to address this issue.

Frederic_Van_Espen · March 11, 2023, 9:19am

@nadb I did not check the temperatures on the drives, but both are fitted with fairly large heatsinks and those feel lukewarm at best.

When the drives came they were formatted with 512 byte sectors. When I used them then they were completely fine. I then reformatted them with 4096 byte sectors, since that’s what they use internally anyway:

:~# nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
:~# nvme format --lbaf=1 /dev/nvme0n1

At first that seemed to work fine so I moved everything over to it, but then the issue started happening.
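
To double-check which LBA format a namespace is actually using before and after a reformat, the `id-ns` output can be filtered. This is only a sketch that assumes the output layout shown in the listing in this post; `in_use_lbaf` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvme id-ns -H` output on stdin and print the
# index and data size of the LBA format marked "(in use)".
in_use_lbaf() {
    awk '/LBA Format/ && /\(in use\)/ {
        size = ""
        # Find the value that follows the words "Data Size:"
        for (i = 2; i < NF; i++)
            if ($(i-1) == "Data" && $i == "Size:") size = $(i+1)
        print "lbaf=" $3, "data_size=" size
    }'
}

# Usage (as root):
#   nvme id-ns -H /dev/nvme0n1 | in_use_lbaf
```

Running it on both drives of a mirror is a quick way to confirm that they ended up with the same sector size.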

As far as troubleshooting goes, it seems to be some issue related to nvme and zfs. I tested on 2 different systems (one desktop with the drives plugged directly into the mainboard, and one laptop with the drive connected to a thunderbolt nvme enclosure).
I could run a dd like this without any problem:

dd if=/dev/nvme0n1 of=/dev/null bs=4M

This would read the full disk without issues. So I dumped the full drive to an external disk. As far as I can tell, that means the drive is at least not completely dead. As soon as I import the pool and try to zfs send, the system hangs for a bit and then suspends the pool (with the kernel messages above).

So, I reverted back to the 512 byte sectors now:

:~# nvme format --lbaf=0 /dev/nvme0n1

I recreated the pool and imported from the dumped dd data. Both drives run happily now.

Now, I don’t know what the actual cause of this is, but it seems like a firmware issue, right?

Frederic_Van_Espen · March 11, 2023, 9:50am

I created a case at WD with details on how to reproduce. We’ll see how they respond.

Matt_Hartley · March 13, 2023, 7:21pm

Appreciate your sharing your experience, but noting this to the thread so there is no confusion for anyone happening upon this thread:

I am on a 770 right now, not seeing issues on my own system at this time.

Iann_C · March 13, 2023, 8:52pm

My 2 cents. Running Ubuntu 22 since September 22 on the FW. Working great.
I bought a Samsung SSD. As many say on this forum, Western Digital drives are full of bugs.
Wondering why FW sells them, by the way …

Anachron · March 13, 2023, 9:24pm

I can also only recommend getting a different SSD from another brand. NVMEs are not very expensive nowadays and cloning a drive to another can be done in hours.

(I went with the Samsung 990 Pro, very satisfied!)

Ronan_McHugh · March 14, 2023, 8:24pm

Thanks for the advice. Unfortunately Framework will only replace with the same part so I’m getting the same model NVME delivered. I’ll install that when it arrives and see if the problem recurs. If so, I guess I’ll have to follow your advice and buy another model.

Matt_Hartley · March 15, 2023, 12:39am

Please do keep us posted. If it happens again, please walk me through the steps you took to arrive at a read only state.

Anachron · March 15, 2023, 7:07am

You could do as I did:
Buy a non-WD drive, try it in your setup. If it works, send back the WD NVME to Framework to get the money back. If it doesn’t work, send both drives back and wait for Framework to send you a new one.

I use a few WD hard drives in my NAS and they all work very well; it just seems that the NVMEs are hit or miss.

spam · March 20, 2023, 8:52pm

I just wanted to +1 this topic. I am seeing that exact same behaviour on Arch Linux with a DIY 12th Gen Framework.

My specific configuration:

System: Intel® Core™ i7-1260P
Storage: 1TB - WD_BLACK™ SN770 NVMe™
Memory: 32GB (2 x 16GB) DDR4-3200

I tried the recommended flags in the journalctl entries, more specifically:

However these flags did not remedy the issue for me. I also looked at the firmware links and as far as I can tell my SSD already has the latest firmware installed.

After days of frequent and random read-only file system situations, yesterday I reinstalled my system on a Samsung 980 NVMe SSD from another laptop. So far, I have not experienced the problem and will report if I do.

Assuming changing out the bundled SSD is the ultimate solution to this problem I would be curious as to whether the original NVMe can be warrantied.

If there is any more specific information I can provide or testing I can perform, I would be more than happy to cooperate.

Outside of these issues my experience with the laptop has been great.

Anachron · March 21, 2023, 6:48am

Like I said elsewhere, I couldn’t get my WD NVME to stop randomly disconnecting, so I went with a 990 Pro, which has never had this issue. I’m well beyond 4 weeks at this point.

I think it’s a faulty firmware/hardware combination on the WD side.

Returned the WD drive to Framework after a few weeks of testing (to see whether the drive is really bad or it’s my setup). Got my money back for it.

Quentin_Aymard · December 19, 2023, 3:25pm

Hi there, kinda digging out this subject but one of our colleagues currently testing a Framework Laptop 13 (Intel 13th Gen) is experiencing similar issues with a 1TB WD_BLACK SN770 and Ubuntu 22.04. Ubuntu is fully up to date and all available firmware updates are installed. SN770’s firmware version is 731100WD, which seems to be the latest for this model.
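
For anyone comparing firmware versions across machines, the revision can usually be read straight from sysfs. This is a sketch assuming the kernel exposes the `model` and `firmware_rev` attributes under /sys/class/nvme; `nvme_firmware` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<controller>: <model> (firmware <rev>)" for
# each NVMe controller.
# $1: sysfs root, defaults to /sys/class/nvme
nvme_firmware() {
    local root="${1:-/sys/class/nvme}" ctrl model fw
    for ctrl in "$root"/nvme*; do
        [ -r "$ctrl/firmware_rev" ] || continue
        [ -r "$ctrl/model" ] || continue
        # These fields are padded with trailing spaces; strip them
        model=$(sed 's/[[:space:]]*$//' "$ctrl/model")
        fw=$(sed 's/[[:space:]]*$//' "$ctrl/firmware_rev")
        echo "$(basename "$ctrl"): $model (firmware $fw)"
    done
}

# Usage:
#   nvme_firmware
```

Posting that one line together with the kernel version makes it easier to compare reports from affected and unaffected machines.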

He experiences read-only error/crashes kinda randomly, but always when he is away from his laptop for a while. I suspected it had something to do with auto sleep mode. We tried disabling all power saving features for now, in order to confirm this hypothesis.
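
One thing worth verifying is whether the drive's own autonomous power state transitions (APST) are among the features that ended up disabled. The sketch below assumes the `nvme_core` module parameter `default_ps_max_latency_us` is what governs APST on the kernel in use (a value of 0 disables it); `apst_status` is a made-up helper name, and whether APST is actually involved in these crashes is an open assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the kernel currently allows NVMe APST.
# $1: parameter file, defaults to the nvme_core module parameter in sysfs
apst_status() {
    local f="${1:-/sys/module/nvme_core/parameters/default_ps_max_latency_us}" v
    if [ ! -r "$f" ]; then
        echo "unknown (cannot read $f)"
        return 1
    fi
    v=$(cat "$f")
    if [ "$v" -eq 0 ]; then
        # A maximum latency of 0 means no power-saving state is acceptable
        echo "APST disabled (max latency 0 us)"
    else
        echo "APST allowed (max latency $v us)"
    fi
}

# Usage:
#   apst_status
```

If it reports "APST allowed", the drive can still enter its low-power states on its own, regardless of the desktop's sleep settings.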

I also did a bit of googling (and here I am) and saw a few forum posts here and there suggesting that WD is reaaaaaaaally not the recommended hardware vendor for any kind of Linux installation, as they do not support LVFS updates, do not provide firmware updating tools, and overall appear to have very bad compatibility with the fstrim calls made by Ubuntu (and others?).

We have 2 physically identical other laptops where the problem does not appear. One is running Windows 11, the other is running Debian 12. Fingers crossed so far.

Any future Framework test laptop or volume order will certainly avoid SN770, and avoid WD altogether.

So far, still no solution to provide to this tracking topic, but the problem is still out there on 13th gen.

Ronan_McHugh · December 19, 2023, 5:03pm

btw, I forgot to update here. I installed the replacement hard-drive and haven’t had any issues since. Thanks everyone for the support and advice.