Error on the NVMe disk when installing Arch Linux

Breizh · March 4, 2024, 6:49pm

Hello,

I’m trying to install Arch Linux on my Framework 16, but after the installation, when I boot into the newly installed system, it fails to mount the EFI partition on /boot, and the root filesystem (Btrfs) switches to read-only.

On the occasions when I’ve managed to get a shell despite the errors, I’ve noticed an error similar to the one mentioned in this topic: Nvme0: controller is down; will reset

I tried the nvme_core.default_ps_max_latency_us=0 workaround, but it didn’t change anything.
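Before ruling the workaround out, it may be worth confirming that the parameter actually reached the kernel. Below is a minimal Python sketch, assuming a standard Linux /proc and /sys layout. The helper name `parse_cmdline` is hypothetical, and the simple whitespace split does not handle quoted values containing spaces.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"
CMDLINE = Path("/proc/cmdline")
# Assumed to exist only when nvme_core is loaded or built in.
SYSFS = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    # Value the bootloader passed to the running kernel, if any.
    if CMDLINE.exists():
        print("cmdline value:", parse_cmdline(CMDLINE.read_text()).get(PARAM))
    # Value the nvme_core module reports as currently in effect.
    if SYSFS.exists():
        print("effective value:", SYSFS.read_text().strip())
```

If the first line prints None, the bootloader configuration was probably not regenerated after the parameter was added.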

I tried to install Fedora since it’s officially supported, and I get errors on boot too, but I couldn’t get a shell, so I don’t know the exact cause.

I tried other systems, like Ubuntu 22.04 LTS, and it didn’t seem to have this problem. Neither did Linux Mint. Even Fedora XFCE worked correctly.

And when I’m using a live USB (even the Arch one), I can access and use the disks normally, the installation doesn’t cause any problems, and after chrooting into the installed system, everything works correctly.

So my guess is that it’s linked to the kernel or a kernel parameter, but under Arch I have the same problem even with the LTS kernel, and I can’t find any other option that could be linked to this. Maybe it’s because of Btrfs (since both my Arch and Fedora setups are using it), but Linux Mint on Btrfs worked well.

Both NVMe drives are affected. The 2280 is a WD SN770; I’ve updated its firmware, which didn’t change anything. The 2230 is a WD SN740, for which there’s no firmware update available. I bought both of them with the computer; they are the ones sold by Framework.

I’m starting to despair a bit. Have other people encountered this problem, or does anyone know what else I can try, either to fix it or to get more information on the problem? (I can’t access the system if it fails at boot, and nothing useful gets logged since the filesystem is read-only.)

I’ll try to install Arch on an ext4 filesystem, just in case. But since the problem seems to be random, if it’s indeed a power-saving mode problem, maybe the setups that were working were only working by chance, and the only flaw in Btrfs is that it triggers the problem almost systematically, instead of only once in a while…

RandomRanger · March 4, 2024, 6:53pm

I can say I’ve had no issues like this on my Arch install. I simply dropped the SSD in from the old laptop and it worked (after disabling secure boot). Idk if that’s worth anything to you, but I figured I’d share.

Breizh · March 4, 2024, 7:55pm

I installed Arch on ext4, on an LVM RAID1, it seems to be working properly for now… I’ll try using mainly this PC over the next few days to see if it actually solves the problem or just reduces its frequency of occurrence.

James3 · March 4, 2024, 9:42pm

Hi @Breizh

This error generally means a hardware problem.
I would suggest trying to re-seat the NVMe card in its slot.
Also use smartctl -a /dev/nvmeXXXX (whatever your NVMe device name is) to see if it has logged any errors.
It might indicate that the NVMe device is about to fail, so back up all your data ASAP.
FYI, I think the 2230 is the “nvme0” device. There have been reports of problems with the 2230 slot, but a BIOS update is supposed to have fixed that. Check if you have the most up-to-date BIOS installed.
Note: You say both are affected, but only mention an error message for nvme0.
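
For anyone who wants to check the SMART counters on both drives repeatedly, here is a rough Python sketch built on smartctl's JSON output (the -j option, which needs smartmontools 7.0 or newer). The function names are hypothetical, and the JSON key names are assumptions based on typical smartctl output for NVMe devices, so verify them against your own output.

```python
import json
import subprocess
import sys


def nvme_health_flags(report: dict) -> dict:
    """Pick the error-related counters out of a parsed `smartctl -j -a` report."""
    # Key names below are assumed from typical smartctl NVMe JSON output.
    log = report.get("nvme_smart_health_information_log", {})
    return {
        "critical_warning": log.get("critical_warning"),
        "media_errors": log.get("media_errors"),
        "num_err_log_entries": log.get("num_err_log_entries"),
    }


def read_smart(device: str) -> dict:
    """Run smartctl on a device and parse its JSON report (usually needs root)."""
    out = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(out.stdout)


if __name__ == "__main__":
    # Usage: python3 check_smart.py /dev/nvme0 /dev/nvme1
    for dev in sys.argv[1:]:
        print(dev, nvme_health_flags(read_smart(dev)))
```

Non-zero values in any of these counters would point toward the drive itself rather than the slot or firmware.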

RandomRanger · March 4, 2024, 9:53pm

I was on the BIOS page the other day and it listed “no bios updates available” so if that was fixed after whatever version the units are shipping with, that fix may not have been posted yet.

James3 · March 4, 2024, 9:58pm

The mention of the 2230 was in the "Sixth update on Framework Laptop 16 shipment timing":
There are also a few issues that we are still tracking, but which we aren’t holding production for:
Secondary SSD may disappear - We found that the secondary SSD (the M.2 2230 SSD) may not be visible on some boots or may rarely disappear during sleep. We’ve debugged this issue with AMD, who have traced it back to a bug in the platform firmware. They are releasing the fix to us, which we will include in a BIOS update. We’ll share BIOS updaters for Windows and Linux when this is ready, as well as roll the BIOS into the factory for new system production.

That was before the first shipment of FW16, so it’s not 100% certain whether it got into the BIOS for your FW16 or not.

Breizh · March 5, 2024, 9:07am

Oh, that’s true, I forgot about this issue. Both are affected, but it was far more common on the 2230, yes (nvme0 and nvme1 sometimes switch names). I admit that I didn’t see the error when it was the other one that was inaccessible, because I couldn’t get shell access to check the exact error in those cases. But the symptoms were the same.
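
Since the nvme0/nvme1 names can swap between boots, identifying each controller by model and serial is more reliable than going by name. Here is a small Python sketch, assuming the usual sysfs attributes under /sys/class/nvme; the function name `list_nvme_controllers` is hypothetical.

```python
from pathlib import Path


def list_nvme_controllers(sysfs_root: str = "/sys/class/nvme") -> dict:
    """Map each controller name (nvme0, nvme1, ...) to its model, serial and firmware."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # no NVMe controllers, or not a Linux system
    for ctrl in sorted(root.iterdir()):
        info = {}
        for attr in ("model", "serial", "firmware_rev"):
            attr_file = ctrl / attr
            # Attributes are padded with spaces in sysfs, so strip them.
            info[attr] = attr_file.read_text().strip() if attr_file.is_file() else None
        result[ctrl.name] = info
    return result


if __name__ == "__main__":
    for name, info in list_nvme_controllers().items():
        print(name, info)
```

Running this right after a failed boot (when a shell is available) would show which physical drive is behind the name in the error message.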

There are no errors in the smartctl logs, nor after a full test. I don’t have any data on the drives since they are new (they were shipped with the computer, like I said). I didn’t try to re-seat them since they are working correctly in a lot of cases… just not the one I want.

It would be great to know whether they have fixed the BIOS as announced or whether it’s still a work in progress; in the latter case I’ll redo the tests after the upgrade.

Thanks.

Anachron · March 5, 2024, 9:35am

I had a similar issue, with no smartctl or filesystem errors found. But my NVMe would randomly go to sleep no matter which kernel arguments I passed.

I RMA-ed it and replaced it with a Samsung Pro one, which doesn’t have this issue.

Edit: I even upgraded the firmware, without success.

GaKu999 · May 19, 2024, 12:01am

Reporting the same issue, with more details after continued testing (and even after getting a replacement NVMe, thinking the previous one was faulty).
The drive is a WD_BLACK 1TB SN770M 2230. The core of the issue is a controller error during heavy parallelized I/O. I have been unable to prevent these errors with any kernel cmdline args thus far, and it seems even some Windows users are dealing with similar errors.
I’m still trying to narrow down the actual culprit, but in case this information is relevant to anyone within Framework, I’ll leave this here.

It appears to be oddly specific, at least regarding which sectors are being accessed in parallel, and the fact that it happens with two different drives, without any errors reported by SMART, is most strange.

And yes, BIOS has already been updated to 0.0.3.3 with fwupd.

GaKu999 · May 19, 2024, 1:32am

I have tested said NVMe in an NVMe-to-USB enclosure and failed to reproduce the error…
I also tested an older and smaller NVMe in the 2230 slot, with a randread test via fio with thousands of jobs, and failed to reproduce the issue.

It looks like a compatibility issue, or something else on a lower level.
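
For anyone wanting to try a similar fio random-read stress test, here is a Python sketch that only builds and prints a command line. The block size, queue depth, I/O engine, job count and runtime are illustrative assumptions, not the parameters actually used in the test above, and /dev/nvme0n1 is a placeholder device path.

```python
import shlex


def fio_randread_cmd(device: str, numjobs: int = 1000, runtime_s: int = 60) -> list:
    """Build a read-only fio random-read command line for a block device."""
    return [
        "fio",
        "--name=randread-stress",    # job name, arbitrary
        f"--filename={device}",
        "--readonly",                # make fio refuse to issue any writes
        "--rw=randread",
        "--bs=4k",
        "--direct=1",                # bypass the page cache
        "--ioengine=libaio",
        "--iodepth=32",
        f"--numjobs={numjobs}",      # number of parallel jobs
        "--time_based",
        f"--runtime={runtime_s}",    # seconds
        "--group_reporting",
    ]


if __name__ == "__main__":
    # Prints the command instead of running it, so it can be reviewed first.
    print(shlex.join(fio_randread_cmd("/dev/nvme0n1")))
```

Watching the kernel log in another terminal while the test runs would show whether the controller reset message appears under load.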

Breizh · May 19, 2024, 3:51am

Good to know that I’m not alone and that you could get more details about it.

I’m still having the issue with the new BIOS too. But it works fine with ext4, so I’m just using the computer with my systems on ext4 for now.