FW13 AMD
SSD: Western Digital Black SN850X 4 TB M.2-2280 PCIe 4.0 x4 NVMe
RAM: Crucial CT2K16G56C46S5 32 GB (2 x 16 GB) DDR5-5600 SODIMM CL46 Memory
OS: Fedora 39 Workstation
Everything is updated to the latest versions available through the GUI software manager.
I’m new to Linux/Fedora/Framework as of a couple of weeks ago. Everything was fine until I installed GNOME Shell Extensions from the terminal.
I then restarted and got the bootloader screen with four Fedora versions as options, all of which boot to emergency mode. So I’m stuck at ‘Press Enter for maintenance (or press Control-D to continue):’.
I presume it may have been an update that installed when I rebooted, not GNOME Extensions, but I don’t know. Or the SSD has corrupted or died.
I can get journalctl to work. At the end it’s showing BTRFS errors.
You could try booting into recovery mode and repairing the file system (rough sketch below). If it were me, honestly, I’d reinstall, and before adding my personal data back from a backup, test again the steps that brought me to this state.
I’d also check the disk health with Disks (GNOME Disks).
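For the repair route, from a live USB something like this is a safe place to start (the device name is just an example - check yours with lsblk; `btrfs check` without `--repair` only reads, it doesn’t modify anything):

```
# From a live USB: read-only consistency check of the btrfs root partition
sudo btrfs check --readonly /dev/nvme0n1p2
```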
I found one sector that needed repairing using Disks, but that hasn’t helped. I didn’t get anywhere in recovery mode.
I can reinstall, but I see the SSD is listed twice when I boot the live-media USB, which seems problematic and/or related to this issue. I guess I’ll do a full clean install and see what happens. It’s a bit unsettling to be at this point with no idea why it happened or when it’s going to happen again. I chose components and OS principally for stability and reliability, and I seem to have ended up in the same place that made me wary of Linux up until this point.
Still, I’ll persevere.
Not sure to what extent you tried repairing the disk, but it seems like a superblock issue to me. If you haven’t wiped yet, I’d give this guide a try.
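In case the link rots: the gist is recovering the primary superblock from one of btrfs’s backup copies, roughly like this (device name is an example again; if the data matters, take a full image of the drive first):

```
# Check all superblock copies and restore the primary from a good backup
sudo btrfs rescue super-recover -v /dev/nvme0n1p2
```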
Thanks, I’ll save that for next time. I ended up reinstalling. Everything is working now, but I’m nervous about this: it was pretty disruptive, I have no clue why it happened, and now I won’t be surprised if it happens again.
Just check the health of your disk.
With lsblk, check which devices are in your system:

```
# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sr0            11:0    1  1024M  0 rom
nvme1n1       259:0    0 465.8G  0 disk
├─nvme1n1p1   259:1    0   512M  0 part /boot/efi
└─nvme1n1p2   259:2    0 465.3G  0 part /
nvme0n1       259:3    0 931.5G  0 disk
└─nvme0n1p1   259:4    0 931.5G  0 part /data
```
Then issue the command: `sudo smartctl -a /dev/[yourdevicename]`

On my system it would be: `sudo smartctl -a /dev/nvme0n1`

And paste the output here.
What you need to check - going by a bad SSD I had as the example - is the Media and Data Integrity Errors counter. It will tell you if your SSD is dying.
It should be 0 for a healthy SSD. If it shows anything other than 0, you are starting to have problems. By default the electronics remap bad blocks to spare cells, but once that counter climbs, it means the cells are slowly degrading.
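If you just want the relevant counters out of the full report, something along these lines should do it:

```
# Pull just the error counters from the SMART report
sudo smartctl -a /dev/nvme0n1 | grep -E 'Media and Data|Error Information Log'
```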
In my disk’s case, I had written only 25 TB of data, while the manufacturer rates it for 600 TBW. They replaced the disk straight away, but the problems I had because of it were a PITA.
Apparently, a firmware update fixed the disk’s behavior. I have since updated the firmware on all my other disks as well (Samsung 980 series, Gen3 SSDs in my server).
Thank you - that’s very helpful. I didn’t even have smartctl installed. I’ll paste my results as soon as I figure out how to format the output properly as you have done. How did you do that?
```
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        26 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,847,318 [945 GB]
Data Units Written:                 4,226,473 [2.16 TB]
Host Read Commands:                 10,346,282
Host Write Commands:                30,109,323
Controller Busy Time:               74
Power Cycles:                       150
Power On Hours:                     12
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      262
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x4002)
```
**Media and Data Integrity Errors: 0**
**Error Information Log Entries: 262**

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
It seems the logs are not kept after a reboot, because it says it has log entries but shows none.
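If nvme-cli is installed, you can also dump the controller’s error log directly (device name is my example again) to see whether those entries contain anything:

```
# Show the first 16 entries of the NVMe error log
sudo nvme error-log /dev/nvme0 -e 16
```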
I would run a long test of the device with `smartctl --test=long /dev/[devicename]` and look at the logs after that (command below).
It will take a while, though - on HDDs it could take a day or more.
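Once it finishes, the outcome should land in the self-test log - assuming the drive supports NVMe self-tests at all (smartmontools 7.0+):

```
# Read the device self-test log after the long test completes
sudo smartctl -l selftest /dev/nvme0
```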
```
sudo smartctl --test=long /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.7.4-200.fc39.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Read Self-test Log failed: Invalid Field in Command (0x4002)
```
Oh wait, that’s not as root, is it? Trying to access root now…
OK - no, still not working. Same as before.
EDIT: removing the ‘n1’ from the end of the command seems to have initiated the long test, although it’s hard to tell whether it’s running, since it just returns to the prompt as soon as I run it…
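That fits the controller/namespace split: `/dev/nvme0` is the controller device, which is where admin commands like self-tests go, while `/dev/nvme0n1` is the namespace block device. If nvme-cli is installed, you can list what’s there:

```
# List NVMe devices with model and firmware version
sudo nvme list
```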
I have successfully updated the firmware (WD don’t make that easy on Linux). It still shows the same Error Information Log Entries count on the quick test. Running the long test now, but I presume the results will be the same.
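For anyone else doing this: before hunting down vendor tools, it may be worth checking whether the firmware is published on LVFS - fwupd ships with Fedora Workstation, so roughly:

```
# Refresh LVFS metadata, then check for and apply firmware updates
sudo fwupdmgr refresh
sudo fwupdmgr get-updates
sudo fwupdmgr update
```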
Would you RMA this drive?
EDIT: reading around a bit, this seems fairly common - drives often gain one error-log entry per boot. That may have been the case with mine, with something misconfigured. Since the reinstall the number has not been going up, so I’ll see how it goes for now.
I would at least ask the manufacturer’s support whether that is normal.
Depending on their response, I would ask either for an RMA - since apparently the drive is not functioning correctly - or for a fix.
The question is: what can you “misconfigure” on an NVMe drive?
The filesystem (ext4, btrfs, zfs, etc.) maybe - but that is OS-level and should not affect these logs.
We’re talking about the hardware here, which is handled by the manufacturer’s firmware - a level below anything we can influence.