[RESPONDED] Stuck in emergency boot mode

phrwn · February 13, 2024, 8:37pm

FW13 AMD
SSD: Western Digital Black SN850X 4 TB M.2-2280 PCIe 4.0 X4 NVME
RAM: Crucial CT2K16G56C46S5 32 GB (2 x 16 GB) DDR5-5600 SODIMM CL46 Memory
OS: Fedora 39 Workstation
Everything is updated to the latest versions available through the GUI software manager.

I’m new to Linux/Fedora/Framework as of a couple of weeks. Everything was fine until I installed Gnome Shell Extensions by running:

sudo dnf install libappindicator-gtk3 gnome-shell-extension-appindicator gnome-extensions-app

I then restarted and got the bootloader screen with four Fedora versions as options, all of which boot to emergency mode. So I’m stuck at ‘Press Enter for maintenance (or press Control-D to continue):’.

I presume it may have been an update that installed when I rebooted, not Gnome Extensions, but I don’t know. Or the SSD has corrupted/died.

I can get journalctl to work. At the end it’s showing BTRFS errors:

I just reseated the SSD, as per a suggestion in another thread, but it didn’t help. Not sure what to do next.
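
For anyone landing here in the same state: from the emergency shell, the journal can be narrowed down to just the Btrfs lines. This is only a minimal sketch, assuming the journal for the current boot is readable; the helper name btrfs_lines is made up for illustration.

```shell
# Hypothetical helper: keep only the lines that mention Btrfs (case-insensitive).
# Reads log text on stdin, so it can be fed from journalctl or from a saved file.
btrfs_lines() {
  grep -i 'btrfs' || true   # "|| true": finding no matches is not treated as a failure
}

# Typical use from the emergency shell (current boot, kernel messages only):
#   journalctl -b -k --no-pager | btrfs_lines
```

The filtered lines are usually short enough to copy into a post by hand if no screenshot is possible.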

Matt_Hartley · February 13, 2024, 8:48pm

Welcome to the community!

You could try booting into recovery mode and try to repair the file system. If it was me, honestly, I’d reinstall and before adding my personal data back from a backup, test the steps again that brought me to this state.

I’d also check the disk health with Disks (Gnome Disks).

phrwn · February 13, 2024, 11:51pm

I found one sector that needed repairing using Disks, but that hasn’t helped. I didn’t get anywhere in recovery mode.

I can reinstall, but I see the SSD is listed twice when I boot to the live media USB, which seems problematic and/or related to this issue. I guess I’ll do a full clean install and see what happens. Bit unsettling to be at this point with no idea why it happened or when it’s going to happen again. I chose components and OS principally for stability and reliability and seem to have ended up in the same place that’s made me wary of Linux up until this point.
Still, I’ll persevere.

cmstew · February 14, 2024, 5:06am

Not sure to what extent you tried repairing the disk, but it seems to be a superblock issue to me. If you haven’t wiped yet, I’d give this guide a try.

https://www.cyberciti.biz/faq/recover-bad-superblock-from-corrupted-partition/

phrwn · February 14, 2024, 7:29am

Thanks, I’ll save that for next time. I ended up reinstalling. Everything is working now, but I’m nervous about this because it was pretty disruptive, I have no clue why it happened, and I won’t be surprised if it happens again.

Jorg_Mertin · February 14, 2024, 9:17am

Just check the health of your disk.
With lsblk check which devices are in your system.

# lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sr0          11:0    1  1024M  0 rom  
nvme1n1     259:0    0 465.8G  0 disk 
├─nvme1n1p1 259:1    0   512M  0 part /boot/efi
└─nvme1n1p2 259:2    0 465.3G  0 part /
nvme0n1     259:3    0 931.5G  0 disk 
└─nvme0n1p1 259:4    0 931.5G  0 part /data

Then issue the command: sudo smartctl -a /dev/[yourdevicename]
On my system it would be: “sudo smartctl -a /dev/nvme0n1”
And paste the output here.
What you need to check is - as per the following example of a bad SSD I had:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    97%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    28,856,115 [14.7 TB]
Data Units Written:                 49,244,732 [25.2 TB]
Host Read Commands:                 268,782,702
Host Write Commands:                672,738,666
Controller Busy Time:               2,586
Power Cycles:                       106
Power On Hours:                     5,141
Unsafe Shutdowns:                   17
**Media and Data Integrity Errors:    38**
**Error Information Log Entries:      38**
Warning  Comp. Temperature Time:    226
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               35 Celsius
Temperature Sensor 2:               39 Celsius
Thermal Temp. 2 Transition Count:   59326
Thermal Temp. 2 Total Time:         18032

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
 0         38    12  0xe1cc  0xc502  0x000    597258488     1     -
 1         37    10  0xf0d3  0xc502  0x000    597258488     1     -
 2         36    12  0xe1c1  0x4502  0x000    597258488     1     -
 3         35     1  0xb26b  0xc502  0x000   1000537344     1     -
 4         34     3  0x80c5  0xc502  0x000   1000537344     1     -
 5         33     1  0xc26a  0xc502  0x000   1000537344     1     -
 6         32     3  0x80c4  0xc502  0x000   1000535040     1     -
 7         31     3  0xc0c3  0xc502  0x000   1000528960     1     -
 8         30     1  0xf269  0xc502  0x000   1000528960     1     -
 9         29     3  0xb0c2  0xc502  0x000   1000528960     1     -
10         28     1  0xd3ad  0xc502  0x000   1000537344     1     -
11         27     1  0x33ac  0xc502  0x000   1000537344     1     -
12         26     4  0xe1be  0x4502  0x000   1000537344     1     -
13         25     1  0x43a6  0xc502  0x000   1000535040     1     -
14         24     1  0xc3a4  0x4502  0x000   1000535040     1     -
15         23     1  0xb399  0xc502  0x000   1000528960     1     -
... (20 entries not shown)

Check the Media and Data Integrity Errors. These will tell you if your SSD is dying.
They should be 0 for a healthy SSD. If it shows anything other than 0, you are starting to have problems. By default, the electronics should remap bad blocks and use other cells. But if that error shows up, it means the cells are slowly degrading.
In my disk’s example above, I had written only 25 TB of data, while the manufacturer rates it for 600 TBW. They replaced the disk straight away, but the problems I had because of it were a PITA.
Apparently, a firmware update fixed the disk’s behavior. I have updated the firmware on all my other disks (Samsung 980 series, Gen3 SSDs in my server).
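
To pull just those counters out of the full report, something like the following could work. It is a rough sketch, not an official smartctl feature: the helper name smart_field is made up, and it assumes the plain-text "Name: value" layout shown in the output above.

```shell
# Hypothetical helper: print the leading number of one field from
# "smartctl -a" text output supplied on stdin.
smart_field() {
  awk -F: -v key="$1" '
    {
      name = $1
      gsub(/^[ *]+/, "", name)          # tolerate leading spaces or pasted bold markers
      if (name == key) {
        value = $2
        sub(/^[ \t]+/, "", value)       # drop the column padding
        gsub(/,/, "", value)            # 49,244,732 -> 49244732
        if (match(value, /^[0-9]+/))
          print substr(value, RSTART, RLENGTH)
      }
    }'
}

# Typical use (needs root and a real device, so left as a comment):
#   sudo smartctl -a /dev/nvme0n1 | smart_field "Media and Data Integrity Errors"
```

Anything other than 0 for "Media and Data Integrity Errors" is the signal described above.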

phrwn · February 14, 2024, 5:39pm

Thank you - that’s very helpful. I didn’t even have smartctl installed. I’ll paste my results as soon as I figure out how to format the output properly as you have done. How did you do that?

Jorg_Mertin · February 14, 2024, 5:58pm

The text mode is pretty simple: put 3x ` (UPDATE: sorry, the left tick, i.e. the backtick) before and after the text to be formatted.

phrwn · February 14, 2024, 6:05pm

Ok 3x ’ didn’t work for me, but here’s what I got:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.7.4-200.fc39.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD_BLACK SN850X 4000GB
Serial Number:                      23402U800222
Firmware Version:                   624331WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b4c1dd74f
Local Time is:                      Wed Feb 14 09:29:41 2024 PST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     94 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W    9.00W       -    0  0  0  0        0       0
 1 +     6.00W    6.00W       -    0  0  0  0        0       0
 2 +     4.50W    4.50W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3100   11900
 4 -   0.0050W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        26 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,847,318 [945 GB]
Data Units Written:                 4,226,473 [2.16 TB]
Host Read Commands:                 10,346,282
Host Write Commands:                30,109,323
Controller Busy Time:               74
Power Cycles:                       150
Power On Hours:                     12
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      262
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x4002)

Jorg_Mertin · February 14, 2024, 6:10pm

Contradictory…

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        26 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,847,318 [945 GB]
Data Units Written:                 4,226,473 [2.16 TB]
Host Read Commands:                 10,346,282
Host Write Commands:                30,109,323
Controller Busy Time:               74
Power Cycles:                       150
Power On Hours:                     12
Unsafe Shutdowns:                   14
**Media and Data Integrity Errors:    0**
**Error Information Log Entries:      262**
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

It seems the logs are not kept after a reboot, because it says it has log entries and shows none.

I would run a long test of the device with smartctl --test=long /dev/[devicename] and look at the logs after that.
It will take a while though (on HDDs it could take a day or more).

phrwn · February 14, 2024, 6:37pm

When I run a long test I just get the line that I get at the end of the normal test and nothing else:

Read Self-test Log failed: Invalid Field in Command (0x4002)

because nothing’s easy.

Jorg_Mertin · February 14, 2024, 6:54pm

You tried as root?

phrwn · February 14, 2024, 7:19pm

Yep

sudo smartctl --test=long /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.7.4-200.fc39.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Read Self-test Log failed: Invalid Field in Command (0x4002)

Oh wait, that’s not as root is it? Trying to access root now…
Ok - no still not working. Same as before.

EDIT: removing the ‘n1’ from the end of the command seems to have initiated the long test, although it’s hard to tell if it’s running since it issues another input prompt as soon as I run it…

EDIT2: I’m learning a lot today:
Self-test status: Extended self-test in progress (9% completed)
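
What worked here can be sketched as two small helpers. Both names (nvme_controller, selftest_progress) are made up for illustration, and the sketch assumes what was observed in this thread: on this drive with smartctl 7.4 the self-test started against the controller node (/dev/nvme0) rather than the namespace node (/dev/nvme0n1), and the status line looks like the one above. Other drives or smartctl versions may behave differently.

```shell
# Hypothetical helper: map an NVMe namespace or partition node to its
# controller node, e.g. /dev/nvme0n1 or /dev/nvme0n1p2 -> /dev/nvme0.
nvme_controller() {
  printf '%s\n' "${1%n[0-9]*}"   # strip the shortest trailing "n<digit>..." part
}

# Hypothetical helper: print the percentage from a "Self-test status" line
# read on stdin; prints nothing when no test is in progress.
selftest_progress() {
  sed -n 's/^Self-test status:.*(\([0-9][0-9]*\)% completed).*/\1/p'
}

# Typical use (needs root and real hardware, so left as comments):
#   ctrl=$(nvme_controller /dev/nvme0n1)
#   sudo smartctl --test=long "$ctrl"
#   sudo smartctl -a "$ctrl" | selftest_progress
```

The test runs inside the drive in the background, which is why the prompt comes back immediately; re-running the last command shows how far along it is.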

phrwn · February 14, 2024, 11:18pm

Long test complete.

Model Number:                       WD_BLACK SN850X 4000GB
Serial Number:                      23402U800222
Firmware Version:                   624331WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b4c1dd74f
Local Time is:                      Wed Feb 14 15:15:36 2024 PST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     94 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W    9.00W       -    0  0  0  0        0       0
 1 +     6.00W    6.00W       -    0  0  0  0        0       0
 2 +     4.50W    4.50W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3100   11900
 4 -   0.0050W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,913,630 [979 GB]
Data Units Written:                 4,238,416 [2.17 TB]
Host Read Commands:                 10,621,491
Host Write Commands:                30,473,889
Controller Busy Time:               78
Power Cycles:                       154
Power On Hours:                     14
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      262
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                  14            -     -   -   -    -

Completed without error sounds good. Seems like the drive is ok or am I missing something?

Jorg_Mertin · February 15, 2024, 8:10am

Yes - but what I wonder about is this line:

Error Information Log Entries:      262

Check on the manufacturer’s site whether there is a firmware update for the drive.
Just to be sure.

phrwn · February 15, 2024, 7:52pm

I have successfully updated the firmware (WD don’t make that easy on Linux). It still shows the same Error Information Log Entries on the quick test. Running the long test now, but presume results would be the same.
Would you RMA this drive?

EDIT: reading around a bit, this seems fairly common, with drives often gaining +1 error log entry per boot. That may have been the case with mine, with something misconfigured. Since the reinstall the number has not been going up, so I’ll see how it goes for now.
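
One way to keep an eye on whether the number keeps climbing is to save it and compare on the next check. This is only a sketch: the helper name track_error_entries and the state file path in the usage comment are made up, and it assumes the plain-text smartctl -a layout shown earlier in the thread.

```shell
# Hypothetical helper: compare the "Error Information Log Entries" value in
# "smartctl -a" output (stdin) with the value stored in a state file ($1),
# print both plus the difference, then store the new value.
track_error_entries() {
  state_file=$1
  current=$(awk -F: '$1 == "Error Information Log Entries" { gsub(/[^0-9]/, "", $2); print $2; exit }')
  [ -n "$current" ] || { echo "counter not found" >&2; return 1; }
  if [ -r "$state_file" ]; then
    previous=$(cat "$state_file")
  else
    previous=$current            # first run: nothing to compare against yet
  fi
  printf '%s\n' "$current" > "$state_file"
  echo "previous=$previous current=$current delta=$((current - previous))"
}

# Typical use after each boot (path is only an example):
#   sudo smartctl -a /dev/nvme0 | track_error_entries "$HOME/.cache/nvme-error-entries"
```

A delta of 0 across several boots matches what is described above; a steadily growing delta would be worth mentioning to the manufacturer.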

Thanks to all who helped in this thread, as well as the thread on how to update firmware on WD drives.

Jorg_Mertin · February 16, 2024, 8:27am

I would at least ask the manufacturer’s support whether that is normal.
Depending on their response, I would ask for an RMA, as apparently their drive is not functioning correctly, or for a fix.

The question is: what can you “misconfigure” with an NVMe drive?
It should not affect the logs. The filesystem (ext4, btrfs, zfs, etc.) maybe, but that is OS-based.
We’re talking about hardware here, which is handled by the manufacturer’s firmware. So this is a level below anything we can influence.

phrwn · February 16, 2024, 4:34pm

I’ll send a message to WD and see what they say.