[RESPONDED] Sudden loss of storage while laptop running

baptistemm · December 8, 2023, 7:11am

Hello,

(AMD batch 7)

My laptop suddenly went to a black screen and afterwards displayed this message.

I opened the laptop, but the storage was properly plugged in.

Anyway, I removed the screw, reseated the storage, and the laptop was able to boot again. However, I’m a bit worried about a hardware failure in either the laptop or the storage.

2disbetter · December 8, 2023, 7:44am

Was the NVMe drive loose at all? Reseating it was a good idea. If this happens again, I would open a ticket with support, as you might have a hardware issue.

baptistemm · December 8, 2023, 7:59am

Hi @2disbetter

No, it was not loose. The screw was properly tightened, so I assume the drive could not move.

2disbetter · December 8, 2023, 8:03am

Ok, well hopefully it was just a slight glitch. Now that you have reseated the drive, and the connection is good and solid, we’ll see if it holds. It should. If it does not, contact support please. They’ll be able to remedy the problem at that point.

Jorg_Mertin · December 8, 2023, 8:32am

I would also check the SMART log. Under Linux, you can use “smartctl --all /dev/nvme0” or the nvme tools with “sudo nvme smart-log /dev/nvme0”.

Check the device names with “lsblk” or “nvme list”.

I just had a Samsung 980 NVMe disk fail on my server because of critical media errors that happened to be at the “beginning” of the disk. They invalidated all existing boot blocks, so the system no longer recognized the disk.
Usually, NVMe SSDs will remap a bad block transparently if possible (that is, when it is not in use). But if bad blocks appear on a fairly new disk, this can turn fatal fast.

Make sure there are no Media and Data Integrity Errors showing up in your log.
See this one as an example:

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    97%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    28,856,115 [14.7 TB]
Data Units Written:                 49,244,732 [25.2 TB]
Host Read Commands:                 268,782,702
Host Write Commands:                672,738,666
Controller Busy Time:               2,586
Power Cycles:                       106
Power On Hours:                     5,141
Unsafe Shutdowns:                   17
Media and Data Integrity Errors:    38   <=== This
Error Information Log Entries:      38   <=== This shows the existing log entries
Warning  Comp. Temperature Time:    226
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               35 Celsius
Temperature Sensor 2:               39 Celsius
Thermal Temp. 2 Transition Count:   59326
Thermal Temp. 2 Total Time:         18032

According to a Samsung engineer, blocks that can be remapped should not show up as media and data integrity errors. If such errors do show up, something is wrong with the silicon and the drive needs replacing.
I got a replacement disk from Samsung in 5 days.
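
For anyone who wants to automate the check described above, here is a minimal Python sketch. It assumes the text layout that smartctl prints in the example log (a field name, a colon, then a number). The names find_error_counters, has_media_problems and WATCHED_FIELDS are my own choices for illustration and are not part of smartmontools.

```python
import re

# Fields from the smartctl NVMe health log that should normally stay at zero.
# (Illustrative selection, taken from the two lines marked in the log above.)
WATCHED_FIELDS = (
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def find_error_counters(smart_text):
    """Return {field: count} for each watched field found in smartctl text output."""
    counters = {}
    for line in smart_text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no "name: value" pair on this line
        name = name.strip()
        if name in WATCHED_FIELDS:
            # Take the leading number; ignore anything after it on the line.
            match = re.match(r"\s*([\d,]+)", value)
            if match:
                counters[name] = int(match.group(1).replace(",", ""))
    return counters


def has_media_problems(smart_text):
    """True if any watched counter is above zero."""
    return any(count > 0 for count in find_error_counters(smart_text).values())


# Three lines copied from the failing Samsung 980 log above.
SAMPLE = """\
Unsafe Shutdowns:                   17
Media and Data Integrity Errors:    38
Error Information Log Entries:      38
"""

print(find_error_counters(SAMPLE))
print("replace the drive?", has_media_problems(SAMPLE))
```

In practice you would feed it the output of “sudo smartctl --all /dev/nvme0” instead of the SAMPLE string.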

2disbetter · December 8, 2023, 9:23am

If this is on Linux, a few more details, such as the distro you are using, would be helpful for this thread.

baptistemm · December 8, 2023, 4:39pm

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    250,961 [128 GB]
Data Units Written:                 619,864 [317 GB]
Host Read Commands:                 1,826,774
Host Write Commands:                7,036,369
Controller Busy Time:               14
Power Cycles:                       185
Power On Hours:                     4
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

I use Fedora 39.

Richard_Lee1 · December 8, 2023, 5:01pm

If I were you, I’d clean the contact with electronics cleaner or just alcohol.
I’ve actually run into issues like this with PCIe cards.
And I see too many people putting their fingers on these little pads.

Jorg_Mertin · December 10, 2023, 12:52pm

Power-on hours are very low, but power cycles are quite high. All spares are still available.
You could possibly initiate a self-test (short and long). But IMHO that will not show us anything new. Just make sure the contacts are clean as @Richard_Lee1 mentioned and see how it goes.
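
If you do want to run the self-tests, nvme-cli has a device-self-test subcommand, where self-test code 1 starts a short test and 2 starts an extended one. A small Python sketch of how the commands could be assembled follows. It assumes nvme-cli is installed and that the drive is /dev/nvme0 as earlier in the thread. The helper names self_test_command and self_test_log_command are hypothetical.

```python
import shlex

# NVMe self-test codes as accepted by "nvme device-self-test --self-test-code".
SELF_TEST_CODES = {"short": 1, "extended": 2}


def self_test_command(device="/dev/nvme0", kind="short"):
    """Build the argv list that starts a short or extended device self-test."""
    code = SELF_TEST_CODES[kind]
    return ["sudo", "nvme", "device-self-test", device, f"--self-test-code={code}"]


def self_test_log_command(device="/dev/nvme0"):
    """Build the argv list that reads back the self-test results."""
    return ["sudo", "nvme", "self-test-log", device]


# Only print the commands here; run them by hand once you have checked them.
for argv in (
    self_test_command(kind="short"),
    self_test_command(kind="extended"),
    self_test_log_command(),
):
    print(shlex.join(argv))
```

The sketch only prints the command lines, so nothing touches the drive until you run them yourself.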

baptistemm · March 18, 2024, 6:59am

This happened again last night: I was on my laptop and it rebooted.
The storage device was not visible.
I left it like that on my desk, and this morning it boots …

(⎈|minikube:default)➜  ~ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning			: 0
temperature				: 32 °C (305 K)
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 0%
endurance group critical warning summary: 0
Data Units Read				: 1396065 (714.79 GB)
Data Units Written			: 3658385 (1.87 TB)
host_read_commands			: 12529858
host_write_commands			: 66415326
controller_busy_time			: 197
power_cycles				: 379
power_on_hours				: 68
unsafe_shutdowns			: 17
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0
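
A side note on the two logs posted in this thread: comparing the December and March counters is simple arithmetic, sketched below in Python. The values are copied from the two outputs above. The dictionary keys and the counter_deltas helper are hypothetical names used only for this illustration.

```python
# Counters copied from the two SMART logs posted in this thread.
december = {
    "power_cycles": 185,
    "power_on_hours": 4,
    "unsafe_shutdowns": 2,
    "media_errors": 0,
}
march = {
    "power_cycles": 379,
    "power_on_hours": 68,
    "unsafe_shutdowns": 17,
    "media_errors": 0,
}


def counter_deltas(old, new):
    """Return how much each counter grew between two snapshots."""
    return {key: new[key] - old[key] for key in old}


deltas = counter_deltas(december, march)
for name, change in deltas.items():
    print(f"{name}: +{change}")
```

Between the two logs the media error count stays at 0, while unsafe shutdowns go up by 15 over 64 additional power-on hours.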

Jorg_Mertin · March 18, 2024, 8:16am

What does sudo journalctl -r show? (It displays the system log in reverse order.)
What you show us is the disk’s SMART log, which won’t help.
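
To narrow the journal down to storage-related kernel messages, something like the following Python sketch could help. It assumes systemd's journalctl is available, that the journal from the previous boot was kept on disk, and it uses the -k (kernel messages), -r (reverse order), --no-pager and --boot options. The keyword list and the function names are my own guesses at what is worth looking for, not an official list.

```python
import shlex
import subprocess

# Substrings worth looking for in kernel messages (illustrative, not exhaustive).
KEYWORDS = ("nvme", "pcie", "i/o error")


def journal_command(boot_offset=-1):
    """Build a journalctl call: kernel messages of an earlier boot, newest first."""
    return ["journalctl", "-k", "-r", "--no-pager", f"--boot={boot_offset}"]


def storage_lines(journal_text, keywords=KEYWORDS):
    """Keep only the lines that mention one of the keywords (case-insensitive)."""
    return [
        line
        for line in journal_text.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


def read_previous_boot():
    """Run journalctl and return the filtered lines (may need sudo on some systems)."""
    result = subprocess.run(
        journal_command(), capture_output=True, text=True, check=False
    )
    return storage_lines(result.stdout)


# Show the command that read_previous_boot() would run.
print(shlex.join(journal_command()))
```

Calling read_previous_boot() on the affected laptop returns the matching lines, if there are any.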

baptistemm · March 18, 2024, 9:11am

That was the first thing I looked at, and there was nothing relevant.

Jorg_Mertin · March 18, 2024, 9:34am

Ok. Go into the BIOS and load the failsafe defaults. Reboot and make sure it boots.
Shut down (really shut down, and wait 35 seconds), then boot into the BIOS again and load the performance defaults.

Also, check whether any BIOS updates or firmware updates for the drive are available.
That is all I can advise, apart from putting a different drive into the laptop and seeing whether that one shows the same symptoms.

Matt_Hartley · March 22, 2024, 12:29am

Not sure where you purchased the drive from, but if it happens again, it may be ticket worthy if the connections between the NVMe and the slot look healthy.