I have a problem that only recently emerged: sometimes my NVMe SSD just becomes inaccessible.
I can’t really troubleshoot it, because once this state is reached, I can’t use a terminal anymore.
I did a quick search online, but most people say it’s either the SSD dying (I checked the SMART values and they seemed fine, and e2fsck reported no problems either) or a firmware problem (there is no new firmware for my Lexar NM710).
Does anyone know how I could troubleshoot this problem?
Some more things I know:
Nextcloud-Desktop has already died by the time I notice it.
Most of the time I only notice it when trying to save a file or trying to use a terminal
Kernel 6.9.1 worked without problems for a month; the problem started a week or two ago
I was running a tuxedo-neon-ubuntu frankenstein distro until this Monday (I had the sources all mixed up), but that setup had already been in place for 2 months
I can still power off with SysRq, although filesystem syncing (s) and rebooting (b) aren’t working
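One way to narrow down symptoms like the ones listed above is to look for NVMe-related kernel messages from around the time of the hang. The following is only a rough sketch, and it assumes the kernel messages can be captured at all (a journal stored on the affected SSD may well miss them). The keyword list and the function name `nvme_trouble_lines` are my own illustrative choices, not an official list of kernel messages; exact wording varies between kernel versions.

```python
import re

# Illustrative keywords only; real kernel messages differ between versions.
KEYWORDS = ("timeout", "reset", "controller is down", "not ready", "i/o error")

# Matches "nvme", "nvme0", "nvme0n1", ... at a word boundary.
NVME = re.compile(r"\bnvme\d*", re.IGNORECASE)


def nvme_trouble_lines(log_text):
    """Return the log lines that mention an NVMe device and a trouble keyword."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if NVME.search(line) and any(keyword in lowered for keyword in KEYWORDS):
            hits.append(line)
    return hits
```

The text to scan could come, for example, from `journalctl -k -b -1` (kernel messages of the previous boot) saved to a file and read back with `open(path).read()`. An empty result does not prove the drive is healthy; it may just mean nothing was logged before the disk vanished.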
Don’t trust that SSD. It might still be good, but in my experience, the symptoms that you’re reporting are a prelude to worse things to come. Maybe something gradual, but maybe a sudden and full no-workee-anymore. Keep your backups up-to-date, and I’d suggest using a filesystem that verifies the integrity of every file with checksums, like Btrfs or ZFS, if at all possible.
With that out of the way: sorry, I don’t know of any way to troubleshoot something like this that happens at random, and that you can’t do anything with once it does.
I just recently had to replace a failing HDD in a RAID 5, despite the fact that all of the S.M.A.R.T. values were passing OK. I ran a more in-depth device test, and both it and another drive in the array failed gloriously. I replaced all drives in the array (given that they were all the same model, bought at the same time, etc.) and was able to avoid losing any data. I know it’s not an SSD, but the lesson may still apply: the S.M.A.R.T. failure heuristics are not the be-all and end-all of device diagnostics.
I have also had to replace an SSD in another RAID that did begin to trip the S.M.A.R.T. detectors, even though it was operating perfectly fine as far as I could tell. So I’ve seen the flip side of the coin as well.
The smartctl long test is what I ran; perhaps running it could help you diagnose your issue as well. Luckily, in that situation it could all be expensed to the business, so if a drive simply looked like it might be failing, a couple hundred bucks was a no-brainer spend compared to the downtime and loss it could cost. https://www.cyberciti.biz/tips/linux-find-out-if-harddisk-failing.html
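For anyone who wants to look beyond the overall pass/fail verdict, here is a minimal sketch that pulls a few health counters out of a report saved with `smartctl -j -a <device>` (JSON output needs a reasonably recent smartmontools). It assumes the report contains the `nvme_smart_health_information_log` section that smartctl emits for NVMe drives; the function names and the choice of which fields to flag are my own, not an official health check.

```python
import json


def load_report(path):
    """Load a report previously saved with: smartctl -j -a <device> > report.json"""
    with open(path) as handle:
        return json.load(handle)


def nvme_health_warnings(report):
    """Return a list of things worth a closer look in an NVMe smartctl report."""
    log = report.get("nvme_smart_health_information_log", {})
    warnings = []

    if log.get("critical_warning", 0) != 0:
        warnings.append("critical_warning flags are set: %s" % log["critical_warning"])
    if log.get("media_errors", 0) > 0:
        warnings.append("media_errors: %s" % log["media_errors"])
    if log.get("num_err_log_entries", 0) > 0:
        # Often harmless, but worth comparing between runs.
        warnings.append("error log entries: %s" % log["num_err_log_entries"])

    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare < threshold:
        warnings.append("available spare %s%% is below threshold %s%%" % (spare, threshold))

    if log.get("percentage_used", 0) >= 100:
        warnings.append("percentage_used has reached %s%%" % log["percentage_used"])
    return warnings
```

The long self-test itself is started with `smartctl -t long <device>`; whether that works on an NVMe drive depends on the smartmontools version and on the drive. As described above, a clean result here is no guarantee that the drive is fine.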
Oh, that’s interesting, thanks for the extensive answer!
I received a new SSD today to replace the current one and will move to KDE neon on ZFS. Once my current SSD has been replaced, I will have 2 SSDs and be able to use a ZFS mirror.
After digging a bit into this topic, I will try the 770 nevertheless and definitely use another SSD from another manufacturer as counterpart in the large slot.
I will report back in a month or two, unless the 770 dies on me the first day xD
I have it installed now. I struggled a lot with installing 24.04 with ZFS, but it seems to mostly work now.
I just need to figure out how to set up swap encryption so that I can use it for hibernation.