Oh no! That crash has eaten my Linux installation!

Has anyone else hit something like this?

I have a Framework 13" AMD with a Samsung 990 Pro NVMe stick, still on 3.03 firmware, running Debian 12/Trixie, with SMB shares mounted via CIFS tools and connected through a wired USB-C dongle.

Three times, I’ve had a hard crash when accessing files from the network – so hard that the root partition suffered corruption and needed a full fsck and, eventually, a reinstall.

I suspect that power management puts the USB network dongle to sleep but doesn’t wake it in time, so the file access fails with a horrendous crash. I hope this doesn’t happen to you – has anyone else seen this?

K3n.
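
One way to test the autosuspend suspicion from the post above is to look at the per-device `power/control` files that the Linux kernel exposes in sysfs: a value of `auto` means the device may be runtime-suspended, `on` means it is kept awake. The following is only a minimal sketch; the function names are made up for illustration, and it does not identify which entry is the network dongle.

```python
from pathlib import Path


def usb_power_controls(sysfs_root="/sys/bus/usb/devices"):
    """Return {device_name: value of power/control} for each USB sysfs entry."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        control = dev / "power" / "control"
        try:
            result[dev.name] = control.read_text().strip()
        except OSError:
            continue  # entries without a readable power/control file are skipped
    return result


def autosuspend_enabled(controls):
    """Names of devices whose power/control is 'auto' (runtime suspend allowed)."""
    return sorted(name for name, value in controls.items() if value == "auto")


if __name__ == "__main__":
    for name in autosuspend_enabled(usb_power_controls()):
        print(f"{name}: runtime autosuspend allowed (power/control = auto)")
```

If the dongle shows up with `auto`, writing `on` to that file as root keeps it awake until the next reboot or replug, which would be a cheap way to see whether the crashes stop.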

I’ve had similar things happen on my systems, but not any time recently. I think I was using an early version of the btrfs file system on my boot drive the last couple of times I saw them.

(I love btrfs, but I do NOT use or recommend it [or any checksummed file system] on your boot drive, unless you have it in RAID1 or a similar RAID mode. btrfs, at least, won’t let you even access any file that doesn’t match its checksum, and if that file is at all important to the boot process, the system can’t boot, period. At least non-checksummed file systems can usually still boot when a file is only slightly corrupted.)
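
The trade-off described above can be shown with a toy model. This is not how btrfs is implemented; it is a sketch that uses `zlib.crc32` as a stand-in checksum, and all names in it are hypothetical. The plain read hands back slightly damaged data, while the verified read refuses to return anything.

```python
import zlib


class ChecksumError(IOError):
    """Raised when stored data no longer matches its recorded checksum."""


def store(data: bytes):
    # Keep the data together with a checksum computed at write time.
    return {"data": bytearray(data), "crc": zlib.crc32(data)}


def read_plain(block):
    # Non-checksummed behaviour: return whatever is on disk, damaged or not.
    return bytes(block["data"])


def read_verified(block):
    # Checksummed behaviour: refuse to return data that fails verification.
    data = bytes(block["data"])
    if zlib.crc32(data) != block["crc"]:
        raise ChecksumError("checksum mismatch, refusing to return data")
    return data


original = b"kernel image bytes"
block = store(original)
block["data"][0] ^= 0x01  # simulate a single flipped bit on disk

plain_result = read_plain(block)  # succeeds, but the bytes are silently wrong
try:
    read_verified(block)
    verified_failed = False
except ChecksumError:
    verified_failed = True  # the damage is detected and the read fails
```

Which of the two behaviours is preferable for a boot drive is exactly what the rest of this thread argues about.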

I had two of these hard crashes about a week ago, and yes, one of them left the machine unbootable (but salvageable by booting from a flash drive and running fsck.ext4), and the other ate my Chromium profile and one git repository (which left me with one directory that can only be renamed, not deleted; but I can’t be bothered to reinstall…)

I had a 9-day streak without those hard crashes after updating to BIOS 3.05, so I’d recommend doing that first. But you will also need to update your AMD GPU firmware, because none of stable, Trixie, or Sid has a version recent enough to fix all known issues.

I thought I had updated my firmware weeks ago, but after one more “soft” crash yesterday (GPU crash only, i.e., the system is still running and accessible over SSH, so it does not result in FS damage), I found out I needed to update the initrd to actually make the replaced firmware load during boot, as described here:

Today I updated the “InstallingDebianOn” guide for the AMD 7040 series Framework; you can take a look at that as well:

You can disable that now and imo it’s a lot better to know you have a borked system than have it partially work.

Ah, I wasn’t aware of that. Thank you for bringing it to my attention.

I can understand your point, but I can’t agree with it. When I sit down at my desktop system to get work done, I need to get that work done. If something is wrong with a boot-drive file that prevents me from booting up the system, then I have to spend time investigating, reinstalling, and setting up the system again before I can do the work.

On my laptop, the situation is worse because I’m usually away from home. If I can’t boot it up, I can’t do anything until I get home and use a working system to get it running again. I’ve partly offset that by always carrying around a thumb drive with the “live” Ubuntu installer for the version I’m using, but that’s a poor substitute for a working system.

If either system worked, in a degraded form, there’s a good chance that I can get my work done, and then reinstall later when I have some free time.

The best of both worlds would be if btrfs would let the system (attempt to) boot, but the system would then immediately notify the user of the problem. I’m not sure if that’s offered yet; I need to look into the newer capability that you’ve brought to my attention.

There is still something wrong with your boot drive that can cause who knows what. If a file prevents booting because it throws an I/O error when it is read, it’s probably pretty important.

Great, the system boots, but it’s also just randomly crashing in the middle of the work you are trying to get done, or doing who knows what kind of data corruption in the background.

With btrfs you can still boot your garbled system, but at least you’ll know which files are damaged, so a full reinstall is a lot less likely to be required.

You can set it up like that if you want to

I have the same problem as you with a Samsung 990 Pro 4TB under Debian 12.7. The SSD crashes my system once a month with write errors, with data loss about every other time. Everything worked perfectly before with a 970 Evo Plus 2TB. Samsung refuses to acknowledge the problem: I sent the drive in for RMA, and it was shipped back to me 24 hours after they received it, without the slightest technical or commercial explanation ;-(

https://image.noelshack.com/fichiers/2024/39/4/1727378250-img-2447.jpg

https://image.noelshack.com/fichiers/2024/39/4/1727378311-signal-2024-09-05-200731-004.jpeg

I’d like to raise the counterpoint: Why risk booting a corrupted file when you could install with ZFS on your boot drive, put ZFSBootMenu in your EFI System Partition, and run something like zrepl to have automatic periodic snapshots (personally I snapshot every 15 minutes, and I keep 2 weeks of snapshots)? That way, if you run into corrupted boot files, you can either use ZFSBootMenu to roll back to a recent snapshot, or use ZFSBootMenu to run zpool status -v to get a list of the specific corrupted files and then pull a recent version of those specific files out of old snapshots. Using a non-checksummed filesystem is like using a deli slicer without gloves.

Not with the setup I described above! It is trivial and mindless to simply navigate over to the snapshot tab in ZFSBootMenu and pick a recent snapshot to roll back to.

Which is why ZFSBootMenu is so great. It’s a self-contained EFI executable that contains a minimal Linux system capable of mounting your ZFS filesystems, so not only can you do various ZFS operations like snapshot/rollback, but you can also chroot into your ZFS filesystem to use your programs / edit your files.

I suspect you’ve never had to deal with a corrupted file with that setup.

It’s quite possible that ZFS has some option I’m not aware of to get around this, but in general, a snapshot only makes a link to the files that are already on the system. There’s still only one copy of each file. If that copy gets corrupted, you’re still up the proverbial creek without the proverbial paddle.
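
The point about snapshots sharing a single copy can be illustrated with a toy copy-on-write store. This is a deliberately simplified sketch, not ZFS or btrfs internals; `ToyPool` and its methods are invented for the example. A snapshot copies only the references to blocks, so a file that has not changed since the snapshot shares its one block with the live filesystem.

```python
class ToyPool:
    """Toy copy-on-write store: files and snapshots hold block references, not data."""

    def __init__(self):
        self.blocks = {}     # block_id -> bytearray (the only real copies of data)
        self.live = {}       # filename -> block_id
        self.snapshots = {}  # snapshot name -> {filename: block_id}
        self._next_id = 0

    def write(self, name, data):
        # Copy-on-write: new data always lands in a fresh block.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = bytearray(data)
        self.live[name] = block_id

    def snapshot(self, snap_name):
        # A snapshot copies the reference table only, never the data blocks.
        self.snapshots[snap_name] = dict(self.live)

    def read(self, name, snap_name=None):
        table = self.live if snap_name is None else self.snapshots[snap_name]
        return bytes(self.blocks[table[name]])

    def corrupt_live(self, name):
        # Simulate on-disk damage to the block the live file points at.
        self.blocks[self.live[name]][0] ^= 0xFF


pool = ToyPool()
pool.write("vmlinuz", b"kernel v1")
pool.write("notes.txt", b"draft 1")
pool.snapshot("hourly-1")
pool.write("notes.txt", b"draft 2")  # changed after the snapshot: gets a new block
pool.corrupt_live("vmlinuz")
pool.corrupt_live("notes.txt")

# Unchanged file: the snapshot points at the same damaged block.
unchanged_file_recoverable = pool.read("vmlinuz", "hourly-1") == b"kernel v1"
# Changed file: the snapshot still references the older, intact block.
changed_file_recoverable = pool.read("notes.txt", "hourly-1") == b"draft 1"
```

In this model the snapshot rescues the file that was rewritten after the snapshot, but not the one that never changed, which is the case being raised here.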

The only way around that problem is to have two separate physical drives, with a RAID1-like configuration. I’m using that on a NAS server, and it works beautifully – but takes twice as much space, and obviously requires a second physical device. Easily possible on a desktop, less easily on a laptop, and depending on how much you’re willing to spend on storage, it can be painful as well.

Ah well, if we’re talking about a file that never changes (so we only have one copy despite snapshots), and staying on one SSD, you can always run zfs set copies=2 <dataset> on any important datasets so you have two (or more if you’d like) copies of those files on your disk. If all of them got corrupted simultaneously, then we’re likely talking about a dead drive that you couldn’t boot from with a non-checksumming filesystem either.
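
Why extra copies plus checksums help can be sketched as follows. This is a conceptual model only, with `zlib.crc32` standing in for the real checksum and hypothetical function names; it does not reflect how ZFS lays out or repairs data on disk. A read tries each copy, returns the first one that verifies, and rewrites any damaged copies from it.

```python
import zlib


def store_redundant(data: bytes, copies: int = 2):
    # One checksum, several independent copies of the same data.
    return {
        "crc": zlib.crc32(data),
        "copies": [bytearray(data) for _ in range(copies)],
    }


def read_with_repair(record):
    # Find the first copy that still matches the recorded checksum.
    good = None
    for copy in record["copies"]:
        if zlib.crc32(bytes(copy)) == record["crc"]:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("all copies failed verification")
    # Overwrite any damaged copies with the verified data.
    for i, copy in enumerate(record["copies"]):
        if bytes(copy) != good:
            record["copies"][i] = bytearray(good)
    return good


record = store_redundant(b"initramfs contents", copies=2)
record["copies"][0][3] ^= 0x10  # damage the first copy only

recovered = read_with_repair(record)  # served from the intact second copy
```

Note that in ZFS, changing the `copies` property only affects data written after the change, so files that already exist would need to be rewritten to gain the extra copy.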

That’s the option I wasn’t aware of. Looks like BTRFS has a similar option (dup), though there’s a warning about it on SSDs that I presume would apply to ZFS as well.