You still don’t have the firmware updated properly. Assuming it’s put into the filesystem properly, maybe it’s included in your initramfs and you forgot to rebuild it?
2024-08-31 17:18:16,223 DEBUG: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
2024-08-31 17:18:16,223 DEBUG: amdgpu 0000:c1:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
2024-08-31 17:18:16,223 DEBUG: firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
2024-08-31 17:18:16,223 DEBUG: amdgpu 0000:c1:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
2024-08-31 17:18:16,223 DEBUG: amdgpu 0000:c1:00.0: Direct firmware load for amdgpu/gc_11_0_1_mes_2.bin failed with error -2
2024-08-31 17:18:16,223 DEBUG: [drm] try to fall back to amdgpu/gc_11_0_1_mes.bin
2024-08-31 17:18:16,223 DEBUG: amdgpu 0000:c1:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes.bin
2024-08-31 17:18:16,223 DEBUG: amdgpu 0000:c1:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes1.bin
When you’ve done it properly that script won’t complain anymore.
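For anyone reading along, here is a minimal sketch (not part of the s2idle script) of how log lines like the ones above could be summarized. The function name `summarize_firmware` and both regular expressions are made up for illustration and only match the exact message wording shown in this log. The `-2` in those messages is `-ENOENT`, meaning the kernel could not find the file.

```python
import errno
import re

# Patterns for the two message shapes seen in the log above (assumed wording).
FAILED = re.compile(r"firmware: failed to load (\S+) \((-?\d+)\)")
LOADED = re.compile(r"firmware: direct-loading firmware (\S+)")


def summarize_firmware(log_lines):
    """Return (failed, loaded) for an iterable of kernel log lines.

    failed maps a firmware path to its errno name (e.g. "ENOENT");
    loaded lists firmware paths that loaded, in first-seen order.
    """
    failed = {}
    loaded = []
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            code = int(match.group(2))
            # The kernel reports negative errno values, so -2 is ENOENT.
            failed[match.group(1)] = errno.errorcode.get(-code, str(code))
            continue
        match = LOADED.search(line)
        if match and match.group(1) not in loaded:
            loaded.append(match.group(1))
    return failed, loaded
```

A file that shows up in `failed` but whose fallback shows up in `loaded`, as with `gc_11_0_1_mes_2.bin` here, suggests the newer firmware file is missing from wherever the driver looks at load time, which may be the initramfs rather than the root filesystem.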
And now the script runs without complaint! But the computer still takes over half a minute to wake. I think that is a few seconds worse than when I started, not sure. (And the draw on the battery when unplugged is as bad, maybe a little worse than when I started, not sure.)
Grrr.
-kb, the Kent who feels like he is making progress, but on the wrong axis.
If you don’t have any sort of passwords set in the firmware, the next thing I would suggest you do is check your NVMe firmware version against the latest that is present on the manufacturer’s website. Most manufacturers don’t publish firmware updates for their disks for Linux, unfortunately.
Since you’re seeing a page fault from the NVMe disk, some workarounds you can experiment with in the interim are either turning off the IOMMU (amd_iommu=off on the kernel command line) or putting it in passthrough mode (iommu=pt on the kernel command line).
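After editing the boot configuration it is easy to be unsure whether the option actually took effect. A small sketch of a check, assuming only the two parameters named above; the helper name `iommu_settings` is hypothetical:

```python
def iommu_settings(cmdline):
    """Return the IOMMU-related key=value options in a kernel command line.

    Only the two parameters mentioned in this thread are recognized:
    "iommu" and "amd_iommu". Everything else is ignored.
    """
    settings = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("iommu", "amd_iommu"):
            settings[key] = value
    return settings
```

On a running Linux system you would pass it the contents of `/proc/cmdline`, for example `iommu_settings(open("/proc/cmdline").read())`. An empty result means neither workaround is active.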
@Kent_Borg There’s a thread here for updating WD SN850X NVMe drives without WD’s official Windows FW update tool… Apparently somebody even wrote a Python tool for updating the FW under Linux (which i haven’t used - ymmv)… all in the thread linked.
I’m suspicious that I have a bad component. (I’ve had my /boot partition get corrupted twice. Yes, I have been messing with grub stuff at the time, so maybe I messed it up, that is why I am suspicious and not certain. smartctl -a /dev/nvme0 doesn’t show any obvious errors. I am running btrfs for / and /boot and when I scrub I get no crc errors.)
Yes, i have the same drive with the same firmware rev.
Rather than fiddling with a system whose current state one cannot properly determine, i’d much rather try something more recent than Debian with cherry-picked backports, e.g. a clean Fedora installation or one of the Arch Linux derivatives, or try swapping the NVMe drive, if you can.
Because, as @Mario_Limonciello pointed out above, the IOMMU errors and the resetting of the NVMe controller are surely not conducive to the process and should be alarming imo, even without suspend/resume cycle issues.
@Kent_Borg Well, what can i tell you… i have the same drive (WD_BLACK SN850X 2000GB) with the same firmware (620361WD) and i don’t have these (lacking nvme s2idle support & nvme controller reset) issues, so perhaps worth a try…
Edit: wrote that already… apparently, i need some sleep…
I meant trying any other compatible NVMe drive, running the above checks (s2idle script) again, and comparing results as a process of elimination…
So I bought a new SSD, a 2TB Crucial, and it works. Not as fast as the WD but cheaper. Now wake from suspend is fast.
I also took the opportunity to install Debian testing/Trixie on the new SSD, and so far, after a few days of futzing I am close to declaring victory. It is looking good. Not sure Trixie is any better than Debian 12, but it looks no worse.
Next: Once I am sure I have everything I might want off the old SSD, figure out how to get WD to replace the old one, then move everything across to the replacement and if the replacement SSD works, put the Crucial into external backup service.
Thanks to all for so much patient help on this.
-kb, the Kent who would click a “Solution” box, but he doesn’t have such power.
I was using btrfs on the bad SSD, and because btrfs does CRC checks of stored data and metadata, the odds are good that the data I have copied off is all good.
Some stuff I have read makes it sound like the CRC feature is only useful with redundant disk arrays that can self-heal, but in this case it is so nice to know for the weeks I was using it the defective WD SSD didn’t rot my bits. (Though the fact I wasn’t getting any CRC errors did make me slow to suspect the WD SSD was bad.)
Unless you deliberately changed some setting, even with a single disk it would have been pretty obvious if you had data errors when copying off; you just could not heal them.
You could set it up to keep multiple copies on the same drive to recover from errors, obviously at the cost of taking up more space, or you could enable the mount parameter that lets you read past checksum errors. But if you have not done that and got a full copy without any I/O errors, you should be fine.
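To illustrate the detect-versus-heal distinction, here is a toy sketch. It is not how btrfs is implemented: `store` and `read` are made-up helpers, and `zlib.crc32` stands in for the crc32c checksum that btrfs uses by default. With one copy a mismatch can only be reported; with a second good copy the bad one can be repaired.

```python
import zlib


def store(data):
    """Keep a block of data together with the checksum computed at write time."""
    return {"data": bytearray(data), "csum": zlib.crc32(data)}


def read(block, mirror=None):
    """Return the block's data if its checksum matches.

    On a mismatch, fall back to an optional second copy and repair the
    first one from it; with no good copy available, raise an I/O error.
    """
    if zlib.crc32(bytes(block["data"])) == block["csum"]:
        return bytes(block["data"])
    if mirror is not None and zlib.crc32(bytes(mirror["data"])) == mirror["csum"]:
        block["data"][:] = mirror["data"]  # heal the bad copy from the good one
        return bytes(block["data"])
    raise OSError("checksum mismatch and no good copy to heal from")
```

In this toy model a single flipped bit makes `read(block)` raise instead of silently returning bad data, which corresponds to the I/O error you would have seen while copying off a single disk.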