Wake up From Suspend…Tricky

That package is out of date and it causes major issues on Phoenix systems. I’ve asked and dozens of other people have asked. Debian isn’t updating it in stable for whatever reason.

I already grabbed some AMD firmware from a testing .deb, let me try the same with the .deb for firmware-amd-graphics.

-kb

So that had a lot of, well, interactions with other stuff. Looked like it was sharing directories with other packages, so I tried having apt do it for me. I added Trixie/testing/Debian 13 to the apt sources, and let it install it!

But shortly after I did that my keyboard died. Rebooting didn’t help. Luckily an external USB keyboard would still work, so I used that to put things back to the way they were.

-kb

All you need is the amdgpu binaries, you don’t need anything else.

You can get it from upstream.

Dumb question time: Where do I put them? This machine has more than one firmware directory.

ACTUALLY, I think there is only one amdgpu directory. I’ll put them there.

Thanks,

-kb

Wake is still slow. (And power consumption on battery still high…)

New https://www.borg.org/s2idle_report-2024-08-31.txt

Thanks,

-kb

You still don’t have the firmware updated properly. Assuming it’s put into the filesystem properly, maybe it’s included in your initramfs and you forgot to rebuild it?

2024-08-31 17:18:16,223 DEBUG:	amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
2024-08-31 17:18:16,223 DEBUG:	amdgpu 0000:c1:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
2024-08-31 17:18:16,223 DEBUG:	firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
2024-08-31 17:18:16,223 DEBUG:	amdgpu 0000:c1:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
2024-08-31 17:18:16,223 DEBUG:	amdgpu 0000:c1:00.0: Direct firmware load for amdgpu/gc_11_0_1_mes_2.bin failed with error -2
2024-08-31 17:18:16,223 DEBUG:	[drm] try to fall back to amdgpu/gc_11_0_1_mes.bin
2024-08-31 17:18:16,223 DEBUG:	amdgpu 0000:c1:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes.bin
2024-08-31 17:18:16,223 DEBUG:	amdgpu 0000:c1:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes1.bin

When you’ve done it properly that script won’t complain anymore.
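
For anyone who wants to check a saved report without reading every line, here is a minimal sketch that lists the firmware files the kernel reported as failing to load. The function name `missing_firmware` is made up for illustration, and the regular expression only assumes the "failed to load" message format visible in the log excerpt above.

```python
import re

# Matches kernel messages such as:
#   amdgpu 0000:c1:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
FAILED_LOAD = re.compile(r"firmware: failed to load (\S+) \((-?\d+)\)")


def missing_firmware(log_text):
    """Return the sorted, de-duplicated firmware paths that failed to load."""
    missing = set()
    for line in log_text.splitlines():
        match = FAILED_LOAD.search(line)
        if match:
            missing.add(match.group(1))
    return sorted(missing)
```

Feeding it the text of a report (for example `missing_firmware(open("report.txt").read())`, where `report.txt` is a placeholder file name) should give an empty list once the firmware and initramfs are in order.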

Okay: dpkg-reconfigure linux-image-6.1.0-25-amd64.

And now the script runs without complaint! But the computer still takes over a half-minute to wake. I think a few seconds worse than when I started, not sure. (And draw on the battery when unplugged is as bad, maybe a little worse than when I started, not sure.)

Grrr.

-kb, the Kent who feels like he is making progress, but on the wrong axis.

P.S. https://www.borg.org/s2idle_report-2024-09-01.txt

Is this a SED or do you have a BIOS password set in firmware?

I noticed an nvme page fault.

2024-09-01 08:24:18,648 DEBUG: nvme 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x58e5e000 flags=0x0000]

If you have a BIOS password set for your storage please turn it off.

I have an admin password that it is supposed to ask for on power up. (Sometimes it does not, however.)

I’ll try turning it off.

Thanks,

-kb

Turns out I reset the BIOS recently and do not have any passwords set now.

I tried with these firmware files and the 6.10.7 kernel I built a couple days ago. Same slow wake up (and same power consumption).

https://www.borg.org/s2idle_report-2024-09-01_on_6.10.7.txt

-kb

Your slow wake up is definitely caused by the NVME not coming back properly.

2024-09-01 17:24:08,564 DEBUG: nvme 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x58e5e000 flags=0x0000]

2024-09-01 17:24:08,564 DEBUG: nvme nvme0: 12/0/0 default/read/poll queues
2024-09-01 17:24:08,564 DEBUG: nvme nvme0: resetting controller due to AER
2024-09-01 17:24:08,564 DEBUG: nvme nvme0: Identify namespace failed (-4)
2024-09-01 17:24:08,564 DEBUG: nvme nvme0: 12/0/0 default/read/poll queues
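
The same symptoms can be counted across a whole report with a small script. This is only a sketch: the function name `nvme_warnings` and the category labels are invented here, and the patterns are based solely on the message wording in the lines quoted above.

```python
import re

# Patterns taken from the wording of the log lines quoted in this thread.
PATTERNS = {
    "io_page_fault": re.compile(r"nvme \S+: AMD-Vi: Event logged \[IO_PAGE_FAULT"),
    "controller_reset": re.compile(r"nvme nvme\d+: resetting controller"),
    "identify_failed": re.compile(r"nvme nvme\d+: Identify namespace failed"),
}


def nvme_warnings(log_text):
    """Count how many log lines match each NVMe trouble pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A healthy resume should leave every count at zero.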

If you don’t have any sort of passwords set in the firmware the next thing I would suggest you do is check your NVME firmware version against the latest that is present on the manufacturer’s website. Most manufacturers don’t publish firmware updates for their disks for Linux unfortunately.

Since you’re seeing a page fault from the NVME disk, in the interim some workarounds you can experiment with to see if they help are either turning off the IOMMU (amd_iommu=off on kernel command line) or putting it in passthrough mode (iommu=pt on kernel command line).
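
After rebooting with one of those options, it is worth confirming that the kernel actually saw it. A minimal sketch, assuming the running kernel's command line is read from `/proc/cmdline`; the helper name `iommu_settings` is hypothetical.

```python
def iommu_settings(cmdline):
    """Extract amd_iommu= and iommu= values from a kernel command line string."""
    wanted = {"amd_iommu", "iommu"}
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = value  # a later duplicate overrides an earlier one
    return found
```

On a live system, `iommu_settings(open("/proc/cmdline").read())` should return something like `{'iommu': 'pt'}` if the passthrough option took effect, or an empty dict if neither option is set.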

@Kent_Borg There’s a thread here for updating WD SN850X NVMe drives without WD’s official Windows FW update tool… Apparently somebody even wrote a python tool for updating the FW under linux (which i haven’t used - ymmv)… all in the thread linked.

Wow! I am impressed.

Working my way through the twisty passages it seems I need firmware <fwversion>620361WD</fwversion> but it seems I am already on that:

root@theseion:/home/kentborg# cat /sys/class/nvme/nvme0/firmware_rev
620361WD
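
The same check can be scripted. This is a sketch only: it assumes the sysfs path shown in the command above, and the helper names `read_firmware_rev` and `firmware_is_current` are made up for illustration.

```python
from pathlib import Path


def read_firmware_rev(controller="nvme0", sysfs_root="/sys/class/nvme"):
    """Read the firmware revision string the kernel exposes for an NVMe controller."""
    return (Path(sysfs_root) / controller / "firmware_rev").read_text().strip()


def firmware_is_current(installed, latest):
    """Compare two revision strings, ignoring surrounding whitespace."""
    return installed.strip() == latest.strip()
```

For example, `firmware_is_current(read_firmware_rev(), "620361WD")` compares the running drive against the version string found in the vendor's update metadata.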

I’m suspicious that I have a bad component. (I’ve had my /boot partition get corrupted twice. Yes, I have been messing with grub stuff at the time, so maybe I messed it up, that is why I am suspicious and not certain. smartctl -a /dev/nvme0 doesn’t show any obvious errors. I am running btrfs for / and /boot and when I scrub I get no crc errors.)

-kb

Yes, i have the same drive with the same firmware rev.

Rather than fiddling with a system where one cannot determine its current state properly, i’d much rather try something more recent than debian with cherry-picked backports, e.g. a clean Fedora installation or one of the Arch Linux derivatives, or try swapping the nvme drive, if you can.

Because, as @Mario_Limonciello pointed out above, the IOMMU errors and the resetting of the nvme controller are surely not conducive for the process and should be alarming imo, even without suspend/resume cycle issues.

Swapping to a different model (I thought I got the same model Framework sells) or a different instance (because mine might be bad)?

-kb

@Kent_Borg Well, what can i tell you… i have the same drive (WD_BLACK SN850X 2000GB) with the same firmware (620361WD) and i don’t have these (lacking nvme s2idle support & nvme controller reset) issues, so perhaps worth a try…

Edit: wrote that already… apparently, i need some sleep…

I meant trying any other compatible nvme drive and doing the above checks (s2idle script) again and comparing results as a process of elimination…

So I bought a new SSD, a 2TB Crucial, and it works. Not as fast as the WD but cheaper. Now wake from suspend is fast.

I also took the opportunity to install Debian testing/Trixie on the new SSD, and so far, after a few days of futzing I am close to declaring victory. It is looking good. Not sure Trixie is any better than Debian 12, but it looks no worse.

Next: Once I am sure I have everything I might want off the old SSD, figure out how to get WD to replace the old one, then move everything across to the replacement and if the replacement SSD works, put the Crucial into external backup service.

Thanks to all for so much patient help on this.

-kb, the Kent who would click a “Solution” box, but he doesn’t have such power.

I was using btrfs on the bad SSD, and because btrfs does CRC checks of stored data and metadata the odds are good that the data I have copied off is all good.

Some stuff I have read makes it sound like the CRC feature is only useful with redundant disk arrays that can self-heal, but in this case it is so nice to know that, for the weeks I was using it, the defective WD SSD didn’t rot my bits. (Though the fact I wasn’t getting any CRC errors did make me slow to suspect the WD SSD was bad.)

-kb, the Kent who has become a fan of btrfs.

Unless you deliberately changed some setting, even with a single disk it would have been pretty obvious if you had data errors when copying off; you just could not heal them.

You could set it up to have multiple copies on the same drive to recover from errors, obviously at the cost of stuff taking up more space, or you could enable the mount parameter to let you read over checksum errors. But if you have not done that and got a full copy without any io errors you should be fine.