[RESPONDED] SSD Does not Wake up After Suspend - AMD

Ablomme · January 30, 2024, 8:05am

Hi everyone,

Ever since I got my laptop I have been facing a weird issue: occasionally, waking from suspend does not properly wake up the SSD.
This causes the filesystem to become read-only, the DE starts showing visual bugs, programs will not run, and shutting down does not work, so a hard reset is the only way out.

I am running Arch with kernel 6.7.2, BIOS 3.0.3.
My SSD is the SK Hynix P41 Platinum.

I’m not sure if this issue is specific to Framework, or if it’s an issue with my SSD. I would like to know if anyone else is also facing similar issues.

The weird thing is that this issue does not happen consistently: suspend works the vast majority of the time.

journalctl only displays the following entry:

Jan 29 12:04:58 PC systemd-sleep[7851]: Performing sleep operation 'suspend'...

This is the last entry from the boot, so obviously logs are missing since the SSD is not operational until I reset.

I have found a similar post here in 2022 which I believe is the same issue, but occurring on a Sabrent SSD and Intel 11th gen: [TRACKING] Resuming from deep sleep fails on Linux (SSD unresponsive?)

Additionally, I have found a solution on the Arch wiki about what I believe is the same problem: Solid state drive/NVMe - ArchWiki

Which says to add

amd_iommu=off

to the kernel parameters. I know there have been some issues with the IOMMU, so I will be doing this and I’ll report back if it is still happening. I am just creating this post to see if anyone else is experiencing similar issues.

Matt_Hartley · January 30, 2024, 11:28pm

We have not actively tested this against Arch; however, most Arch threads dealing with kernel issues seem to land on the same advice: use linux-lts.

See if it happens there as well.

lbkNhubert · January 31, 2024, 1:56am

Same drive, Arch on 11th gen Intel, kernel 6.7.2-arch1-1, kernel parameter nvme.noacpi=1 (apparently not to be used on AMD), no issues. So this seems possibly to be AMD and distro related.

jared_kidd · January 31, 2024, 4:58am

Arch is also my preferred Linux distro and I really hope we don’t have similar issues on the 16. I’d much rather Framework fix buggy firmware instead of having to work around it with kernel options and module blacklisting.

That said, one of the reasons more popular distros like Ubuntu appear to work better is because they include a vast majority of these buggy firmware workarounds out of the gate. Take a look at all the modules blacklisted by default on an Ubuntu system for example.

I wish Framework tested on the more “pure” distros (like Debian, Arch, etc.) so that potential issues aren’t hidden by existing workarounds.

GhostLegion · January 31, 2024, 3:17pm

There’s always Fedora. My understanding is they have pretty strict guidelines on what can be included out of the box.

Mario_Limonciello · January 31, 2024, 3:43pm

I think you need to find out what kernel messages are coming up when this happens to confirm what’s going on.

You can try to run dmesg -w in a terminal window while you suspend. Then when you wake up again hopefully the messages you need to see will be visible even if the disk didn’t come back.

If you haven’t looked already, see if your SSD has a firmware update available. Unfortunately most SSD manufacturers don’t publish their updates to LVFS, so you might need to update it under Windows or with their proprietary tooling.

With the Framework 13 AMD? I suggest you two compare SSD firmware versions.

GhostLegion · January 31, 2024, 3:46pm

Funny, I don’t know how I missed that in the title. I’ll be deleting my earlier post now.

Mario_Limonciello · January 31, 2024, 3:49pm

No worries.

@Matt_Hartley
Is this SSD model in the list that Framework tests? If not, there is always a possibility of a compatibility problem too.

Matt_Hartley · January 31, 2024, 6:52pm

@GhostLegion this is excellent advice and should shed some light.

I am waiting to hear back on this; however, we generally live in a WD Black world here. I seem to recall seeing issues with Hynix (last year, I believe), but I may be remembering incorrectly.

Once I have a final yea or nay that we know for sure, I will update here.

halemmerich · January 31, 2024, 7:11pm

I use that SSD with Arch on the AMD 13 and do not see this problem. fwupdmgr shows Current version: 51060A20 but I have no idea how that relates to the actual firmware version on the SSD. Currently on 6.7.2-zen1-1-zen, but I have been using this SSD from the start and have used most kernel versions in the Arch repos since then.

Matt_Hartley · January 31, 2024, 7:11pm

Heard back from engineering. Tested on FW 16, not the AMD 13 inch. Likely fine, however, completely untested on your laptop model. But per engineering, should be okay.

Matt_Hartley · January 31, 2024, 7:12pm

This is a good jumping off point.

Ablomme · January 31, 2024, 11:17pm

Same firmware version as me, but I’m not using the zen kernel. May I ask if you are using LUKS?

As an update: I found multiple forum threads (including one from frame.work) describing the same issue, with many giving advice to run fstrim. I thought I had trim working, but it turns out LUKS does not allow trim unless you pass a kernel parameter to explicitly allow it (I guess for some privacy reason). I have since fixed this, but only 35GB was trimmed, so I’m not sure if this was the issue. I’ll report back if it happens again and I’ll run dmesg.

edit: Turns out it actually trimmed around 1.5 TB.
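
For anyone wanting to check the same thing, here is a minimal Python sketch that inspects a kernel command line for the two discard options I know of from the Arch wiki: `allow-discards` in the `cryptdevice=` parameter (encrypt hook) and `discard` in `rd.luks.options=` (sd-encrypt hook). The function name `luks_discard_enabled` is made up for illustration, and the sketch assumes discards are configured on the command line; it does not look at /etc/crypttab or at how the mapping was actually opened.

```python
def luks_discard_enabled(cmdline: str) -> bool:
    """Return True if the kernel command line appears to allow discards through LUKS.

    Hypothetical helper: it only recognises the two command-line forms
    described above and ignores /etc/crypttab.
    """
    for param in cmdline.split():
        key, _, value = param.partition("=")
        if key == "cryptdevice":
            # encrypt hook syntax: cryptdevice=device:dmname:options
            fields = value.split(":")
            if len(fields) >= 3 and "allow-discards" in fields[-1].split(","):
                return True
        elif key in ("rd.luks.options", "luks.options"):
            # sd-encrypt syntax: rd.luks.options=opts or rd.luks.options=UUID=opts
            for opt in value.split(","):
                if opt == "discard" or opt.endswith("=discard"):
                    return True
    return False


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

Calling `luks_discard_enabled(read_running_cmdline())` on the affected machine gives a quick yes/no before running fstrim.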

Thank you everyone for your comments!

halemmerich · February 1, 2024, 5:52pm

Yes, it’s a simple GPT layout without LVM. Just LUKS on 2 partitions and btrfs/swap in those encrypted partitions. I do not have trim/discard enabled currently but had no problems with that in the past.
The security implications of trim are probably mainly that an attacker could detect free space which looks different than the encrypted actual data. So there can be some deductions made on what file system is used and maybe file sizes. There may be other stuff, but for personal use I have no problem with trim.

If you have trimmed 1.5T, how full is your SSD? Mine is 2T and about 1.6T is currently used.

Ceremony · February 2, 2024, 2:55am

My previous SSD also ran into this issue, though it happened 100% of the time: [RESPONDED] NVMe is lost after resuming from sleep FW13 AMD

Since then I got a new SSD which works just fine, even after suspending multiple times. Never got the previous to work after suspending it, though.

Ablomme · February 3, 2024, 10:30pm

About 500GB full. Turns out fstrim trims all unused space regardless of if it has already been trimmed on a different boot, so that is why it is 1.5TB.

I cannot replicate the error anymore, so everything seems good so far. I have re-enabled iommu and no issues. I’ll report back if I ever experience this issue again, but so far so good.

Ablomme · March 15, 2024, 8:35am

update: this issue has happened twice since my last comment.
Once it happened without ever going into suspend.
All times it seems to only happen while running a virtual machine (I don’t know if it is to blame or not).

When the issue happened, I could not open a terminal in order to run dmesg; it hung. Even trying to switch to a different virtual terminal did not work; it hung and I had to do a hard power-off. I am now keeping dmesg -w running at all times just in case.
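
A rough Python sketch of that "keep it running" idea: follow the kernel log and print only NVMe-related lines to the terminal, so nothing has to be written to the disk that may have gone away. The function names are hypothetical, the filter is a plain substring match, and reading dmesg may need root depending on the kernel's dmesg_restrict setting. It has to be started before the problem occurs, since a new terminal could not be opened afterwards.

```python
import subprocess
from typing import Iterable, Iterator


def nvme_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the kernel log lines that mention nvme (case-insensitive)."""
    for line in lines:
        if "nvme" in line.lower():
            yield line.rstrip("\n")


def follow_dmesg() -> None:
    """Follow the kernel log via `dmesg --follow` and print NVMe lines as they arrive.

    Blocks until interrupted; output goes to the terminal only.
    """
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    assert proc.stdout is not None
    for line in nvme_lines(proc.stdout):
        print(line, flush=True)
```

Run `follow_dmesg()` in a terminal that stays open across suspend and resume.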

Ablomme · May 5, 2024, 1:45am

I can confirm that the logs are identical to those shown in the Arch Wiki article. I’ll try turning off the IOMMU.
The very first log is:

nvme nvme0: controller is down; will reset; CSTS=0x3, PCI_STATUS=0x10

I did some investigation in the kernel source code, and it looks like the controller status (CSTS) has the Controller Fatal Status (CFS) bit set (0x2).

When I search for “controller is down” on Google, the first link is this page: Nvme0: controller is down; will reset. So it is perhaps related to Framework, or perhaps related to the SSD. If I can’t fix it I’ll just buy a new SSD.
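
To make the bit arithmetic concrete, here is a small Python sketch that pulls the two register values out of a log line in the format above and decodes the low CSTS bits. The bit positions follow the NVMe CSTS layout (bit 0 RDY, bit 1 CFS, bits 3:2 SHST); the helper names are made up for illustration and the regex assumes the exact message format quoted here.

```python
import re
from typing import Optional, Tuple

CSTS_RDY = 0x1        # bit 0: controller ready
CSTS_CFS = 0x2        # bit 1: controller fatal status
CSTS_SHST_SHIFT = 2   # bits 3:2: shutdown status
CSTS_SHST_MASK = 0x3


def parse_controller_down(line: str) -> Optional[Tuple[int, int]]:
    """Return (CSTS, PCI_STATUS) from a 'controller is down' line, or None."""
    match = re.search(r"CSTS=(0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)", line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def decode_csts(csts: int) -> dict:
    """Split a CSTS value into its RDY, CFS and SHST fields."""
    return {
        "RDY": bool(csts & CSTS_RDY),
        "CFS": bool(csts & CSTS_CFS),
        "SHST": (csts >> CSTS_SHST_SHIFT) & CSTS_SHST_MASK,
    }
```

For CSTS=0x3 this gives RDY and CFS both set with SHST at 0, i.e. the controller still reports ready but has flagged a fatal error.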

Mario_Limonciello · May 5, 2024, 1:57am

Try looking for an SSD firmware update.

Ablomme · August 6, 2024, 2:25am

Replaced the SSD with a 990 Pro months ago, and have not experienced any issues since.