[RESPONDED] NVMe is lost after resuming from sleep FW13 AMD

Ceremony · November 18, 2023, 3:12pm

I cannot get suspend to work properly, or rather, resuming from it:
After waking, the NVMe device is lost, and the system then freezes, crashes, or panics.

It can be replicated on a live distro as well, where the nvme just disappears and remains unreachable.

I have tried the following kernel boot parameters to mitigate the issue, but none of them worked:

nvme.noacpi=1
rtc_cmos.use_acpi_alarm=1
mem_sleep_default=deep
pcie_aspm=disabled
acpiphp.disabled=1
nvme_core.default_ps_max_latency_us=10000

The arch wiki recommends s3 sleep instead of s2idle when running into this issue, but when checking /sys/power/mem_sleep, only s2idle is available. Is s3 sleep unavailable on AMD frameworks?
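For anyone who wants to run the same check on their own machine, here is a minimal sketch. It assumes the usual kernel format for /sys/power/mem_sleep, where every supported mode is listed and the active one is wrapped in square brackets (for example `s2idle [deep]`). The function name `parse_mem_sleep` is made up for illustration.

```python
from pathlib import Path


def parse_mem_sleep(text):
    """Return (available_modes, active_mode) from the contents of /sys/power/mem_sleep."""
    modes = []
    active = None
    for token in text.split():
        # The kernel marks the currently selected mode with [brackets].
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return modes, active


if __name__ == "__main__":
    path = Path("/sys/power/mem_sleep")
    if path.exists():
        modes, active = parse_mem_sleep(path.read_text())
        print("available:", modes, "active:", active)
        if "deep" not in modes:
            # "deep" is the kernel's name for S3 suspend-to-RAM.
            print("S3 (deep) is not offered here, so mem_sleep_default=deep has nothing to select.")
    else:
        print("/sys/power/mem_sleep not found on this system.")
```

If only `s2idle` shows up in the list, the platform does not expose S3 at all, and no boot parameter can select it.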

The NVMe in question is an (old and crappy) 1TB ADATA SX6000. Suspend worked fine with the same drive in my older AMD laptop (featuring a 3500U).

P.S. hibernate works fine

Matt_Hartley · November 20, 2023, 8:31pm

Hi @Ceremony, we see a number of posts here about issues with Arch with AMD Ryzen 7040 series.

And while we don’t officially support Arch:

s2idle is the only suspend mode available, so what you are seeing is correct. Going through the parameters:

YES: Prevents waking after five minutes:
rtc_cmos.use_acpi_alarm=1

NO: This will cause major problems for suspend on 7040 Series.
nvme.noacpi=1

The rest of them are not needed.

My guess is you had nvme.noacpi=1 set, and were trying to address issues caused by it with the other parameters.
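To double-check which of these parameters are actually active on a running system, something like the following sketch could be used. It reads /proc/cmdline and sorts the parameters according to the advice in this post only; the groupings are not an official list, and the function name `classify_cmdline` is hypothetical.

```python
from pathlib import Path

# Groupings follow the advice given in this thread for Ryzen 7040 series.
# They are an assumption for illustration, not an official compatibility list.
HARMFUL = {"nvme.noacpi=1"}
HELPFUL = {"rtc_cmos.use_acpi_alarm=1"}
UNNEEDED_PREFIXES = (
    "mem_sleep_default=",
    "pcie_aspm=",
    "acpiphp.disabled=",
    "nvme_core.default_ps_max_latency_us=",
)


def classify_cmdline(cmdline):
    """Sort kernel command-line parameters into harmful / helpful / unneeded."""
    result = {"harmful": [], "helpful": [], "unneeded": []}
    for param in cmdline.split():
        if param in HARMFUL:
            result["harmful"].append(param)
        elif param in HELPFUL:
            result["helpful"].append(param)
        elif param.startswith(UNNEEDED_PREFIXES):
            result["unneeded"].append(param)
    return result


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        for group, params in classify_cmdline(path.read_text()).items():
            print(f"{group}: {params}")
```

Parameters that fall into none of the three groups (such as `quiet` or `root=...`) are simply ignored.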

Ceremony · November 20, 2023, 10:00pm

I have tested just the rtc_cmos.use_acpi_alarm=1 param without any of the other options, but the NVMe still disappeared after sleep, so nothing really helped, unfortunately.
I’ll try an Ubuntu 22.04 live distro on a USB stick tomorrow to see whether that makes any difference and to rule out an Arch compatibility issue.

Also, I ordered a new PCIe drive for my Framework (a Kingston KC3000 2TB), so maybe the issue disappears with the new drive and we can narrow it down to an incompatible drive. As for how we go from there, we’ll see!

Ceremony · November 21, 2023, 11:32am

Here is the update:
Ubuntu 22.04 live distro also loses my ADATA SX6000 after sleeping.

In the meantime, my new SSD, a Kingston KC3000, arrived, and it was able to sleep just fine within Ubuntu. I am installing Arch now and will see how that goes…

So the issue is definitely with the SX6000. How do you want to proceed with this?

To me, this is no longer an issue, as I won’t be using the problematic drive with my Framework, but others might stumble into this issue with similar drives… So do you wanna continue to troubleshoot it and fix the issue? or just pray that nobody else uses this kind of nvme?

Matt_Hartley · November 21, 2023, 7:34pm

I think going with the new drive is the path forward. I suspect there is a conflict with the older drive somewhere that ventures into firmware territory. Sticking with the Kingston KC3000 is my recommendation.

Ceremony · November 21, 2023, 8:35pm

That was the plan

Kellerkind · May 1, 2024, 10:13am

I have a similar issue when resuming from suspend with a Master Password set on my NVMe.
It needs a reset on the PCIe bus to properly come back.

The 3.05 BIOS changelog also mentions a fix for a total system hang when resuming from s2idle with a password set.

Removing the password also resolves this issue.
Will attach a dmesg snip in a minute.

dmesg when password is not set:

[ 3814.286698] PM: suspend entry (s2idle)
[ 3814.289851] Filesystems sync: 0.003 seconds
[ 3814.292098] Freezing user space processes
[ 3814.293700] Freezing user space processes completed (elapsed 0.001 seconds)
[ 3814.293705] OOM killer disabled.
[ 3814.293706] Freezing remaining freezable tasks
[ 3814.294788] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 3814.294793] printk: Suspending console(s) (use no_console_suspend to debug)
[ 3814.310286] queueing ieee80211 work while going to suspend
[ 3814.457132] ACPI: EC: interrupt blocked
[ 3824.204140] ACPI: EC: interrupt unblocked
[ 3824.445462] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[ 3824.445528] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 3824.446019] nvme nvme0: 16/0/0 default/read/poll queues
[ 3824.447998] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 3824.554245] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 3824.554738] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 3824.555071] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 3824.555075] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 3824.555077] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 3824.555079] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 3824.555081] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 3824.555082] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 3824.555084] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 3824.555086] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 3824.555088] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 3824.555090] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 3824.555092] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 3824.555094] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 3824.555096] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 3824.559635] [drm] ring gfx_32771.1.1 was added
[ 3824.560219] [drm] ring compute_32771.2.2 was added
[ 3824.560744] [drm] ring sdma_32771.3.3 was added
[ 3824.560772] [drm] ring gfx_32771.1.1 ib test pass
[ 3824.560801] [drm] ring compute_32771.2.2 ib test pass
[ 3824.560919] [drm] ring sdma_32771.3.3 ib test pass
[ 3824.612945] OOM killer enabled.
[ 3824.612950] Restarting tasks ... done.
[ 3824.615207] random: crng reseeded on system resumption
[ 3824.620270] PM: suspend exit

dmesg when password is set:

[   56.576486] PM: suspend entry (s2idle)
[   56.580129] Filesystems sync: 0.003 seconds
[   56.582471] Freezing user space processes
[   56.583804] Freezing user space processes completed (elapsed 0.001 seconds)
[   56.583808] OOM killer disabled.
[   56.583809] Freezing remaining freezable tasks
[   56.584999] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   56.585003] printk: Suspending console(s) (use no_console_suspend to debug)
[   56.597581] atkbd serio0: Disabling IRQ1 wakeup source to avoid platform firmware bug
[   56.603297] queueing ieee80211 work while going to suspend
[   57.402966] pcieport 0000:00:08.3: quirk: disabling D3cold for suspend
[   57.404657] ACPI: EC: interrupt blocked
[   62.712047] ACPI: EC: interrupt unblocked
[   99.036750] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 36403428731 wd_nsec: 36403413348
[   99.037284] nvme 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x58e5e000 flags=0x0000]
[   99.046397] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[   99.046465] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[   99.046971] nvme nvme0: 16/0/0 default/read/poll queues
[   99.047137] nvme nvme0: resetting controller due to AER
[   99.050301] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[   99.051025] nvme nvme0: Identify namespace failed (-4)
[   99.068064] nvme nvme0: 16/0/0 default/read/poll queues
[   99.158896] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[   99.159401] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[   99.159739] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[   99.159743] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   99.159746] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   99.159748] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[   99.159750] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[   99.159752] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[   99.159753] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[   99.159755] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[   99.159758] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[   99.159759] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[   99.159762] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[   99.159764] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[   99.159766] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[   99.164348] [drm] ring gfx_32771.1.1 was added
[   99.164882] [drm] ring compute_32771.2.2 was added
[   99.165426] [drm] ring sdma_32771.3.3 was added
[   99.165454] [drm] ring gfx_32771.1.1 ib test pass
[   99.165484] [drm] ring compute_32771.2.2 ib test pass
[   99.165596] [drm] ring sdma_32771.3.3 ib test pass
[   99.216524] OOM killer enabled.
[   99.216527] Restarting tasks ... done.
[   99.218478] random: crng reseeded on system resumption
[   99.223935] PM: suspend exit

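As you can see, with the password set the NVMe seems to hang on resume (note the roughly 36-second gap between timestamps 62.7 and 99.0), throws an IO_PAGE_FAULT, and the controller is reset due to AER.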
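To put a number on the hang, here is a small sketch that measures the jump in dmesg timestamps right after resume begins. Treating the `ACPI: EC: interrupt unblocked` line as the start of resume is my own reading of the two logs above, not something documented, and the function name `stall_after_resume` is made up for illustration.

```python
import re

# Matches dmesg lines such as "[   62.712047] ACPI: EC: interrupt unblocked".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def stall_after_resume(log_text, marker="ACPI: EC: interrupt unblocked"):
    """Return seconds between the marker line and the next logged line, or None."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    for (t0, msg0), (t1, _msg1) in zip(entries, entries[1:]):
        # Assumption: the first line after the marker shows when resume continued.
        if marker in msg0:
            return t1 - t0
    return None


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    if not sys.stdin.isatty():
        print(stall_after_resume(sys.stdin.read()))
```

Applied to the two logs above, the gap is about 0.24 seconds without the password and about 36.3 seconds with it, which lines up with the interval reported in the clocksource message.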

BIOS is 3.05, Linux Kernel is “Linux 6.8.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 28 Apr 2024 15:59:47 +0000 x86_64 GNU/Linux”
64GB ADATA RAM and SN850X 4TB

Enabling or disabling ASPM through BIOS or Kernel Commandline doesn’t change anything.

@Matt_Hartley is it okay to piggyback on this thread, or should I open a new one?

Matt_Hartley · May 2, 2024, 5:21pm

Welcome to the community!

Ideally, we ask Linux users not to do this. We recommend using LUKS as it is friendly with resume.

Kellerkind · May 2, 2024, 7:33pm

Thank you very much for the welcome!

I already use LUKS + UKI + Secure Boot and just wanted to have a master password on the drive, to keep it locked in case of an evil maid scenario on the ESP.
I know there’s not a very high probability of Microsoft signing keys being leaked, but why not use a feature when it’s there and use SED?
Would love to help you people debug it just in case.

whitslack · July 27, 2024, 3:47pm

The same issue affects my Framework Laptop 16 except that the hang for me is about twice as long at just over a minute. The system does apparently fully recover if I just wait, but having resumes take over a minute makes s2idle simply unusable. Fortunately there is no hang when using hibernation, but I would really appreciate being able to configure my power management to sleep (not hibernate) after a few idle minutes while running on battery.

The hang occurs on both Ubuntu 24.04 LTS and Gentoo Linux running kernel 6.6.41, so I believe the problem must lie somewhere between the kernel and the hardware.

As for Framework’s consistently advising their customers not to enable hardware disk encryption, that is such a bizarre policy stance, especially given that Framework’s own UEFI firmware fully exposes all the requisite knobs to support hardware FDE. So, they’re providing a desirable function but advising their customers not to make use of it. Okay.

James3 · July 28, 2024, 8:44am

@whitslack
This issue appears to be an NVMe-device-specific problem, so please tell us which NVMe device you have.
The OP fixed their problem by using a different NVMe device.

whitslack · July 28, 2024, 4:05pm

I have a WD_BLACK SN770M and a WD_BLACK SN850X, both of which I purchased from Framework as parts of my Laptop 16 configuration. Does Framework offer incompatible components?

James3 · July 28, 2024, 4:30pm

I think there is a bug with the SN850X (it fails to wake up after suspend). Try upgrading the firmware of the NVMe device.

whitslack · July 29, 2024, 2:30am

I’m already on the latest firmware (as of last week anyway), but thanks for the suggestion. As I mentioned, the system does eventually recover, so I suspect there’s a timeout happening somewhere that leads to a reset code path, which suggests that a workaround in software should be possible.

Edit 2-Aug-2024: I just checked: 731120WD and 620361WD are still the latest available firmwares for the SN770M and SN850X respectively.