NVMe timeout woes

Over the last few kernel upgrades (6.9+, currently on 6.10.0), I’m seeing a lot of NVMe failures.

(I/O Cmd) QID 13 timeout, aborting req_op: DISCARD(3) size:17420288
nvme nvme0: I/O tag 322 (0142) opcode 0x9 (I/O Cmd) QID 13 timeout, aborting req_op: DISCARD(3) size:32862208
...
nvme nvme0: I/O tag 321 (0141) opcode 0x9 (I/O Cmd) QID 13 timeout, reset controller

My kernel config is here: .config · GitHub

Any idea whether this is a failing NVMe drive or a kernel issue? I’ve also been seeing GPU freezes since 6.9.

SMART doesn’t seem to report any errors:

# smartctl -a  /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.10.0-gentoo] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Predator SSD GM7000 2TB
Serial Number:                      PSBG53490901195
Firmware Version:                   3.A.J.CR
PCI Vendor/Subsystem ID:            0x1dbe
IEEE OUI Identifier:                0xa84397
Total NVM Capacity:                 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            a84397 3490901195
Local Time is:                      Mon Jul 22 10:27:23 2024 IST
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     120 Celsius
Critical Comp. Temp. Threshold:     130 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     3.10W       -        -    2  2  2  2       50     200
 3 -   0.1500W       -        -    3  3  3  3      500    5000
 4 -   0.0080W       -        -    4  4  4  4     2000   85000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        50 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    0%
Data Units Read:                    3,773,364 [1.93 TB]
Data Units Written:                 13,144,214 [6.72 TB]
Host Read Commands:                 50,262,320
Host Write Commands:                127,631,179
Controller Busy Time:               235
Power Cycles:                       434
Power On Hours:                     261
Unsafe Shutdowns:                   152
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               57 Celsius
Temperature Sensor 2:               50 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

Are you able to replicate these issues on an officially supported distro?

With the provided kernel config, yes.

If it’s timing out while trying to write to the NVMe drive, it’s probably about to fail.
I would back up your data now and replace the drive.

I updated the kernel command line with nvme_core.default_ps_max_latency_us=0 nvme_core.io_timeout=4294967295 and no longer seem to be running into the timeouts.
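For reference, assuming a GRUB setup (adjust for your bootloader), the parameters would go into /etc/default/grub something like this; the sysfs paths in the comment can confirm the values actually took effect after a reboot:

```shell
# /etc/default/grub -- hypothetical example, adapt to your existing cmdline
GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 nvme_core.io_timeout=4294967295"

# regenerate the config, e.g.:
#   grub-mkconfig -o /boot/grub/grub.cfg
# verify after reboot:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
#   cat /sys/module/nvme_core/parameters/io_timeout
```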

This is bizarre.

If that does what I think it does, it’ll probably also murder your battery life by never letting the SSD drop into its lower power states.

Yep, you’re correct, though I’m not sure how else to work around the issue. I’ve switched to saner defaults:

nvme_core.default_ps_max_latency_us=100 nvme_core.io_timeout=3000

but haven’t measured the impact on battery life yet.

With 100 it’ll only enter state 1, which isn’t power saving at all. The exit latencies on your particular SSD do look astronomical compared to what I’m used to seeing, though.
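A minimal sketch of the budgeting involved, assuming (this is my reading of it, not a quote of the kernel source) that APST only enables non-operational states whose entry plus exit latency fits within default_ps_max_latency_us. Plugging in the Ent_Lat/Ex_Lat values from the smartctl power-state table above:

```python
# Non-operational states (the ones marked "-" in the Op column) from the
# smartctl output: state -> (entry latency us, exit latency us)
states = {
    3: (500, 5000),
    4: (2000, 85000),
}

def eligible_states(max_latency_us):
    """States whose combined entry + exit latency fits within the budget."""
    return [s for s, (ent, ext) in states.items() if ent + ext <= max_latency_us]

print(eligible_states(100))     # []  -- budget of 100 us admits nothing
print(eligible_states(5500))    # [3]
print(eligible_states(100000))  # [3, 4]
```

Under that assumption, a budget of 100 µs wouldn’t admit any non-operational state on this drive at all; you’d need several milliseconds before state 3 becomes reachable.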

I also experienced a similar issue with kernel version 6.9.7 in the debian bookworm-backports repository. My computer became unresponsive and I couldn’t read or write from the disk. I had to hold the power button to shutdown and then I later reverted back to version 6.7.12 and had no issues since. I unfortunately have no logs since I shutdown the system before checking anything (I needed it to work now).