NVMe timeout woes

Over the last few kernel upgrades (6.9+, currently on 6.10.0), I’m seeing a lot of NVMe failures.

(I/O Cmd) QID 13 timeout, aborting req_op: DISCARD(3) size:17420288
nvme nvme0: I/O tag 322 (0142) opcode 0x9 (I/O Cmd) QID 13 timeout, aborting req_op: DISCARD(3) size:32862208
...
nvme nvme0: I/O tag 321 (0141) opcode 0x9 (I/O Cmd) QID 13 timeout, reset controller

My kernel config is here: .config · GitHub

Any idea whether this is a failing NVMe drive or a kernel issue? I’ve also been seeing GPU freezes since 6.9.

SMART doesn’t seem to report any errors:

# smartctl -a  /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.10.0-gentoo] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Predator SSD GM7000 2TB
Serial Number:                      PSBG53490901195
Firmware Version:                   3.A.J.CR
PCI Vendor/Subsystem ID:            0x1dbe
IEEE OUI Identifier:                0xa84397
Total NVM Capacity:                 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            a84397 3490901195
Local Time is:                      Mon Jul 22 10:27:23 2024 IST
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     120 Celsius
Critical Comp. Temp. Threshold:     130 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     3.10W       -        -    2  2  2  2       50     200
 3 -   0.1500W       -        -    3  3  3  3      500    5000
 4 -   0.0080W       -        -    4  4  4  4     2000   85000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        50 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    0%
Data Units Read:                    3,773,364 [1.93 TB]
Data Units Written:                 13,144,214 [6.72 TB]
Host Read Commands:                 50,262,320
Host Write Commands:                127,631,179
Controller Busy Time:               235
Power Cycles:                       434
Power On Hours:                     261
Unsafe Shutdowns:                   152
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               57 Celsius
Temperature Sensor 2:               50 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

Are you able to replicate these issues on an officially supported distro?

With the provided kernel config, yes.

If it’s timing out while trying to write to the NVMe drive, it’s probably about to fail.
I would back up your data now and replace the drive.

I updated the kernel command line with nvme_core.default_ps_max_latency_us=0 nvme_core.io_timeout=4294967295 and no longer seem to be running into the timeouts.
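For reference, assuming a GRUB setup (adjust for your bootloader), the parameters would go into /etc/default/grub something like this; the sysfs paths in the comment can confirm the values actually took effect after a reboot:

```shell
# /etc/default/grub -- hypothetical example, adapt to your existing cmdline
GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 nvme_core.io_timeout=4294967295"

# regenerate the config, e.g.:
#   grub-mkconfig -o /boot/grub/grub.cfg
# verify after reboot:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
#   cat /sys/module/nvme_core/parameters/io_timeout
```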

This is bizarre.

If that does what I think it does, it’ll probably also murder your battery life by never letting the SSD drop into its lower power states.

Yep, you’re correct, though I’m not sure how else to work around the issue. I’ve switched to saner defaults:

nvme_core.default_ps_max_latency_us=100 nvme_core.io_timeout=3000

but haven’t measured the impact on battery life yet.

With 100 it’ll only enter state 1, which isn’t power saving at all. The exit latencies on your particular SSD do look astronomical compared to what I’m used to seeing, though.
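A minimal sketch of the budgeting involved, assuming (this is my reading of it, not a quote of the kernel source) that APST only enables non-operational states whose entry plus exit latency fits within default_ps_max_latency_us. Plugging in the Ent_Lat/Ex_Lat values from the smartctl power-state table above:

```python
# Non-operational states (the ones marked "-" in the Op column) from the
# smartctl output: state -> (entry latency us, exit latency us)
states = {
    3: (500, 5000),
    4: (2000, 85000),
}

def eligible_states(max_latency_us):
    """States whose combined entry + exit latency fits within the budget."""
    return [s for s, (ent, ext) in states.items() if ent + ext <= max_latency_us]

print(eligible_states(100))     # []  -- budget of 100 us admits nothing
print(eligible_states(5500))    # [3]
print(eligible_states(100000))  # [3, 4]
```

Under that assumption, a budget of 100 µs wouldn’t admit any non-operational state on this drive at all; you’d need several milliseconds before state 3 becomes reachable.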

I also experienced a similar issue with kernel version 6.9.7 in the debian bookworm-backports repository. My computer became unresponsive and I couldn’t read or write from the disk. I had to hold the power button to shutdown and then I later reverted back to version 6.7.12 and had no issues since. I unfortunately have no logs since I shutdown the system before checking anything (I needed it to work now).