Over the last few kernel upgrades (6.9+, currently on 6.10.0), I’ve been seeing a lot of NVMe failures:
(I/O Cmd) QID 13 timeout, aborting req_op: DISCARD(3) size:17420288
nvme nvme0: I/O tag 322 (0142) opcode 0x9 (I/O Cmd) QID 13 timeout, aborting req_op: DISCARD(3) size:32862208
...
nvme nvme0: I/O tag 321 (0141) opcode 0x9 (I/O Cmd) QID 13 timeout, reset controller
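To check whether every timeout really is a DISCARD on the same queue, the log lines can be tallied with a short script. This is only a rough sketch: the regular expressions are assumed from the message format quoted above (other kernel versions may word it differently), and `summarize_timeouts` is a made-up helper name, not part of any tool.

```python
import re
from collections import Counter

# Pattern assumed from the kernel messages quoted above; truncated lines are skipped.
TIMEOUT_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O tag (?P<tag>\d+) \((?P<cid>[0-9a-f]+)\) "
    r"opcode (?P<opcode>0x[0-9a-f]+) \(I/O Cmd\) QID (?P<qid>\d+) timeout, "
    r"(?P<action>.+)$"
)
REQ_OP_RE = re.compile(r"req_op:\s*(?P<op>\w+)\(\d+\)\s+size:(?P<size>\d+)")


def summarize_timeouts(lines):
    """Tally timeouts per (queue id, opcode) and per req_op; count controller resets."""
    by_queue = Counter()
    by_req_op = Counter()
    resets = 0
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if not m:
            continue  # not a complete NVMe timeout message
        by_queue[(int(m["qid"]), m["opcode"])] += 1
        action = m["action"]
        if action.startswith("reset controller"):
            resets += 1
        op = REQ_OP_RE.search(action)
        if op:
            by_req_op[op["op"]] += 1
    return by_queue, by_req_op, resets


if __name__ == "__main__":
    import sys
    # Example: dmesg | python3 this_script.py
    queues, req_ops, resets = summarize_timeouts(sys.stdin)
    print("per (QID, opcode):", dict(queues))
    print("per req_op:", dict(req_ops))
    print("controller resets:", resets)
```

If the tally shows only DISCARD requests, that would point at trim/discard handling rather than general I/O, though it would not by itself say whether the drive firmware or the kernel is at fault.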
My kernel config is on GitHub (.config).
Any ideas whether this is a failing NVMe drive or a kernel issue? I’ve also been seeing GPU freezes since 6.9.
SMART doesn’t seem to report any errors:
# smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.10.0-gentoo] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Predator SSD GM7000 2TB
Serial Number: PSBG53490901195
Firmware Version: 3.A.J.CR
PCI Vendor/Subsystem ID: 0x1dbe
IEEE OUI Identifier: 0xa84397
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: a84397 3490901195
Local Time is: Mon Jul 22 10:27:23 2024 IST
Firmware Updates (0x0e): 7 Slots
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 120 Celsius
Critical Comp. Temp. Threshold: 130 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     3.10W       -        -    2  2  2  2       50     200
 3 -   0.1500W       -        -    3  3  3  3      500    5000
 4 -   0.0080W       -        -    4  4  4  4     2000   85000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 50 Celsius
Available Spare: 100%
Available Spare Threshold: 25%
Percentage Used: 0%
Data Units Read: 3,773,364 [1.93 TB]
Data Units Written: 13,144,214 [6.72 TB]
Host Read Commands: 50,262,320
Host Write Commands: 127,631,179
Controller Busy Time: 235
Power Cycles: 434
Power On Hours: 261
Unsafe Shutdowns: 152
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 57 Celsius
Temperature Sensor 2: 50 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
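
For comparing these counters between runs (for example, before and after the next timeout), the plain-text output above can be reduced to numbers with a few lines of Python. This is a minimal sketch that assumes the "Name: value" layout shown in this post; `parse_smart_counters` is a hypothetical helper, and hex or non-numeric fields are deliberately skipped.

```python
import re


def parse_smart_counters(text):
    """Return {field name: int} for 'Name: <integer>' lines of smartctl -a text output."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no "Name: value" structure on this line
        # Accept plain integers with optional thousands separators,
        # followed by whitespace, '%', or end of line (so 0x00 or 3.A.J.CR are skipped).
        m = re.match(r"\s*(\d[\d,]*)(?=\s|%|$)", value)
        if m:
            counters[name.strip()] = int(m.group(1).replace(",", ""))
    return counters


if __name__ == "__main__":
    import sys
    # Example: smartctl -a /dev/nvme0 | python3 this_script.py
    c = parse_smart_counters(sys.stdin.read())
    for key in ("Power Cycles", "Unsafe Shutdowns",
                "Media and Data Integrity Errors", "Error Information Log Entries"):
        print(f"{key}: {c.get(key)}")
```

On the output above this would report 152 unsafe shutdowns out of 434 power cycles (about 35%), with zero media errors and zero error log entries. The unsafe shutdowns may simply reflect hard resets after the freezes, but that is a guess on my part.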