FW13: random NVME I/O errors with Debian Bookworm

Hello everybody,

I have been using my Framework 13th gen laptop for a few weeks now and randomly run into I/O errors with Debian Bookworm:


[Sat Jan  4 15:28:11 2025] nvme nvme0: request 0x387 genctr mismatch (got 0xf expected 0x3)
[Sat Jan  4 15:28:11 2025] nvme nvme0: invalid id 62343 completed on queue 4
[Sat Jan  4 15:28:41 2025] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x11
[Sat Jan  4 15:28:41 2025] nvme0n1: I/O Cmd(0x2) @ LBA 1683824856, 256 blocks, I/O Error (sct 0x3 / sc 0x71) 
[Sat Jan  4 15:28:41 2025] I/O error, dev nvme0n1, sector 1683824856 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
[Sat Jan  4 15:28:41 2025] nvme0n1: I/O Cmd(0x2) @ LBA 1673488976, 256 blocks, I/O Error (sct 0x3 / sc 0x71) 
[Sat Jan  4 15:28:41 2025] I/O error, dev nvme0n1, sector 1673488976 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
[Sat Jan  4 15:28:41 2025] nvme0n1: I/O Cmd(0x2) @ LBA 1584020760, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
[Sat Jan  4 15:28:41 2025] I/O error, dev nvme0n1, sector 1584020760 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[Sat Jan  4 15:28:41 2025] nvme0n1: I/O Cmd(0x2) @ LBA 631909664, 256 blocks, I/O Error (sct 0x3 / sc 0x71) 
[Sat Jan  4 15:28:41 2025] I/O error, dev nvme0n1, sector 631909664 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
[Sat Jan  4 15:28:41 2025] nvme0n1: I/O Cmd(0x2) @ LBA 1671510392, 32 blocks, I/O Error (sct 0x3 / sc 0x71) 
[Sat Jan  4 15:28:41 2025] I/O error, dev nvme0n1, sector 1671510392 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[Sat Jan  4 15:28:41 2025] nvme0n1: I/O Cmd(0x2) @ LBA 1694509688, 64 blocks, I/O Error (sct 0x3 / sc 0x71) 
[Sat Jan  4 15:28:41 2025] I/O error, dev nvme0n1, sector 1694509688 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2

The strange thing is that it doesn’t happen every day, but often enough to be really annoying. Every time it happens, the OS freezes for a few seconds before it responds again. After one of those freezes, I ran dmesg and saw the I/O errors shown above.

The NVME drive is new, only a few weeks old (I bought it on Amazon when I bought the Framework laptop).

I already ran an extended nvme-cli health check, but according to that everything is fine:

Device Self Test Log for NVME device:nvme0
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x131
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x131
  Vendor Specific              : 0 0
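
For readers unfamiliar with these fields, here is a small Python sketch of how the two entries above can be read. The lookup tables are my own partial transcription of the NVMe specification (self-test code 1 = short, 2 = extended, result 0 = completed without error), and `describe_entry` is a hypothetical helper, not something nvme-cli provides.

```python
# Partial lookup tables, assumed from the NVMe base specification.
SELF_TEST_CODES = {1: "short", 2: "extended"}
RESULTS = {0: "completed without error"}


def describe_entry(code, result, poh_hex):
    """Turn one self-test log entry into a readable sentence."""
    kind = SELF_TEST_CODES.get(code, f"other ({code})")
    outcome = RESULTS.get(result, f"not decoded here ({result})")
    hours = int(poh_hex, 16)  # the log prints power-on hours in hex
    return f"{kind} self-test {outcome} at {hours} power-on hours"


# Values copied from the two log entries above.
print(describe_entry(2, 0, "0x131"))
print(describe_entry(1, 0, "0x131"))
```

Read this way, both the extended and the short test finished cleanly at 305 power-on hours.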

BIOS Version: 03.05

Does anyone have an idea what the issue could be? Is my NVME (slowly) dying even though the diagnostics still report it as fine?

Thanks!

Bernhard

Hi. It might be the hardware failing, but I am not 100% sure.
Do you get any clues from:
nvme smart-log /dev/nvme0n1
nvme error-log /dev/nvme0n1

Thanks for your reply!

I don’t see anything suspicious in the outputs.

The output of nvme smart-log /dev/nvme0n1:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 38°C (311 Kelvin)
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 0%
endurance group critical warning summary: 0
Data Units Read				: 510,809 (261.53 GB)
Data Units Written			: 6,391,475 (3.27 TB)
host_read_commands			: 7,163,041
host_write_commands			: 27,384,190
controller_busy_time			: 178
power_cycles				: 129
power_on_hours				: 348
unsafe_shutdowns			: 5
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Temperature Sensor 1           : 38°C (311 Kelvin)
Temperature Sensor 2           : 38°C (311 Kelvin)
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0

(I needed to forcibly shut down the system a handful of times due to a complete freeze; I guess that’s where the unsafe shutdowns come from.)
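
In case it helps anyone who wants to track these counters over time, here is a minimal Python sketch that parses the text output above and flags the fields I would look at first. The choice of watched fields is my own assumption, and `parse_smart_log` and `nonzero_counters` are hypothetical helpers that only handle the plain-text layout shown in this post.

```python
# Counters that I assume should stay at zero on a healthy drive.
WATCHED = ("critical_warning", "media_errors", "num_err_log_entries")


def parse_smart_log(text):
    """Split each 'name : value' line of the smart-log text into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def nonzero_counters(fields):
    """Return the watched counters whose value is not zero."""
    flagged = {}
    for name in WATCHED:
        raw = fields.get(name, "0").replace(",", "")
        if int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


# A few lines copied from the output above.
sample = """critical_warning\t\t\t: 0
unsafe_shutdowns\t\t\t: 5
media_errors\t\t\t\t: 0
num_err_log_entries\t\t\t: 0
"""
print(nonzero_counters(parse_smart_log(sample)))
```

For the values posted here nothing is flagged, which matches the impression that the drive itself reports no media problems.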

The output of nvme error-log /dev/nvme0n1:

.................
 Entry[ 0]   
.................
error_count	: 0
sqid		: 0
cmdid		: 0
status_field	: 0(Successful Completion: The command completed without error)
phase_tag	: 0
parm_err_loc	: 0
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

I get 64 of those entries, all with the same output.

Bernhard

I guess the first step would be to check that you have a backup of everything you need.
Then, maybe try removing the NVME device, cleaning the connections, and installing it again.
The errors seem to be linked more to a problem communicating with the NVME device than to the actual storage of data. That is why I suggested cleaning the connections.
It could also be an intermittent hardware fault in the controller chip on the NVME device, in which case you should probably replace the NVME device with a new one.
Lastly, it could be a fault with the FW motherboard, but I can’t think of any way to test that apart from introducing a known good NVME device (i.e. a new one).
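
To back up the reading that these look like path-level rather than media errors, here is a rough Python sketch that decodes the `sct` / `sc` pair from the dmesg lines in the first post. The name tables are my own partial transcription of the NVMe base specification and only cover the values relevant here, and `decode_status` is a hypothetical helper, not part of nvme-cli or the kernel.

```python
import re

# Status Code Type names, assumed from the NVMe base specification.
SCT_NAMES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
    0x7: "Vendor Specific",
}

# Only the path-related status codes needed for this thread.
PATH_STATUS_NAMES = {
    0x70: "Host Pathing Error",
    0x71: "Command Aborted By Host",
}

STATUS_RE = re.compile(r"sct (0x[0-9a-fA-F]+) / sc (0x[0-9a-fA-F]+)")


def decode_status(line):
    """Return (type name, code name) for a dmesg I/O error line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    sct = int(match.group(1), 16)
    sc = int(match.group(2), 16)
    sct_name = SCT_NAMES.get(sct, "unknown")
    if sct == 0x3:
        sc_name = PATH_STATUS_NAMES.get(sc, "unknown")
    else:
        sc_name = "not decoded here"
    return sct_name, sc_name


line = "nvme0n1: I/O Cmd(0x2) @ LBA 1683824856, 256 blocks, I/O Error (sct 0x3 / sc 0x71)"
print(decode_status(line))
```

For the logged line this gives "Path Related Status" and "Command Aborted By Host". As far as I understand, that is the status the Linux driver assigns to in-flight requests it cancels while resetting the controller, which fits the "controller is down; will reset" message just before the errors. A media problem would normally show up under status code type 0x2 instead.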

Thanks a lot!

I have now replaced my NVME drive with a new one, and so far there are no new errors. It has only been a day, though, so let’s see if that finally fixed the issue :slight_smile:

Some NVME drives have a long 5-year warranty, so you might be able to get it replaced for free.