Resolving PCIe storage instability using Linux kernel flags

I recently got a Framework Desktop motherboard and was excited to upgrade, but I ran into some issues with my ~6-disk storage array and PCIe SATA controller.

When running several heavy parallel read/write workloads that strongly stressed the disks and controller, I would get this dmesg line:

ahci 0000:c1:00.0: Using 64-bit DMA addresses

Followed immediately by corrupted sector reads, seemingly randomly distributed across every disk in the array. I tried a Marvell 88SE9215 controller and an ASM1166 controller without any improvement. From what I've come to understand, this is the SATA controller card resetting itself mid-operation.
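If you want something to hammer on while watching the log, here is a rough sketch of the shape of the workload (parallel writers, then parallel readers). The paths, job counts, and sizes are my own placeholders, not the exact workload I ran, and temp files alone won't necessarily reproduce it — the real trigger was sustained I/O against the array itself:

```shell
#!/bin/sh
# Sketch only: parallel writers followed by parallel readers.
# Point the files at the array's filesystem to actually stress the controller.
set -e
for i in 1 2 3 4; do
  dd if=/dev/zero of=/tmp/stress_$i.bin bs=1M count=32 status=none &
done
wait
for i in 1 2 3 4; do
  dd if=/tmp/stress_$i.bin of=/dev/null bs=1M status=none &
done
wait
rm -f /tmp/stress_?.bin
echo "stress pass complete"
# Afterwards, check the kernel log for AHCI/ATA trouble (may need root):
#   dmesg | grep -Ei 'ahci|ata[0-9]+:|i/o error'
```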

I spent quite a bit of time chasing this as a physical issue (new card, cable replacements, power supply check, etc.) before settling on it being a Framework Desktop issue that had nothing to do with the other hardware.

By adding this to the kernel command line in /etc/default/grub I could stop the card from resetting and the errors from occurring, although it does break suspend:

amd_iommu=off iommu=soft pci=nomsi pci=noaer libata.force=noncq ahci.mobile_lpm_policy=0 libata.noacpi=1 pcie_aspm=off

This is a shotgun of flags gathered from various forum posts and some LLM usage. After a lot of trial and error, I have found that

amd_iommu=off pci=noaer pcie_aspm=off

Resolves the issue and maintains suspend. Edit it into your /etc/default/grub, run sudo grub2-mkconfig -o /boot/grub2/grub.cfg, then reboot, and you should be good to go until Framework releases a firmware fix for this issue.
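For concreteness, the flags belong on the GRUB_CMDLINE_LINUX line (some distros use GRUB_CMDLINE_LINUX_DEFAULT instead); append them to whatever is already there rather than replacing the line. A sketch of the edit:

```shell
# /etc/default/grub -- keep your existing flags and add these three
# (line name varies by distro: GRUB_CMDLINE_LINUX or GRUB_CMDLINE_LINUX_DEFAULT)
GRUB_CMDLINE_LINUX="amd_iommu=off pci=noaer pcie_aspm=off"
```

After regenerating the config and rebooting, `cat /proc/cmdline` will show whether the flags actually made it onto the running kernel's command line.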

I hope this is helpful to those in a similar boat and saves you all some debugging.

Hi @DebugDan I wonder if this is related to the issues here?

Can you try with just ASPM off? If that helps it should be fixable by firmware I would expect.

I’ve been trying it with just the ASPM off flag but it didn’t work for me.

@DebugDan I tried both the simpler 3-flag approach as well as the full 8-flag approach and neither one has solved the issue for me.

In my case I am using a PCIe x4 to x16 adapter and then a riser cable, so I can't rule out that something there is also interfering, but essentially my situation is that the GPU keeps switching between PCIe 4.0, 1.0, 2.0, 3.0, back to 4.0, and then down to 1.0 again.
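For anyone on Linux wanting to catch that link-speed bouncing, the negotiated speed shows up in lspci's LnkSta line; the device address below is just an example, and the live commands are commented out since they need root and real hardware:

```shell
# Live check (example address; find yours with plain `lspci` first):
#   sudo lspci -vv -s c1:00.0 | grep LnkSta:
# Repeat it (e.g. under `watch -n1`) -- a Speed value that keeps changing
# means the link is retraining. Pulling the speed out of a sample line:
echo 'LnkSta: Speed 2.5GT/s (downgraded), Width x4' | grep -oE 'Speed [0-9.]+GT/s'
```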

Edit: Also found this - wondering if it may be related?

Maybe. I debated trying to use the PCIe slot with a riser to run my 7900 XTX GPU, but after days of checking all the specs and considering all of the options, I just didn't find a configuration that I was confident would work. So I abandoned that plan and got an eGPU over USB4 instead (there is another thread on this). That has worked pretty well for me so far. I decided to use the PCIe slot for a third SSD, which is also working well. Both took some work and a little trial and error, but it was worth it.

Graphics: AMD Radeon 8060S
AMD Radeon 8060S, 98304 MB LPDDR5 SDRAM
Graphics: AMD Radeon RX 7900 XTX (Navi31 XTX) [ASRock]
AMD Radeon RX 7900 XTX, 24576 MB GDDR6 SDRAM
Drive: CT4000P310SSD8, 3907.0 GB, NVMe
Drive: ORICO, 4000.8 GB, NVMe
Drive: CT4000P310SSD8, 3907.0 GB, NVMe
Drive: Western Digital SN560E, 1953.5 GB, NVMe
OS: Microsoft Windows 11 Professional (x64) Build 26200.7462 (25H2)

Do you know if it was also jumping between Gen 4.0 and Gen 1.0 (and in between)?

I filed a support request referencing this thread in the hope that maybe between them and the community we might find a solution.

Edit: Found this on GitHub that may be related too:

I can confirm that manually forcing the PCIe slot to Gen 3 in the BIOS was what worked for me. I was seeing similar instability where the link wouldn't negotiate properly or the system would hang. Even though my drive and adapter claimed Gen 4/5 compatibility, the Framework Desktop board seems very sensitive on that x4 slot. As soon as I hard-locked it to Gen 3, it worked as it should.
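On Linux you can verify the lock took effect by reading the current link speed from sysfs and mapping it to a generation. The helper below is just a sketch (the speed-to-generation table is from the PCIe spec: Gen1 = 2.5 GT/s, Gen2 = 5 GT/s, Gen3 = 8 GT/s, Gen4 = 16 GT/s), and the device address in the comment is an example:

```shell
# Map a kernel-reported link speed string to a PCIe generation (sketch).
pcie_gen() {
  case "$1" in
    "2.5 GT/s"*)  echo "Gen 1" ;;
    "5.0 GT/s"*)  echo "Gen 2" ;;
    "8.0 GT/s"*)  echo "Gen 3" ;;
    "16.0 GT/s"*) echo "Gen 4" ;;
    *)            echo "unknown" ;;
  esac
}
# Live check (example address):
#   pcie_gen "$(cat /sys/bus/pci/devices/0000:c1:00.0/current_link_speed)"
pcie_gen "8.0 GT/s PCIe"
```

If the BIOS lock stuck, that live check should keep printing "Gen 3" even under load.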