Attn: critical bugs in amdgpu driver included with kernel 6.18.x / 6.19.x!

While doing some research for my upcoming Linux installation on my Framework 16 which arrives this week, I noticed the following kernel issue mentioned on Gentoo’s bug tracker:

The report says that after heavy GPU workloads (in this case training with PyTorch) on the Radeon 890M (aka gfx1150, the integrated GPU of Strix Point / HX 370), the system becomes unstable.

Later in the thread the OP mentions that he has pinpointed the problem to broken CWSR (Compute Wavefront Save/Restore), which causes the MES ring buffer to become saturated and the VPE queue to fail to reset, eventually leading to GPU reset loops and system crashes and/or reboots.

He mentions that for this particular problem and use case, disabling CWSR with the kernel flag:

amdgpu.cwsr_enable=0

is a valid workaround.
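
If you want to verify that the flag actually took effect after a reboot, here is a minimal sketch in Python. It only assumes the usual Linux locations: /proc/cmdline for the boot command line and, when the amdgpu module is loaded, /sys/module/amdgpu/parameters/cwsr_enable for the runtime value. The function names are made up for illustration and are not from the bug report.

```python
from pathlib import Path
from typing import Optional


def cmdline_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of the last `name=value` token on a kernel command line."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # a later occurrence overrides an earlier one
    return value


def cwsr_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the command line asks amdgpu to disable CWSR."""
    return cmdline_param(cmdline, "amdgpu.cwsr_enable") == "0"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        disabled = cwsr_disabled_on_cmdline(proc_cmdline.read_text())
        print("cwsr disabled on cmdline:", disabled)
    # Assumption: the module exposes the parameter read-only under sysfs.
    runtime = Path("/sys/module/amdgpu/parameters/cwsr_enable")
    if runtime.exists():
        print("runtime cwsr_enable:", runtime.read_text().strip())
```

A runtime value of 0 would indicate that CWSR is off. If the flag is set through a modprobe option instead of the boot command line, only the sysfs check will show it.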

BUT (!!!)

he also mentions the following post on Phoronix:

I quote:

"[…] But when it came to testing on Linux 6.19 with the newer AMD Radeon graphics cards, that’s where things went downhill quickly. With Linux 6.19 and also confirming back on Linux 6.18, namely with the newer graphics cards they were all hitting hard hangs – some cards more quickly than others, but on none of the RDNA3 and RDNA4 GPUs could I end up with a complete run of the benchmarks.

With these newer graphics cards on Linux 6.18/6.19 there would end up being hard hangs when running different benchmarks. I had heard of some Linux gamers and users in the forums complaining of issues but not until this latest round of testing did I discover how bad and widespread this issue was with ultimately abandoning my holiday testing of the RDNA3/RDNA4 cards due to this outstanding regression.

They were hard hangs in not being able to remotely access the system and no kernel logs archived to disk. I confirmed as well with the Valve Linux graphics driver folks that they too have encountered this behavior on Linux 6.18+ with this rather show-stopping issue. So far no issue resolution from AMD nor does it look like anyone has bisected this issue yet."

As most of you (like me) are going to run, or are already running, Linux on various RDNA 3/4 AMD GPUs, it might be good to be aware of this.

SO

Since Framework has stated that the Framework 16 will run pretty great on kernel 6.15 onwards, it might be best to stick to kernel 6.15.*, 6.16.* or 6.17.* for now, until AMD and independent contributors fix the issues in the amdgpu driver included in kernel 6.18+.

Also be wary of amdgpu backports, because problems may arise with kernels that include those as well.
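
To quickly see which bucket your running kernel falls into, here is a rough Python sketch. The version boundaries are simply the ones suggested in this post (6.15 to 6.17 suggested, 6.18+ reported as problematic), and the function names and labels are made up for illustration.

```python
import platform
import re
from typing import Optional, Tuple


def kernel_major_minor(release: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a release string such as '6.17.9-gentoo'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify_kernel(release: str) -> str:
    """Label a kernel release according to the ranges suggested in this post."""
    version = kernel_major_minor(release)
    if version is None:
        return "unknown"
    if version >= (6, 18):
        return "reported-problematic"  # 6.18+ per the reports quoted above
    if version >= (6, 15):
        return "suggested"  # 6.15.*, 6.16.*, 6.17.*
    return "older"


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", classify_kernel(release))
```

Keep in mind that the version string alone cannot tell you whether a distribution kernel carries backported amdgpu patches, so treat the result as a first hint only.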


On the 7940HS there is no problem with the kernel. Everything works well (I use Fedora, which has a 6.17+ kernel)…

It may not even really be related to the kernel, but to firmware. It may be the same as for Strix Halo (I don’t know), but you can have a look at: FYI linux-firmware-amdgpu 20251125 breaks rocm on AI Max 395/8060S

AMD has reverted some firmware changes, but for now this has not been published in a linux-firmware release.

I wanted to post that on Phoronix, but I can’t (and I don’t know why)…

You might want to actually try running Linux on it before you waste time researching any possible bugs lol… There are many of us Linux users running it just fine. Besides, I have ollama disabled anyway and only enable it when I need to use it. The rest of the time I am happily playing games on mine with no ill effects.

Once I get my DIY package and have assembled everything, I’m going to pull the latest unstable kernel from Portage and experiment a bit.

I just saw the Gentoo kernel team very recently started taking some patches directly from AMDGPU upstream and patching the 6.17+ distribution kernels with them, skipping waiting for vanilla sources entirely, so some stabilization on the distro level might already be in motion.

Found this post while searching for similar issues. I’m not using a Framework laptop, but I’m seeing very similar behavior.

Hardware:
Ryzen AI 9 HX 370 (Strix Point) with Radeon 890M (gfx1150), 96 GB RAM, Pop!_OS 24.04 (Ubuntu-based), kernel 6.17.x.

Symptoms:

  • Hard system freezes that end in hung_task panic / kernel panic

  • Happens on Wayland (COSMIC & GNOME) and also on Xorg

  • Sometimes the screen fills with purple noise / corruption before crashing

  • After a crash, Wi-Fi and Bluetooth stop working until a full power-off

  • Often triggered by browser usage (PDFs, banking sites, GPU-accelerated content)

  • System logs often stop before the freeze; pstore is usually empty

Debugging done (with help from others):

  • Memory test passed (no RAM errors)

  • Tried disabling panel self refresh (PSR)

  • Tried disabling PCIe power saving (ASPM)

  • Tested multiple desktop setups: GNOME and COSMIC, both Wayland and Xorg

  • Removed ROCm / GPU compute stack

  • Tested different AMD firmware versions (both distro default and newer upstream firmware)

  • Noticed occasional AMD GPU memory/page fault messages early after boot
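
For anyone who wants to check their own logs for the page fault messages mentioned in the last item, here is a rough Python sketch. It assumes the relevant kernel lines contain both "amdgpu" and "page fault"; the exact wording can differ between kernel versions, so this is a loose filter and not a parser. The function name is made up for illustration.

```python
from typing import List


def amdgpu_fault_lines(log_text: str) -> List[str]:
    """Return log lines that mention both amdgpu and a page fault.

    Assumption: matching is a case-insensitive substring test, since the
    exact message format may vary between kernel versions.
    """
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if "amdgpu" in lowered and "page fault" in lowered:
            hits.append(line.strip())
    return hits
```

The text passed in could be the saved output of `journalctl -k -b` or `dmesg`. An empty result only means nothing matched this filter, not that the GPU is healthy.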

The crashes still happen intermittently, even on Xorg, which suggests this is not compositor-specific.

This looks very similar to the recent amdgpu DCN / DMUB / Strix Point issues discussed here. I don’t have a solution yet, but I wanted to share my findings in case others run into the same problem.


Happy for you, but I and many others are experiencing this issue (even without any AI stuff):

It is most certainly a real concern and my computer will remain a very expensive paperweight until this is fixed.