While doing some research for my upcoming Linux installation on my Framework 16, which arrives this week, I noticed the following kernel issue on Gentoo’s bug tracker:
It mentions that after heavy GPU workloads (in this case training with PyTorch) on the Radeon 890M (aka gfx1150, aka the integrated GPU of Strix Point / HX370), the system becomes unstable.
Later in the thread the OP mentions that he has pinpointed the problem to broken CWSR (Compute Wavefront Save/Restore): the MES ring buffer becomes saturated and the VPE queue fails to reset, which eventually leads to GPU reset loops and system crashes and/or reboots.
He mentions that for this particular problem and use case, booting with the kernel parameter:
amdgpu.cwsr_enable=0
which disables CWSR, is a valid workaround.
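If you want to verify that the parameter actually made it onto the kernel command line after a reboot, here is a minimal Python sketch. It only parses the text of /proc/cmdline; the function name cwsr_setting is my own, and the script does not prove that the driver honoured the value.

```python
from pathlib import Path


def cwsr_setting(cmdline: str):
    """Return the value of amdgpu.cwsr_enable on a kernel command line.

    Returns None if the parameter is absent. If it appears more than
    once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "amdgpu.cwsr_enable":
            value = val
    return value


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # only present on Linux
    if proc_cmdline.exists():
        setting = cwsr_setting(proc_cmdline.read_text())
        if setting is None:
            print("amdgpu.cwsr_enable is not set on the command line")
        else:
            print(f"amdgpu.cwsr_enable={setting}")
```

How you add the parameter depends on your bootloader (GRUB, systemd-boot, ...), so check your distro's documentation for that part.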
BUT (!!!)
he also mentions the following post on Phoronix:
I quote:
"[…] But when it came to testing on Linux 6.19 with the newer AMD Radeon graphics cards, that’s where things went downhill quickly. With Linux 6.19 and also confirming back on Linux 6.18, namely with the newer graphics cards they were all hitting hard hangs – some cards more quickly than others, but on none of the RDNA3 and RDNA4 GPUs could I end up with a complete run of the benchmarks.
With these newer graphics cards on Linux 6.18/6.19 there would end up being hard hangs when running different benchmarks. I had heard of some Linux gamers and users in the forums complaining of issues but not until this latest round of testing did I discover how bad and widespread this issue was with ultimately abandoning my holiday testing of the RDNA3/RDNA4 cards due to this outstanding regression.
They were hard hangs in not being able to remotely access the system and no kernel logs archived to disk. I confirmed as well with the Valve Linux graphics driver folks that they too have encountered this behavior on Linux 6.18+ with this rather show-stopping issue. So far no issue resolution from AMD nor does it look like anyone has bisected this issue yet."
As most of you (like me) are going to run, or are in fact already running, Linux on various RDNA 3/4 AMD GPUs, it might be good to be aware of this.
SO
Since Framework has stated that the Framework 16 will run pretty well on kernel 6.15 onwards, it might be best to stick to kernel 6.15.*, 6.16.* or 6.17.* for now, until AMD and independent contributors fix the issues in the amdgpu driver included in kernel 6.18+.
Also be wary of amdgpu backports, because the same problems may arise in kernels that include them.
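To quickly check whether the kernel you are running falls in the 6.15 to 6.17 range suggested above, something like this rough Python sketch would do. The function names are my own, and it only compares the major.minor version, so it cannot tell whether a distro kernel carries amdgpu backports.

```python
import platform
import re


def kernel_major_minor(release: str):
    """Extract (major, minor) from a release string such as '6.17.4-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_suggested_range(release: str, low=(6, 15), high=(6, 17)) -> bool:
    """True if the kernel's major.minor lies between low and high, inclusive."""
    return low <= kernel_major_minor(release) <= high


if __name__ == "__main__":
    release = platform.release()  # e.g. '6.17.4-arch1-1' on Linux
    try:
        verdict = "inside" if in_suggested_range(release) else "outside"
        print(f"{release} is {verdict} the 6.15 to 6.17 range")
    except ValueError as err:
        print(err)
```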