Linux Stability Patch coming to kernel 6.18

AMD has pushed a fix for the instability issues on gfx1151 that all of us have been experiencing on Linux when running GenAI workloads with llama.cpp/PyTorch/ComfyUI etc.

The fix is currently in Linus’ tree and should be released in kernel 6.18-rc1 soon: https://github.com/torvalds/linux/commit/1fb710793ce2619223adffaf981b1ff13cd48f17

I cherry-picked the commit and tested it manually with a repository I created specifically to reproduce this issue, and I can confirm the issue has not resurfaced: https://github.com/kyuz0/triton_gfx1151_crashes

For everyone on Linux, update to kernel 6.18 as soon as it becomes available in your distribution.
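If you are not sure which kernel you are running, here is a minimal Python sketch that compares the running release against 6.18. The function names are made up for illustration. It only looks at the version number, so a backported build on an older version (like the Fedora 42 kernel mentioned below) would show as "older" even though it carries the patch.

```python
import platform
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.18.0-rc1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_mainline_fix(release: str) -> bool:
    # Assumption: the fix first ships in mainline 6.18, as described in this post.
    # Backported kernels with a lower version number are NOT detected.
    return kernel_version(release) >= (6, 18)


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    status = "6.18 or newer" if has_mainline_fix(release) else "older than 6.18"
    print(f"{release}: {status}")
```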

If you are on Fedora 42 and want the fix now, you can grab the kernel I built where I backported the patch (this is what I used to test stability):

Let me know if this works for you!

Hi.

The patch might fix some cases, but I tried it and it did not help with the crashes on my FW16 (gfx1103). So it may or may not help.

This patch is specifically for gfx1151; it won’t make any difference unless you have Strix Halo.

Well, AMD asked me a while ago to test that patch on gfx1103, so I assume it affects that as well. But I reported back to them that it did not help me.
In contrast, something else does help prevent crashes on gfx1103 (7840HS), namely this kernel parameter:
amdgpu.cwsr_enable=0

Maybe that could help the Halo also?
Note that on gfx1103 you still get one crash after a reboot, but it runs OK after that.
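To check whether the parameter actually made it onto the kernel command line after a reboot, something like this rough Python sketch should do. The helper name is hypothetical, and it only parses the text of /proc/cmdline; it does not ask the driver what it is really using.

```python
from pathlib import Path
from typing import Optional


def cmdline_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of a name=value token on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # keep the last occurrence if the parameter is repeated
    return value


if __name__ == "__main__":
    path = Path("/proc/cmdline")  # command line the running kernel was booted with
    if path.exists():
        value = cmdline_param(path.read_text(), "amdgpu.cwsr_enable")
        print("amdgpu.cwsr_enable:", "not set" if value is None else value)
```

If it prints "not set", the bootloader configuration was probably not regenerated after editing.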

Has anyone noticed any performance improvement with the new kernel?