AMD has pushed a fix for the instability issues on gfx1151 that all of us have been experiencing on Linux when running GenAI workloads with llama.cpp/PyTorch/ComfyUI etc…
I cherry-picked the commit and manually tested it with a repository I specifically created to reproduce this issue and I can confirm the issue has not surfaced again: https://github.com/kyuz0/triton_gfx1151_crashes
For everyone on Linux, update to kernel 6.18 as soon as it becomes available in your distribution.
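If you want a quick way to see whether the kernel you are running is already at 6.18 or newer, here is a minimal Python sketch. The helper name `kernel_at_least` is my own, and it only compares the version string, so it will not recognise a distribution kernel (or a custom build) that carries the patch as a backport on an older version.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a kernel release string is at least major.minor.

    Only the leading "X.Y" is compared; a backported patch on an older
    kernel cannot be detected this way.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    if kernel_at_least(release, 6, 18):
        print(f"{release}: 6.18 or newer")
    else:
        print(f"{release}: older than 6.18 (a backport may still be present)")
```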
If you are on Fedora 42 and want the fix now, you can grab the kernel I built where I backported the patch (this is what I used to test stability):
Well, AMD asked me a while ago to test that patch on the gfx1103, so I assume it is meant to affect that chip as well. But I reported back to them that it did not help me.
In contrast, there is something else that does help prevent crashes on gfx1103 (7840HS):
amdgpu.cwsr_enable=0
Maybe that could help the Halo also?
Note, on the gfx1103, you still get one crash after a reboot, but then it runs OK after that.
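For anyone trying this, a minimal Python sketch to confirm the parameter actually took effect after a reboot. It assumes the option was passed on the kernel command line and that amdgpu exposes it under `/sys/module/amdgpu/parameters/cwsr_enable`, as module parameters normally are; the helper name `cmdline_param` is made up for this example.

```python
from pathlib import Path


def cmdline_param(cmdline: str, name: str):
    """Return the value of `name=value` from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


def read_text_or_none(path: str):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("cmdline:", cmdline_param(cmdline, "amdgpu.cwsr_enable"))
    # Assumed sysfs location; None here means the module is not loaded
    # or the file is not exposed on this kernel.
    print("sysfs:  ", read_text_or_none("/sys/module/amdgpu/parameters/cwsr_enable"))
```

If both lines print `0`, the setting is active; `None` on the first line means the option never reached the kernel command line.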