Hi,
This is just a small email that might help anyone using ROCm on AMD FW13/16 laptops.
I have been having crash problems for a while, and I have finally found a workaround.
It does not cover some edge cases, but for me, it is a workable solution.
It helps reduce ROCm crashes. You might still get one crash after a reboot, but after that it is fine.
There are more details here:
Specifically, boot with this kernel command-line parameter:
amdgpu.cwsr_enable=0
or put this line in a .conf file in /etc/modprobe.d:
options amdgpu cwsr_enable=0
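
To check after a reboot that the setting actually took effect, something like the small Python sketch below might help. It assumes the amdgpu module exposes the parameter under /sys/module/amdgpu/parameters/cwsr_enable, which is the usual sysfs location for a loaded module's parameters. The helper names read_cwsr_enable and describe are made up for this example.

```python
from pathlib import Path

# Assumed sysfs location of the parameter for a loaded amdgpu module.
CWSR_PARAM = Path("/sys/module/amdgpu/parameters/cwsr_enable")


def read_cwsr_enable(param_path=CWSR_PARAM):
    """Return the current cwsr_enable value as an int, or None if unavailable."""
    try:
        return int(Path(param_path).read_text().strip())
    except (OSError, ValueError):
        # File missing (module not loaded) or content not an integer.
        return None


def describe(value):
    """Turn the raw value into a short human-readable status line."""
    if value is None:
        return "cwsr_enable not readable (is amdgpu loaded?)"
    if value == 0:
        return "cwsr_enable=0: workaround is active"
    return f"cwsr_enable={value}: workaround is NOT active"


if __name__ == "__main__":
    print(describe(read_cwsr_enable()))
```

If it reports that the workaround is not active, the option was probably not picked up at boot, so it is worth re-checking the kernel command line or the modprobe.d file.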
It seems to fix the ROCm GPU crash problems that I have observed in ROCm 6.x and 7.x.
LLMs/AI/ML seem quite reliable now on my FW16 7840HS.
AMD will need to fix the cwsr bug, but at least we have a workaround in the short term.