Pre-emption is how you debug these kinds of problems (with something like ROCgdb). If you turn off CWSR, you can no longer pre-empt workloads. I am worried that turning off CWSR could actually be causing the issue, but let’s see.
Moving it to the CPU will cost a lot of performance, but it sometimes helps with these types of problems too.
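
Before testing further, it may help to confirm whether CWSR is actually on or off on your machine. Here is a minimal sketch, assuming the amdgpu driver exposes its `cwsr_enable` module parameter at the usual sysfs location; the path may differ on your kernel, and `cwsr_state` is just a hypothetical helper name:

```python
from pathlib import Path

# Assumption: amdgpu module parameters are exposed under sysfs at this path.
CWSR_PARAM = Path("/sys/module/amdgpu/parameters/cwsr_enable")


def cwsr_state(param_path: Path = CWSR_PARAM) -> str:
    """Return 'enabled', 'disabled', or 'unknown' for the CWSR setting."""
    try:
        raw = param_path.read_text().strip()
    except OSError:
        # Missing file (no amdgpu module loaded) or no permission to read it.
        return "unknown"
    if raw == "1":
        return "enabled"
    if raw == "0":
        return "disabled"
    return "unknown"


if __name__ == "__main__":
    print(f"CWSR is {cwsr_state()}")
```

If this reports `unknown`, the parameter file was not readable at that path, so check how the amdgpu module was loaded on your system instead.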