FW16 ROCm?

I’m trying to get ROCm working on a FW16 with the HX 370 and its 890M iGPU. On Fedora 43 I’ve tried the distro’s ROCm 6.4.4 packages with torch, as well as AMD’s ROCm 7.2 with AMD’s torch+rocm7.2 build.

When I run either on the CPU with the Hugging Face diffusers library, generation works. When I set device_map to “cuda” to use the GPU through ROCm, each iteration runs on the GPU, but once progress reaches 100% it just freezes, and sometimes after a while the GPU resets.

If I run a simple PyTorch tensor calculation on the GPU, that works, but anything more demanding seems to fail.
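One way to narrow down where "more demanding" starts is a ladder of growing matrix multiplications, synchronizing after each step so a hang is pinned to a specific size. This is only a rough sketch: the helper names `doubling_sizes` and `run_matmul_ladder` and the size range are made up for illustration, and it assumes a torch build where the ROCm GPU shows up as the "cuda" device.

```python
import time


def doubling_sizes(start=512, stop=8192):
    """Return matrix sizes start, 2*start, ... up to and including stop."""
    sizes = []
    n = start
    while n <= stop:
        sizes.append(n)
        n *= 2
    return sizes


def run_matmul_ladder(sizes):
    """Run an n x n matmul on the GPU for each size; return (n, seconds) pairs."""
    try:
        import torch
    except ImportError:
        print("torch is not installed; nothing to test")
        return []
    if not torch.cuda.is_available():
        print("no ROCm/CUDA device visible to torch")
        return []
    results = []
    for n in sizes:
        a = torch.randn(n, n, device="cuda")
        b = torch.randn(n, n, device="cuda")
        start = time.perf_counter()
        c = a @ b
        # Block until the kernel has finished, so a hang shows up at this size.
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        # flush=True so the last successful size is visible even if the next hangs.
        print(f"{n}x{n} matmul ok in {elapsed:.3f}s", flush=True)
        results.append((n, elapsed))
    return results


if __name__ == "__main__":
    run_matmul_ladder(doubling_sizes())
```

The last size printed before a freeze gives a rough idea of the workload at which the problem begins, which is useful to include alongside the kernel logs.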

I believe gfx1150 support was only added in the latest 7.2 release? Has anyone here successfully set up ROCm? If so, which versions of ROCm and PyTorch did you use (from the PyTorch or AMD repo?), and on which distro?

Are you able to capture any logs from the crash and post them here?
We need to know what sort of crash it is before we can help, so something like a stack trace from dmesg would be useful.
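If the full kernel log is too noisy to post, a small filter can pull out the GPU-related lines first. The sketch below is only an illustration: the function name `gpu_related_lines` and the keyword list are assumptions, not an official set of amdgpu messages, so widen the list if it misses something.

```python
# Substrings that commonly mark GPU driver trouble. This list is an
# assumption for illustration, not an official set of amdgpu messages.
KEYWORDS = ("amdgpu", "drm", "gpu reset", "timeout")


def gpu_related_lines(log_text, keywords=KEYWORDS):
    """Return the log lines that contain any keyword, case-insensitively."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line)
    return hits
```

To use it, save the kernel log to a file, for example with `journalctl -k -b -1 --no-pager > kernel.log` for the previous boot (after a full machine reset) or `journalctl -k -b --no-pager` for the current boot (after a GPU-only reset), then pass the file's text to `gpu_related_lines`.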

I have noticed that if I boot the kernel with amd.cwsr_enable=0 and drop the generation width/height from 1024 to 512, it generates successfully in about 3-5 seconds. If I remove that kernel parameter or increase the size to 1024, it runs to 100% complete but then stays there with the GPU at 100% usage.
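It may be worth confirming that the parameter actually reached the driver. The kernel module is named amdgpu, so the sketch below looks for any command-line entry containing "cwsr" and separately reads the live value from sysfs. The helper names are made up for illustration, and the sysfs path assumes the amdgpu module is loaded and exports `cwsr_enable` there; if not, the reader simply returns None.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_module_param(module, param):
    """Return the live value of a module parameter from sysfs, or None if absent."""
    path = Path("/sys/module") / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    # Show how the parameter was spelled on the boot command line, if at all.
    for name, value in parse_cmdline(cmdline).items():
        if "cwsr" in name:
            print(f"cmdline: {name}={value}")
    # Show what the loaded driver reports.
    print("amdgpu cwsr_enable (sysfs):", read_module_param("amdgpu", "cwsr_enable"))
```

If the command line and the sysfs value disagree, the parameter was probably not picked up by the driver, which would change how the 512 vs 1024 result should be read.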

Sometimes the GPU fully resets; sometimes the entire machine resets.

I’ll see what I can find in the logs and post.

The logs with the stack trace from the crash are too large to include here, so I’ve uploaded them to Dropbox:

journalctl with stack: Dropbox

rocminfo: Dropbox

pip list: Dropbox

kernel version: 6.17.1-300.fc43.x86_64 (same behaviour on 6.18.7-200.fc43.x86_64)

I’m leaning towards an amdgpu issue, but I don’t want to rule out my setup as a cause either. I’ve tried various combinations, including the Fedora-packaged ROCm 6.4.4 with PyTorch builds from both the AMD and PyTorch repos that were built with ROCm. I’ve also tried installing ROCm 7.2 by adding the AMD repos for rocm/amdgraphics, installing those packages, and using torch+rocm7.2 from AMD.

In every case where torch.cuda.is_available() reports True, the hang or GPU reset happens at 100%.
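With several ROCm and torch combinations in play, a short report of what each Python environment actually loads can help others compare setups. This is a minimal sketch: `describe_torch_backend` is a made-up helper name, and the `gcnArchName` property is read defensively because it may not exist on every torch build.

```python
def describe_torch_backend():
    """Collect basic facts about the torch build and the GPU it can see."""
    info = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return info
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # torch.version.hip is set on ROCm builds and None on other builds.
    info["hip_version"] = getattr(torch.version, "hip", None)
    info["gpu_available"] = torch.cuda.is_available()
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        props = torch.cuda.get_device_properties(0)
        # Assumption: ROCm builds expose the gfx target here; fall back to None.
        info["arch"] = getattr(props, "gcnArchName", None)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_backend().items():
        print(f"{key}: {value}")
```

Running this once per virtual environment and posting the output next to the pip list makes it easier to see which torch build was paired with which ROCm version when the hang occurred.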

This is on a FW16 with the HX 370 and its 890M GPU (gfx1150).

Is there anything else I could run or try that might provide further log details?