Issues with PyTorch on Framework 13 GPU

I’ve been trying to run torch on my Framework 13 (AMD Ryzen™ AI 9 HX 370, Fedora Linux 43) but no matter what I try I always hit the same error:

Memory access fault by GPU node-1 (Agent handle: 0x44210e40) on address 0x7f5fe87ab000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Specifically I have the following test program:

import math
import torch

# Validate the GPU backend is available (ROCm builds of torch expose it through the CUDA API)
print(f"Is CUDA available? {torch.cuda.is_available()}")
if not torch.cuda.is_available():
    exit(-1)
print(f"Device name? {torch.cuda.get_device_name(0)}")

def test(device_type="cpu"):
    device = torch.device(device_type)
    print(f"Running test for {device_type}")
    try:
        x = torch.tensor([1.0, 2.0, 3.0], device=device)
        y = x.sum()
        print(f"Torch works on {device}: result = {y.item()}")
    except Exception as e:
        print(f"An error occurred while using {device}: {e}")

test("cpu")
test("cuda")

And it prints out:

Is CUDA available? True
Device name? AMD Radeon Graphics
Running test for cpu
Torch works on cpu: result = 6.0
Running test for cuda
Memory access fault by GPU node-1 (Agent handle: 0xc217460) on address 0x7fcefdab0000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

This happens no matter how I run it: directly on the base system and in containers (images from the rocm/pytorch repository, under both Podman and Docker).

I tried different kernel parameters (amdgpu.cwsr_enable=0, iommu=pt, …), various environment variables (variations of HSA_OVERRIDE_GFX_VERSION, HSA_ENABLE_SDMA=0), and different versions of the ROCm libraries (6.4.4, 7.1.*), …
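For reference, the workarounds above were applied roughly like this (a sketch; the override values and the script name test_torch.py are illustrative, not known-good fixes):

```shell
# Add a kernel boot parameter on Fedora (grubby manages the GRUB entries)
sudo grubby --update-kernel=ALL --args="amdgpu.cwsr_enable=0"

# Try ROCm/HSA environment overrides before launching the test script
# (the gfx version value here is illustrative)
HSA_OVERRIDE_GFX_VERSION=11.0.0 HSA_ENABLE_SDMA=0 python test_torch.py
```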

My questions here are:

  • Are you able to run the test program* on your Linux machine (as opposed to just torch.cuda.is_available())? If so, is your Framework running Fedora or some other OS?
  • Has anyone experienced the issue I’m facing?
  • Does anyone have any more ideas on what to try?

*If you don’t trust the code I provided above, you can also try AMD’s own examples here

It seems to be related to this issue on kernel 6.17.9. FYI linux-firmware-amdgpu 20251125 breaks rocm on AI Max 395/8060S
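To check whether you have the affected firmware build (a sketch; on Fedora the amdgpu firmware ships in the amd-gpu-firmware package, and the package name may differ on other distros):

```shell
# Show the installed amdgpu firmware package version on Fedora
rpm -q amd-gpu-firmware

# And the running kernel, to compare against the 6.17.9 regression
uname -r
```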


Thank you @Gilberto_Tin! You’re spot on. I reverted to 6.17.8 (by selecting it in the GRUB menu on boot) and it worked!
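If you’d rather not pick the entry at every boot, you can make the older kernel the default instead (a sketch; the exact vmlinuz path and version string are illustrative and depend on your install):

```shell
# List the installed kernel versions
rpm -q kernel

# Make the 6.17.8 entry the default boot target (path is illustrative)
sudo grubby --set-default /boot/vmlinuz-6.17.8-300.fc43.x86_64
```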


Hi -

Facing the same issue with PyTorch / ROCm. Rolling back to kernel 6.17.8 seems to have fixed PyTorch for me.

(This post previously mentioned a Bluetooth problem that was later determined not to be caused by the rollback, so that part has been removed.)

Thanks,

Ryan

The revert is upstream already. It was requested here for Fedora, but no one with permissions in Fedora has acted on it yet.

Does this issue just impact the AMD Ryzen AI 9 HX 370 (which I think has the 890M iGPU), or does it also impact the Ryzen 7 7840U (with the 780M iGPU)? I’m asking because the linked Bugzilla thread only mentions Strix Point/Halo and gfx1151.

I’m on the 780M (gfx1103) and trying to rule things out before I try reverting to an earlier kernel.
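For ruling things out, here is the gfx-target mapping as I understand it (gfx1103 and gfx1151 come straight from this thread; gfx1150 for Strix Point is my assumption, so please double-check):

```python
# Rough iGPU -> ROCm gfx target mapping for the chips discussed in this thread.
# gfx1150 for the 890M / Strix Point is an assumption, not confirmed here.
GFX_TARGETS = {
    "Radeon 780M (Phoenix, Ryzen 7 7840U)": "gfx1103",
    "Radeon 890M (Strix Point, Ryzen AI 9 HX 370)": "gfx1150",
    "Radeon 8060S (Strix Halo, AI Max+ 395)": "gfx1151",
}

for gpu, gfx in GFX_TARGETS.items():
    print(f"{gpu}: {gfx}")
```

If the ROCm tools are installed, you can confirm your own target with rocminfo and look for the gfx line in its agent list.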

It would affect all.