Issues running PyTorch on Framework 13 GPU
I’ve been trying to run PyTorch on my Framework 13 (AMD Ryzen™ AI 9 HX 370, Fedora Linux 43), but no matter what I try I always hit the same error:
```
Memory access fault by GPU node-1 (Agent handle: 0x44210e40) on address 0x7f5fe87ab000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
```
Specifically, I have the following test program:
```python
import torch

# Validate that the GPU ("CUDA" device) is available
print(f"Is CUDA available? {torch.cuda.is_available()}")
if not torch.cuda.is_available():
    exit(-1)

print(f"Device name? {torch.cuda.get_device_name(0)}")

def test(device_type="cpu"):
    device = torch.device(device_type)
    print(f"Running test for {device_type}")
    try:
        x = torch.tensor([1.0, 2.0, 3.0], device=device)
        y = x.sum()
        print(f"Torch works on {device}: result = {y.item()}")
    except Exception as e:
        print(f"An error occurred while using {device}: {e}")

test("cpu")
test("cuda")
```
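For anyone trying to reproduce this, here’s a slightly more granular variant of the same test (a sketch of my own, not an official AMD example) that separates allocation, the kernel launch, and the device-to-host copy. Since GPU execution is asynchronous, a fault from the `sum()` kernel may only surface at the `.item()` synchronization point:

```python
import torch

def stepwise_test(device_type="cuda"):
    device = torch.device(device_type)
    # Step 1: allocate a tensor on the device
    x = torch.tensor([1.0, 2.0, 3.0], device=device)
    print(f"[{device_type}] allocation ok")
    # Step 2: launch the reduction kernel (queued asynchronously on GPUs)
    y = x.sum()
    print(f"[{device_type}] kernel launch ok")
    # Step 3: copy the scalar back to the host; this forces a sync,
    # so an asynchronous GPU fault often only surfaces here
    result = y.item()
    print(f"[{device_type}] result = {result}")
    return result

# CPU run as a sanity check; switch to "cuda" to probe the GPU path
stepwise_test("cpu")
```

Seeing which of the three step messages is the last one printed before the abort narrows down where the memory access fault is triggered.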
And it prints out:
```
Is CUDA available? True
Device name? AMD Radeon Graphics
Running test for cpu
Torch works on cpu: result = 6.0
Running test for cuda
Memory access fault by GPU node-1 (Agent handle: 0xc217460) on address 0x7fcefdab0000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
```
This happens no matter how I run it: both directly on the base system and in containers (from the rocm/pytorch repository, on both Podman and Docker).
I tried different kernel parameters (`amdgpu.cwsr_enable=0`, `iommu=pt`, …), various environment flags (variations of `HSA_OVERRIDE_GFX_VERSION`, `HSA_ENABLE_SDMA=0`), different versions of the ROCm libraries (6.4.4, 7.1.*), …
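For anyone wanting to retrace these attempts on Fedora, this is roughly how they look (the `11.0.0` value is just one of the `HSA_OVERRIDE_GFX_VERSION` variations I tried, not a confirmed match for this iGPU, and `test_torch.py` is whatever filename you saved the test program under):

```shell
# Kernel parameters via grubby (Fedora's bootloader tool), then reboot
sudo grubby --update-kernel=ALL --args="amdgpu.cwsr_enable=0 iommu=pt"

# Per-invocation environment overrides for the ROCm runtime
HSA_OVERRIDE_GFX_VERSION=11.0.0 HSA_ENABLE_SDMA=0 python test_torch.py
```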
My questions here are:
- Are you able to run the test program* on your linux machine (as opposed to just `torch.cuda.is_available()`)? If so, is your Framework running Fedora or some other OS?
- Has anyone experienced the issue I’m facing?
- Does anyone have any more ideas on what to try?
*If you don’t trust the code I provided above, you can also try AMD’s own examples here