OK, now I have TheRock compiled (“all”) for target gfx1103. I see .so and header files in build/artifacts. I know these files must be on the PATH or installed as part of my Linux distribution (I used an Ubuntu 24.04 Docker container to compile), but I’m unsure whether there is a script or command to automate the process. I see a Python script to archive the artifacts, but then how/where do I install them?
So, to speed up my work, I now need some information on how to install the libs.
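For now, the only thing I’ve found is pointing the usual ROCm environment variables at the build tree. A minimal sketch, assuming TheRock’s build/dist/rocm mirrors a regular /opt/rocm layout (the checkout path is a placeholder, and I don’t know if this is the intended install method):
# sketch: use TheRock's build tree in place of /opt/rocm
export ROCM_PATH="$HOME/TheRock/build/dist/rocm"   # placeholder checkout path
export PATH="$ROCM_PATH/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_PATH/lib:$LD_LIBRARY_PATH"
export CPATH="$ROCM_PATH/include:$CPATH"           # headers for compiling against it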
Also, PyTorch is a vital part of model inference, and ComfyUI uses it. How should I compile it for gfx1103? Can I use a pre-built one from the official ROCm?
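What I’ve seen suggested for gfx1103, but have not verified myself, is to use the official ROCm wheels and override the reported gfx version, since the official builds don’t ship gfx1103 kernels (the index URL and override value below are the commonly reported ones, not something I’ve tested):
# unverified workaround: official ROCm wheel + gfx version override
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
HSA_OVERRIDE_GFX_VERSION=11.0.0 python -c "import torch; print(torch.cuda.is_available())"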
Sorry, I wish I could help with those questions, but I’m a sideliner for TheRock; I know of some things going on with it but have no direct experience myself.
If their docs aren’t clear and James doesn’t know, you should post a question on their GitHub.
Thanks, I’ll wait for @James3 to comment, especially if he has PyTorch for gfx1103 or gfx110x.
A report using TheRock: I replaced /opt/rocmxxx/ with build/dist/rocm for now, but this is not the standard way to do it. Anyway …
Using the latest TheRock compiled for gfx1103, I get a 719 error at the sync function when I multiply 30,000×30,000 matrices with rocm-rust. With my TheRock compilation, I don’t have to override any environment variables.
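For context, error 719 is hipErrorLaunchFailure, and it only surfaces at the synchronization point because the GEMM launch itself is asynchronous. Below is my reconstruction of the failing pattern as a plain HIP/hipBLAS C++ sketch (this is not the actual rocm-rust code, and it uses the classic hipblasComplex interface; data initialization is omitted):
// reconstruction of the failing pattern, not the actual rocm-rust code
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#include <cstdio>

int main() {
    const int n = 30000;                                          // matches the log below
    const size_t bytes = (size_t)n * n * sizeof(hipblasComplex);  // ~7200 MB per matrix
    hipblasComplex *a, *b, *c;
    printf("hipMalloc1: %d\n", (int)hipMalloc(&a, bytes));
    printf("hipMalloc2: %d\n", (int)hipMalloc(&b, bytes));
    printf("hipMalloc3: %d\n", (int)hipMalloc(&c, bytes));
    // (the real test fills A and B with 2+0i here)

    hipblasHandle_t h;
    hipblasCreate(&h);
    const hipblasComplex alpha(1.0f, 0.0f), beta(0.0f, 0.0f);
    // the cgemm call itself returns success; the work is only queued
    hipblasCgemm(h, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                 &alpha, a, n, b, n, &beta, c, n);
    // ...and the failure only shows up here, as 719 (hipErrorLaunchFailure)
    hipError_t err = hipDeviceSynchronize();
    printf("hipSyncError: %d (%s)\n", (int)err, hipGetErrorString(err));
    hipblasDestroy(h);
    return 0;
}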
rocm-rust Logs:
./rocm-rust/target/release/rocm-rust 30000
init done
status of create_handle: 0
handle: 0x5afd85cfb100
46 42 41 40 00 00 40 40
46 42 41 40 00 00 40 40
matrix rows: 30000 cols: 30000 size: 900000000 ram: 7200 MB
hipMalloc1: 0
hipMalloc2: 0
hipMalloc3: 0
Matrix A (input):
2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i
00 00 00 40 00 00 00 00
Matrix B (input):
2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i
00 00 00 40 00 00 00 00
Matrix C (input):
0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i
elem_size for a complex f32: 8
elem_size for cgemm: 8
matrix rows: 30000 cols: 30000 size: 900000000 ram: 7200 MB
Start calc1 -------------------
End calc1 -------------------
Start sync -------------------
hipSyncError: 719
and our familiar dmesg log:
[95634.992586] amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[95634.992594] amdgpu 0000:64:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[95634.992597] amdgpu 0000:64:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[95634.992601] amdgpu 0000:64:00.0: amdgpu: Failed to evict queue 1
[95634.992634] amdgpu 0000:64:00.0: amdgpu: GPU reset begin!
[95634.992771] amdgpu 0000:64:00.0: amdgpu: Failed to evict process queues
[95634.992779] amdgpu: Failed to quiesce KFD
[95634.992840] amdgpu 0000:64:00.0: amdgpu: Dumping IP State
[95634.994801] amdgpu 0000:64:00.0: amdgpu: Dumping IP State Completed
[95637.040527] amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[95637.040534] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[95639.044818] amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[95639.044833] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[95639.047320] amdgpu 0000:64:00.0: amdgpu: MODE2 reset
[95639.085571] amdgpu 0000:64:00.0: amdgpu: GPU reset succeeded, trying to resume
[95639.087005] [drm] PCIE GART of 512M enabled (table at 0x000000803FD00000).
[95639.087168] amdgpu 0000:64:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[95639.087174] amdgpu 0000:64:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[95639.087183] amdgpu 0000:64:00.0: amdgpu: SMU is resuming...
[95639.090195] amdgpu 0000:64:00.0: amdgpu: SMU is resumed successfully!
[95639.097380] amdgpu 0000:64:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x08005000
[95639.377235] amdgpu 0000:64:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[95639.377242] amdgpu 0000:64:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[95639.377246] amdgpu 0000:64:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[95639.377248] amdgpu 0000:64:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[95639.377251] amdgpu 0000:64:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[95639.377253] amdgpu 0000:64:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[95639.377256] amdgpu 0000:64:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[95639.377258] amdgpu 0000:64:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[95639.377261] amdgpu 0000:64:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[95639.377264] amdgpu 0000:64:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[95639.377266] amdgpu 0000:64:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[95639.377269] amdgpu 0000:64:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[95639.377273] amdgpu 0000:64:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[95639.380346] amdgpu 0000:64:00.0: amdgpu: GPU reset(1) succeeded!
[95639.380357] amdgpu 0000:64:00.0: [drm] device wedged, but recovered through reset
I see that same message.
So, at least with my test program, other people can reproduce the crash I see in ROCm.
Hopefully, AMD can now reproduce the problem and fix it.
I also replaced the firmware blobs with the latest ones from the amdgpu directory of the kernel-firmware/linux-firmware repository on GitLab, then ran update-initramfs, but nothing changed.
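For reference, the swap was roughly the following (a sketch; note that Ubuntu 24.04 ships these files zstd-compressed, but the kernel loader also picks up plain .bin files):
# sketch of the firmware swap; adjust paths to your setup
git clone https://gitlab.com/kernel-firmware/linux-firmware.git
sudo cp linux-firmware/amdgpu/*.bin /lib/firmware/amdgpu/
sudo update-initramfs -u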
With the updated firmware in place and the updated TheRock build, did you also try that LR patch I mentioned? From James’ results I don’t think it helps, but it would be good to double-confirm.
And please get all of this into a bug report on GitHub so the folks who work on this can get eyes on it.
There is a bug raised already for this.
This is a bit complicated because of time constraints. Maybe I will get the upstream kernel, compile it over the weekend, and report back.
I will also report on the same issue James opened.
As a side note: I’m really confused by the 7840HS identifying as gfx1103. I just looked at a 7640U on my side (which should have the same graphics as the 7840HS) and it’s 11.0.1.
$ cat /proc/cpuinfo |grep "model name" -i | head -n1
model name : AMD Ryzen 5 7640U w/ Radeon 760M Graphics
$ grep . /sys/class/drm/card0/device/ip_discovery/die/0/GC/0/{major,minor,revision}
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/major:11
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/minor:0
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/revision:1
But it seems like mesa is reporting the same thing as ROCm is. Very strange to me.
$ sudo DISPLAY=:0 glxinfo | grep -i device
Device: AMD Radeon 760M (radeonsi, gfx1103_r1, LLVM 19.1.1, DRM 3.64, 6.17.0-rc7-00003-g4186d1107771) (0x15bf)
I guess it would be an easy fix if the bug is just ROCm wrongly identifying the GPU!
grep . /sys/class/drm/card0/device/ip_discovery/die/0/GC/0/{major,minor,revision}
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/major:11
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/minor:0
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/revision:1
Note: the 7640U has 8 GPU cores (760M).
https://www.amd.com/en/products/processors/laptop/ryzen/7000-series/amd-ryzen-5-7640u.html
The 7840HS has 12 GPU cores (780M, not a 760M). It has confused me why amdgpu_top only shows 8 temp/clock values for the GPU cores; I would have thought it should display 12.
https://www.amd.com/en/products/processors/laptop/ryzen/7000-series/amd-ryzen-7-7840hs.html
FW16 7840HS:
sudo glxinfo | grep -i device
Device: AMD Radeon Graphics (radeonsi, phoenix, LLVM 20.1.2, DRM 3.64, 6.16.8) (0x15bf)
amdgpu_top:
./target/debug/amdgpu_top -d
drm version: 3.64.0
Device Name : [AMD Radeon 780M Graphics]
PCI (domain:bus:dev.func): 0000:c1:00.0
DeviceID.RevID : 0x15BF.0xC2
gfx_target_version : gfx1103
rocminfo:
Agent 2
Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon 780M
I filed this with mesa to figure out what’s going on there:
Incorrect designation of gfx version? (mesa/mesa#13977)
I found out that the HW and SW versions don’t always match, so gfx1103 is correct for software.
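One way to confirm the software-side target directly is via the KFD topology, which exposes it independently of the hardware IP version (a sketch from memory; the node numbering varies, and on Phoenix I’d expect the value 110003, which decodes as major 11, minor 0, stepping 3, i.e. gfx1103):
$ grep gfx_target_version /sys/class/kfd/kfd/topology/nodes/*/properties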
@Mario_Limonciello, do you still want me to check with your patch? If it’s going to be available upstream soon, I can wait for the daily build from Ubuntu.
It won’t be in Ubuntu’s kernel soon; they don’t move that quickly. You would need to build your own kernel with it.
But James knows what he is doing; I trust James tested it effectively.
Never mind, I found the old script for patching and building that I made for my mbp16 years ago, and it is still working. I’m compiling 6.16 with your patch now.
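In case it helps anyone else, the core of that script is the standard Ubuntu kernel-package flow (a sketch; the patch file name is a placeholder):
# sketch: apply the patch and build a packaged kernel on Ubuntu
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.16.tar.xz
tar xf linux-6.16.tar.xz && cd linux-6.16
patch -p1 < ../mes-fix.patch                      # placeholder patch name
cp /boot/config-"$(uname -r)" .config
scripts/config --disable SYSTEM_TRUSTED_KEYS --disable SYSTEM_REVOCATION_KEYS
make olddefconfig
make -j"$(nproc)" bindeb-pkg
sudo dpkg -i ../linux-image-*.deb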
Hi. While I appreciate Mario’s confidence in me, I would caveat it.
If it is userspace or the Linux kernel, then yes, I can do it. But if it is debugging some code running on the GPU, that to me is a black box.
In userspace or the kernel, I can use gdb and friends; I don’t have any tools for the GPU side.
Then there is “TheRock”:
- It has no release tags on it, so I cannot force it to compile the 7.0.1 version of ROCm.
- It builds using the wrong location for include files; only when I actually delete the include files outside the TheRock tree does it look within its own tree for them. This caused compiles to fail.
- It fails to build out of the box on Ubuntu 24.04.
- It seems that AMD developers happily add commits that break the build. Surely CI/CD should prevent that.
- There are no instructions regarding which compiler and version it should be built with, or is compatible with.
- The build instructions are incomplete and provide no advice on what to do if compiles fail. In my view, nothing should be committed to a main branch if it won’t compile on all the platforms it is intended for.
Well, for the GPU side there are ROCgdb and umr, but I don’t have experience using them myself.
I guess I could try those, except I cannot even get them to compile from the TheRock repo.
I only got just enough of TheRock to compile to run my test program.
@James3 IDK if you already did this or not, but here is what I did to compile ROCgdb:
On Ubuntu 24.04, install the dependencies per https://rocm.docs.amd.com/projects/ROCgdb/en/latest/install/installation.html:
apt install bison flex gcc make ncurses-dev texinfo g++ zlib1g-dev \
libexpat-dev python3-dev liblzma-dev libgmp-dev libmpfr-dev
You may end up with a dependency tree error for libmpfr-dev; if so, do:
sudo apt install libmpfr6=4.2.1-1build1 libmpfr-dev
ROCdbgapi is required. Clone https://github.com/ROCm/ROCdbgapi and set CMAKE_INSTALL_PREFIX to where your ROCm is installed. For me, since I was using the official ROCm 7.0.0 before, I reused /opt/rocm-7.0.0/ for this variable. (I replaced the contents of that directory with TheRock’s build/dist.)
Then follow the build instructions in the ROCm/ROCdbgapi README for compilation.
Then clone https://github.com/ROCm/ROCgdb and follow the “Installing ROCgdb” page of the ROCgdb documentation for compilation. It is important to set PKG_CONFIG_PATH=/opt/rocm-7.0.0/share/pkgconfig from the previous step.
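Put together, the sequence looks roughly like this (a sketch of the steps above; the two ROCgdb configure flags shown are my assumption of the minimal set, so check its install docs for the full list):
# sketch: build ROCdbgapi against the existing /opt/rocm-7.0.0 tree
git clone https://github.com/ROCm/ROCdbgapi
cmake -S ROCdbgapi -B ROCdbgapi/build -DCMAKE_INSTALL_PREFIX=/opt/rocm-7.0.0
cmake --build ROCdbgapi/build && sudo cmake --install ROCdbgapi/build

# sketch: then ROCgdb, with pkg-config pointed at the amd-dbgapi just installed
git clone https://github.com/ROCm/ROCgdb && cd ROCgdb
PKG_CONFIG_PATH=/opt/rocm-7.0.0/share/pkgconfig ./configure --program-prefix=roc \
  --enable-targets="x86_64-linux-gnu,amdgcn-amd-amdhsa"
make -j"$(nproc)"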
I hope this helps.
I’m off to do some tests with rocgdb today.