OK, now I have TheRock compiled (“all”) for target gfx1103. I see .so and header files in build/artifacts. I know these files must be on the PATH or installed as part of my Linux distribution (I used an Ubuntu 24.04 Docker container to compile), but I’m unsure whether there is a script or command to automate the process. I see a Python script to archive the artifacts, but then how/where do I install them?
So, to speed up my work, I now need some information on how to install the libs.
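For now, the only thing I’ve found is pointing the usual ROCm environment variables at the build tree. A minimal sketch, assuming TheRock’s build/dist/rocm mirrors a regular /opt/rocm layout (the checkout path is a placeholder, and I don’t know if this is the intended install method):
# sketch: use TheRock's build tree in place of /opt/rocm
export ROCM_PATH="$HOME/TheRock/build/dist/rocm"   # placeholder checkout path
export PATH="$ROCM_PATH/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_PATH/lib:$LD_LIBRARY_PATH"
export CPATH="$ROCM_PATH/include:$CPATH"           # headers for compiling against it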
Also, PyTorch is a vital part of model inference, and ComfyUI uses it. How should I compile it for gfx1103? Can I use a pre-built one from the official ROCm?
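What I’ve seen suggested for gfx1103, but have not verified myself, is to use the official ROCm wheels and override the reported gfx version, since the official builds don’t ship gfx1103 kernels (the index URL and override value below are the commonly reported ones, not something I’ve tested):
# unverified workaround: official ROCm wheel + gfx version override
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
HSA_OVERRIDE_GFX_VERSION=11.0.0 python -c "import torch; print(torch.cuda.is_available())"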
Sorry, I wish I could help with those questions, but I’m a sideliner for TheRock; I know of some things going on with it but have no direct experience myself.
If their docs aren’t clear and James doesn’t know, you should post a question on their GitHub.
Thanks, I’ll wait for @James3 to comment, especially if he has PyTorch for gfx1103 or gfx110x.
A report using TheRock: I replaced /opt/rocmxxx/ with build/dist/rocm for now, but this is not the standard way to do it. Anyway …
Using the latest TheRock compiled for gfx1103, I get a 719 error at the sync function when I multiply 30,000×30,000 matrices with rocm-rust. With my TheRock compilation, I don’t have to override any environment variables.
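For context, error 719 is hipErrorLaunchFailure, and it only surfaces at the synchronization point because the GEMM launch itself is asynchronous. Below is my reconstruction of the failing pattern as a plain HIP/hipBLAS C++ sketch (this is not the actual rocm-rust code, and it uses the classic hipblasComplex interface; data initialization is omitted):
// reconstruction of the failing pattern, not the actual rocm-rust code
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#include <cstdio>

int main() {
    const int n = 30000;                                          // matches the log below
    const size_t bytes = (size_t)n * n * sizeof(hipblasComplex);  // ~7200 MB per matrix
    hipblasComplex *a, *b, *c;
    printf("hipMalloc1: %d\n", (int)hipMalloc(&a, bytes));
    printf("hipMalloc2: %d\n", (int)hipMalloc(&b, bytes));
    printf("hipMalloc3: %d\n", (int)hipMalloc(&c, bytes));
    // (the real test fills A and B with 2+0i here)

    hipblasHandle_t h;
    hipblasCreate(&h);
    const hipblasComplex alpha(1.0f, 0.0f), beta(0.0f, 0.0f);
    // the cgemm call itself returns success; the work is only queued
    hipblasCgemm(h, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                 &alpha, a, n, b, n, &beta, c, n);
    // ...and the failure only shows up here, as 719 (hipErrorLaunchFailure)
    hipError_t err = hipDeviceSynchronize();
    printf("hipSyncError: %d (%s)\n", (int)err, hipGetErrorString(err));
    hipblasDestroy(h);
    return 0;
}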
rocm-rust Logs:
./rocm-rust/target/release/rocm-rust 30000
init done
status of create_handle: 0
handle: 0x5afd85cfb100
46 42 41 40 00 00 40 40
46 42 41 40 00 00 40 40
matrix rows: 30000 cols: 30000 size: 900000000 ram: 7200 MB
hipMalloc1: 0
hipMalloc2: 0
hipMalloc3: 0
Matrix A (input):
2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i
00 00 00 40 00 00 00 00
Matrix B (input):
2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i 2+0i
00 00 00 40 00 00 00 00
Matrix C (input):
0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i
elem_size for a complex f32: 8
elem_size for cgemm: 8
matrix rows: 30000 cols: 30000 size: 900000000 ram: 7200 MB
Start calc1 -------------------
End calc1 -------------------
Start sync -------------------
hipSyncError: 719
and our familiar dmesg log:
[95634.992586] amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[95634.992594] amdgpu 0000:64:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[95634.992597] amdgpu 0000:64:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[95634.992601] amdgpu 0000:64:00.0: amdgpu: Failed to evict queue 1
[95634.992634] amdgpu 0000:64:00.0: amdgpu: GPU reset begin!
[95634.992771] amdgpu 0000:64:00.0: amdgpu: Failed to evict process queues
[95634.992779] amdgpu: Failed to quiesce KFD
[95634.992840] amdgpu 0000:64:00.0: amdgpu: Dumping IP State
[95634.994801] amdgpu 0000:64:00.0: amdgpu: Dumping IP State Completed
[95637.040527] amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[95637.040534] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[95639.044818] amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[95639.044833] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[95639.047320] amdgpu 0000:64:00.0: amdgpu: MODE2 reset
[95639.085571] amdgpu 0000:64:00.0: amdgpu: GPU reset succeeded, trying to resume
[95639.087005] [drm] PCIE GART of 512M enabled (table at 0x000000803FD00000).
[95639.087168] amdgpu 0000:64:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[95639.087174] amdgpu 0000:64:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[95639.087183] amdgpu 0000:64:00.0: amdgpu: SMU is resuming...
[95639.090195] amdgpu 0000:64:00.0: amdgpu: SMU is resumed successfully!
[95639.097380] amdgpu 0000:64:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x08005000
[95639.377235] amdgpu 0000:64:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[95639.377242] amdgpu 0000:64:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[95639.377246] amdgpu 0000:64:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[95639.377248] amdgpu 0000:64:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[95639.377251] amdgpu 0000:64:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[95639.377253] amdgpu 0000:64:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[95639.377256] amdgpu 0000:64:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[95639.377258] amdgpu 0000:64:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[95639.377261] amdgpu 0000:64:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[95639.377264] amdgpu 0000:64:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[95639.377266] amdgpu 0000:64:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[95639.377269] amdgpu 0000:64:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[95639.377273] amdgpu 0000:64:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[95639.380346] amdgpu 0000:64:00.0: amdgpu: GPU reset(1) succeeded!
[95639.380357] amdgpu 0000:64:00.0: [drm] device wedged, but recovered through reset
I see that same message.
So, at least with my test program, other people can reproduce the crash I see in ROCm.
Hopefully, AMD can now reproduce the problem and fix it.
I also replaced the firmware blobs with the latest ones from the amdgpu directory of the kernel-firmware/linux-firmware repository on GitLab, then ran update-initramfs, but nothing changed.
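For reference, the swap was roughly the following (a sketch; note that Ubuntu 24.04 ships these files zstd-compressed, but the kernel loader also picks up plain .bin files):
# sketch of the firmware swap; adjust paths to your setup
git clone https://gitlab.com/kernel-firmware/linux-firmware.git
sudo cp linux-firmware/amdgpu/*.bin /lib/firmware/amdgpu/
sudo update-initramfs -u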
With the updated firmware in place and the updated TheRock build, did you also try that LR patch I mentioned? From James’ results I don’t think it helps, but it would be good to double-confirm.
And please get all of this into a bug report on GitHub so the folks who work on this can get eyes on it.
There is a bug raised already for this.
This is a bit complicated because of time constraints. Maybe I will get the upstream kernel, compile it over the weekend, and report back.
I will also report on the same issue James opened.
As a side note: I’m really confused by the 7840HS identifying as gfx1103. I just looked at a 7640U on my side (which should have the same graphics as the 7840HS) and it’s 11.0.1.
$ cat /proc/cpuinfo |grep "model name" -i | head -n1
model name : AMD Ryzen 5 7640U w/ Radeon 760M Graphics
$ grep . /sys/class/drm/card0/device/ip_discovery/die/0/GC/0/{major,minor,revision}
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/major:11
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/minor:0
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/revision:1
But it seems like mesa is reporting the same thing as ROCm is. Very strange to me.
$ sudo DISPLAY=:0 glxinfo | grep -i device
Device: AMD Radeon 760M (radeonsi, gfx1103_r1, LLVM 19.1.1, DRM 3.64, 6.17.0-rc7-00003-g4186d1107771) (0x15bf)
I guess it would be an easy fix if the bug is just ROCm wrongly identifying the GPU!
grep . /sys/class/drm/card0/device/ip_discovery/die/0/GC/0/{major,minor,revision}
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/major:11
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/minor:0
/sys/class/drm/card0/device/ip_discovery/die/0/GC/0/revision:1
Note: the 7640U has 8 GPU cores (760M).
https://www.amd.com/en/products/processors/laptop/ryzen/7000-series/amd-ryzen-5-7640u.html
The 7840HS has 12 GPU cores (780M, not a 760M). It has confused me why amdgpu_top only shows 8 temp/clock values for the GPU cores; I would have thought it should display 12.
https://www.amd.com/en/products/processors/laptop/ryzen/7000-series/amd-ryzen-7-7840hs.html
FW16 7840HS:
sudo glxinfo | grep -i device
Device: AMD Radeon Graphics (radeonsi, phoenix, LLVM 20.1.2, DRM 3.64, 6.16.8) (0x15bf)
amdgpu_top:
./target/debug/amdgpu_top -d
drm version: 3.64.0
Device Name : [AMD Radeon 780M Graphics]
PCI (domain:bus:dev.func): 0000:c1:00.0
DeviceID.RevID : 0x15BF.0xC2
gfx_target_version : gfx1103
rocminfo:
Agent 2
Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon 780M
I filed this with mesa to figure out what’s going on there:
Incorrect designation of gfx version? (mesa/mesa#13977)
I found out that the HW and SW versions don’t always match, so gfx1103 is correct for software.
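One way to confirm the software-side target directly is via the KFD topology, which exposes it independently of the hardware IP version (a sketch from memory; the node numbering varies, and on Phoenix I’d expect the value 110003, which decodes as major 11, minor 0, stepping 3, i.e. gfx1103):
$ grep gfx_target_version /sys/class/kfd/kfd/topology/nodes/*/properties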
@Mario_Limonciello, do you still want me to check with your patch? If it’s going to be available upstream soon, I can wait for the daily build from Ubuntu.
It won’t be in Ubuntu’s kernel soon; they don’t move that quickly. You would need to build your own kernel with it.
But James knows what he is doing; I trust James tested it effectively.
Never mind, I found the old script for patching and building that I made for my mbp16 years ago, and it is still working. I’m compiling 6.16 with your patch now.
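In case it helps anyone else, the core of that script is the standard Ubuntu kernel-package flow (a sketch; the patch file name is a placeholder):
# sketch: apply the patch and build a packaged kernel on Ubuntu
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.16.tar.xz
tar xf linux-6.16.tar.xz && cd linux-6.16
patch -p1 < ../mes-fix.patch                      # placeholder patch name
cp /boot/config-"$(uname -r)" .config
scripts/config --disable SYSTEM_TRUSTED_KEYS --disable SYSTEM_REVOCATION_KEYS
make olddefconfig
make -j"$(nproc)" bindeb-pkg
sudo dpkg -i ../linux-image-*.deb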
Hi. While I appreciate Mario’s confidence in me, I would caveat it.
If it is userspace or the Linux kernel, then yes, I can do it. But if it is debugging some code running on the GPU, that to me is a black box.
In userspace or the kernel, I can use gdb and friends; I don’t have any tools for the GPU side.
Then there is “TheRock”:
- It has no release tags on it, so I cannot force it to compile the 7.0.1 version of ROCm.
- It builds using the wrong location for include files; only when I actually delete the include files outside the TheRock tree does it look within its own tree for them. This caused compiles to fail.
- It fails to build out of the box on Ubuntu 24.04.
- It seems that AMD developers happily add commits that break the build. Surely CI/CD should prevent that.
- There are no instructions regarding which compiler and version it should be built with, or is compatible with.
- The build instructions are incomplete and provide no advice on what to do if compiles fail. In my view, nothing should be committed to a main branch if it won’t compile on all the platforms it is intended for.
Well, for the GPU side there are ROCgdb and umr, but I don’t have experience using them myself.
I guess I could try those, except I cannot even get them to compile from the TheRock repo.
I only got just enough of TheRock to compile to run my test program.
@James3 IDK if you already did this or not, but here is what I did to compile ROCgdb:
On Ubuntu 24.04, install the dependencies per https://rocm.docs.amd.com/projects/ROCgdb/en/latest/install/installation.html:
apt install bison flex gcc make ncurses-dev texinfo g++ zlib1g-dev \
libexpat-dev python3-dev liblzma-dev libgmp-dev libmpfr-dev
You may end up with a dependency tree error for libmpfr-dev; if so, do:
sudo apt install libmpfr6=4.2.1-1build1 libmpfr-dev
ROCdbgapi is required. Clone https://github.com/ROCm/ROCdbgapi and set CMAKE_INSTALL_PREFIX to where your ROCm is installed. For me, since I was using the official ROCm 7.0.0 before, I reused /opt/rocm-7.0.0/ for this variable. (I replaced the contents of that directory with TheRock’s build/dist.)
Then follow the build instructions in the ROCm/ROCdbgapi README for compilation.
Then clone https://github.com/ROCm/ROCgdb and follow the “Installing ROCgdb” page of the ROCgdb documentation for compilation. It is important to set PKG_CONFIG_PATH=/opt/rocm-7.0.0/share/pkgconfig from the previous step.
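Put together, the sequence looks roughly like this (a sketch of the steps above; the two ROCgdb configure flags shown are my assumption of the minimal set, so check its install docs for the full list):
# sketch: build ROCdbgapi against the existing /opt/rocm-7.0.0 tree
git clone https://github.com/ROCm/ROCdbgapi
cmake -S ROCdbgapi -B ROCdbgapi/build -DCMAKE_INSTALL_PREFIX=/opt/rocm-7.0.0
cmake --build ROCdbgapi/build && sudo cmake --install ROCdbgapi/build

# sketch: then ROCgdb, with pkg-config pointed at the amd-dbgapi just installed
git clone https://github.com/ROCm/ROCgdb && cd ROCgdb
PKG_CONFIG_PATH=/opt/rocm-7.0.0/share/pkgconfig ./configure --program-prefix=roc \
  --enable-targets="x86_64-linux-gnu,amdgcn-amd-amdhsa"
make -j"$(nproc)"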
I hope this helps.
I’m off to do some tests with rocgdb today.