I have been playing with ROCm 6.3.1 on the FW16 without the dGPU, on Ubuntu 24.04.
If anyone sees this message:
rocBLAS error: Cannot read /opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103
List of available TensileLibrary Files :
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1151.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
Aborted (core dumped)
This appears to work around the problem:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
(The iGPU reports as gfx1103, which is missing from the list above; the override makes the runtime treat it as gfx1100, which does have a TensileLibrary file.)
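The override can also be set from inside the Rust program itself, as long as it happens before the first rocBLAS/HIP call initializes the runtime. A minimal sketch (the call ordering here is my own assumption about when the HSA runtime reads the variable):

fn main() {
    // The HSA runtime reads this variable when it initializes, so it must be
    // set before the first rocBLAS/HIP call in the process.
    // (Note: on Rust edition 2024, std::env::set_var is unsafe and needs an
    // unsafe block; on earlier editions it is a safe call.)
    std::env::set_var("HSA_OVERRIDE_GFX_VERSION", "11.0.0");

    // ... create the rocBLAS handle and do the work afterwards ...
}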
If anyone is interested, I am playing with large matrix multiplication in the Rust programming language, calling out to the rocBLAS library. My matrix is about 90 GBytes, so I will also be doing batched or segmented multiplies.
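The segmentation itself is just blocked matrix multiplication. A minimal sketch of the loop structure in Rust (row-major square matrices assumed; gemm_tile here is a naive CPU stand-in for what would really be the rocBLAS call, not an actual binding):

// Multiply two square row-major matrices in t-sized tiles, so only a few
// tiles need to be resident at once.
fn tiled_matmul(a: &[f32], b: &[f32], c: &mut [f32], n: usize, t: usize) {
    assert!(t > 0);
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    assert_eq!(c.len(), n * n);
    for i0 in (0..n).step_by(t) {
        let mi = t.min(n - i0);
        for j0 in (0..n).step_by(t) {
            let mj = t.min(n - j0);
            for k0 in (0..n).step_by(t) {
                let mk = t.min(n - k0);
                // C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..]
                gemm_tile(a, b, c, n, i0, j0, k0, mi, mj, mk);
            }
        }
    }
}

// Naive CPU stand-in for a rocBLAS gemm on one tile.
fn gemm_tile(
    a: &[f32], b: &[f32], c: &mut [f32], n: usize,
    i0: usize, j0: usize, k0: usize,
    mi: usize, mj: usize, mk: usize,
) {
    for i in i0..i0 + mi {
        for j in j0..j0 + mj {
            let mut acc = c[i * n + j];
            for k in k0..k0 + mk {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}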
I will find out whether the iGPU is quicker than the CPU for this task on the FW16 AMD 7840HS.
My expectation is that they will be about as fast as each other, because the main problem is that the matrix does not fit in RAM, and even with a smaller, say 20 GB, matrix that did fit in RAM, the operation is probably memory-bandwidth constrained rather than compute constrained.
Nice one, and good luck. Is it too early to ask if a sparse matrix saves you from sending data that's eventually a NOP, or if you've used this 90 GByte dataset as an excuse to fit twin 48 GiB DDR5 DIMMs?
The iGPU may only have 12 RDNA3 Compute Units, but they are rated at 8.6 TFlops of single-precision fused multiply-add (FMA) at 2.8 GHz. If you get the drivers working for the 10 TFlop ML accelerator (Windows edition; for Linux, the amd/xdna-driver on GitHub, which has been submitted for Linux kernel 6.14), maybe that will laugh at the dimensions of your matrices and then weep that it's not got huge memory bandwidth!
I have not tried the Linux amd/xdna-driver.
Is there a web page detailing its capabilities and data bandwidths?
I am interested in matrix multiplication and other matrix ops with complex numbers.
Conclusion:
ROCm is a mess when attempted on the AMD 7840HS iGPU.
It causes general L2 protection faults and unrecoverable GPU problems that need a reboot to resolve.
It works a little with a small 3x3 matrix, but larger ones such as 10240x10240 fail badly.
So, no ROCm on the FW16 AMD.
The NPU is an ASIC made by the team that was Xilinx; there is more detail at the link to the Windows driver in my earlier reply. I haven't looked for a picture of its characteristics beyond the proclaimed '10 TFlops' throughput.
This is nothing to do with FW; they don't make the software.
The rocBLAS software is really bad quality.
For example, calling rocblas_set_matrix(…) with invalid parameters actually prints this out: hip error code: 'hipErrorInvalidValue':1 at /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocBLAS/library/src/rocblas_auxiliary.cpp:569
but the return value from that function call is 0, meaning success. Go figure!!!
ROCm seems to behave better on the happy path. Tested and works on a 10240x10240 f32 matrix, but you have to flatten it to a 1D vector for BLAS.
But with any off-by-one allocation, ROCm fails silently and badly and requires a reboot.
I will probably need to add my own shim Rust functions that do all the sanity checks before calling ROCm, for any semblance of Rust safety.
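For what it's worth, this is the shape of shim I have in mind: validate everything on the Rust side before handing pointers to the library, since rocBLAS may print an error yet still return success. A sketch, not an existing crate; the actual FFI call is left as a comment:

// Checked wrapper around a rocblas_set_matrix-style call. Column-major
// layout and i32 dimensions assumed, per BLAS convention.
fn checked_set_matrix(
    rows: i32,
    cols: i32,
    elem_size: i32,
    src: &[u8],
    lda: i32,
    ldb: i32,
) -> Result<(), String> {
    if rows <= 0 || cols <= 0 || elem_size <= 0 {
        return Err(format!("bad dimensions: {rows} x {cols}, elem_size {elem_size}"));
    }
    if lda < rows || ldb < rows {
        return Err(format!("leading dimension too small: lda {lda}, ldb {ldb}, rows {rows}"));
    }
    // Column-major: the source must hold at least lda * cols elements.
    let needed = lda as usize * cols as usize * elem_size as usize;
    if src.len() < needed {
        return Err(format!("source is {} bytes, need at least {needed}", src.len()));
    }
    // Only now would the unsafe FFI call happen, e.g.
    // unsafe { rocblas_set_matrix(rows, cols, elem_size, src.as_ptr().cast(), lda, dst_ptr, ldb) }
    Ok(())
}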
Test results:
FW16 AMD iGPU:
ROCM BLAS: 30000 x 30000 matrix of Complex f32 values.
cgemm + sync: 139.92 seconds
RAM used: about 20GBytes.
FW16 AMD CPU:
BLAS: 30000 x 30000 matrix of Complex f32 values.
cgemm (sync not needed): 216.11 seconds.
RAM used: about 20GBytes.
Both run together:
The GPU crashes and resets itself, so there is no answer from the iGPU cgemm + sync.
The CPU completes.
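Back of the envelope (assuming the usual 8 real flops per complex multiply-add): a complex f32 gemm costs about 8 * N^3 flops, so 8 * 30000^3 ≈ 2.16e14 flops. That works out to roughly 1.5 TFlops for the iGPU run (139.92 s) and roughly 1.0 TFlops for the CPU run (216.11 s). And three 30000 x 30000 complex f32 matrices are 3 * 30000^2 * 8 bytes ≈ 21.6 GB, which matches the "about 20 GBytes" of RAM observed.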
I have also run the same test on a few other desktops.
Currently, my FW16 laptop, without a dGPU, is the fastest (2x) computer I have in the house!!!
On the 7840HS, ROCm can use up to 33554432 KBytes (32 GiB) of RAM when VRAM is set to 2048 MB.
This is separate from the VRAM, so if one allocates more VRAM, ROCm has less RAM to use.
VRAM + ROCm RAM == a fixed value of 34 GB, i.e. 35-bit addressable.
Note: My FW16 has 64 GB RAM chips.
So, it's a bit like having a GPU with 34 GB of RAM.
It is not possible to get the GPU to access all of the 64 GB RAM.
The ROCm model with an APU is:
Allocate the memory block on the GPU.
The CPU can directly read/write that block; no copy from HOST to GPU and back again is needed.
Before reading back from the GPU, one needs to do a GPU sync to ensure it has finished its calculations.
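A sketch of that flow from Rust, following the APU model just described (the extern declarations are hand-rolled against the HIP C API rather than taken from a crate, and error handling is minimal):

use std::ffi::c_void;

// Hand-rolled bindings to the HIP runtime (libamdhip64). Signatures follow
// the HIP C API; a real project would use bindgen-generated bindings.
#[link(name = "amdhip64")]
extern "C" {
    fn hipMalloc(ptr: *mut *mut c_void, size: usize) -> i32;
    fn hipDeviceSynchronize() -> i32;
    fn hipFree(ptr: *mut c_void) -> i32;
}

fn main() {
    const N: usize = 1024;
    let bytes = N * N * std::mem::size_of::<f32>();
    let mut buf: *mut c_void = std::ptr::null_mut();

    unsafe {
        // 1. Allocate the block on the GPU; on an APU, per the model above,
        //    the CPU can read/write it directly (no host<->device copy).
        assert_eq!(hipMalloc(&mut buf, bytes), 0);

        // CPU writes straight into the allocation.
        let slice = std::slice::from_raw_parts_mut(buf as *mut f32, N * N);
        slice.fill(1.0);

        // ... enqueue the gemm on this buffer here ...

        // 2. Sync before the CPU reads results back, so the GPU has finished.
        assert_eq!(hipDeviceSynchronize(), 0);
        println!("first element: {}", slice[0]);

        assert_eq!(hipFree(buf), 0);
    }
}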
EDIT:
The 34GB RAM limit can be increased using a few simple configuration commands:
You can try increasing the GTT pool with something like:
/etc/modprobe.d/increase_amd_memory.conf
#Otherwise it's capped to only half the RAM
options amdgpu gttsize=90000 #in MB
options ttm pages_limit=22500000 #4k per page, 90GB total
options ttm page_pool_size=22500000
Note:
ttm pages = (gttsize in MB * 1024) / 4.096 (4096-byte pages)
So, if you wish to use 60 GB RAM with ROCm:
options amdgpu gttsize=60000 #in MB
options ttm pages_limit=15000000 #4k per page, 60GB total
options ttm page_pool_size=15000000
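The page counts follow mechanically from gttsize. A trivial sanity check of the formula above (plain Rust, nothing ROCm-specific):

fn main() {
    // ttm pages = (gttsize in MB * 1024) / 4.096, per the note above.
    for gttsize_mb in [90_000u64, 60_000] {
        let pages = (gttsize_mb as f64 * 1024.0 / 4.096) as u64;
        println!("gttsize={gttsize_mb} MB -> pages_limit={pages}");
    }
    // Prints 22500000 and 15000000, matching the two configs shown.
}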
Previously, with only 34GB GTT RAM, I could only do a 30000 x 30000 matrix.
Now, with 60GB GTT RAM, I can do:
40000 x 40000 matrix multiplication with complex f32 values takes:
Duration: 387.57099167s
50000 x 50000 matrix multiplication with complex f32 values takes:
Duration: 886.901997998s
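Back of the envelope again (same 8 * N^3 flop count): 8 * 40000^3 ≈ 5.12e14 flops / 387.57 s ≈ 1.3 TFlops, and 8 * 50000^3 = 1e15 flops / 886.90 s ≈ 1.1 TFlops, so throughput holds up fairly well as the matrices grow. Three 50000 x 50000 complex f32 matrices are 3 * 50000^2 * 8 bytes = 60 GB, which is exactly why the 60 GB GTT setting was needed.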