I have been playing with ROCm 6.3.1 on the FW16 without the dGPU, on Ubuntu 24.04.
If anyone sees this message:
rocBLAS error: Cannot read /opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103
List of available TensileLibrary Files :
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1151.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
Aborted (core dumped)
This appears to work around the problem:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
It tells ROCm to report the gfx1103 iGPU as gfx1100, for which a TensileLibrary file does exist.
If anyone is interested, I am playing with large matrix multiplication in the Rust programming language, calling out to the rocBLAS library. My matrix is about 90GBytes, so I will be doing batch or segmented multiplies as well.
I will find out whether using the iGPU for the task is quicker than using the CPU on the FW16's AMD 7840HS.
My expectation is that both will be about as fast as each other, because the main problem is that the matrix does not fit in RAM; and even if I had a smaller matrix, say 20GB, that did fit in RAM, the operation is probably memory-bandwidth constrained rather than compute constrained.
Nice one, and good luck. Is it too early to ask if a sparse matrix saves you from sending data that’s eventually a NOP, or if you’ve used this 90GByte dataset as an excuse to fit twin 48GiB DDR5 DIMMs?
The iGPU may only have 12 RDNA3 Compute Units, but they rate at 8.6 TFlops of single-precision fused multiply-add (FMA) at 2.8GHz. If you get the drivers working for the 10TFlop ML accelerator (there is a Windows edition, and on Linux the amd/xdna-driver on GitHub, which has been submitted for Linux kernel 6.14), maybe that will laugh at the dimensions of your matrices and then weep that it hasn't got huge memory bandwidth!
I have not tried Linux amd/xdna-driver.
Is there a web page detailing its capabilities and data bandwidths?
I am interested in matrix multiplication and other matrix ops with complex numbers.
Conclusion:
ROCm is a mess when attempted on the AMD 7840HS iGPU.
It causes general L2 protection faults, and an unrecoverable GPU problem that needs a reboot to resolve.
It works a little with a small 3x3 matrix, but a larger 10240x10240 one fails badly.
So, no ROCm on the FW16 AMD.
The NPU is an ASIC made by the team that was Xilinx; there is more detail at the link to the Windows driver in my earlier reply. I haven't looked for a picture of its characteristics beyond the proclaimed ‘10TFlops’ throughput.
This is nothing to do with Framework; they don't make the software.
The rocBLAS software is really bad quality.
For example, calling "rocblas_set_matrix(…)" with invalid parameters actually prints this out: hip error code: 'hipErrorInvalidValue':1 at /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocBLAS/library/src/rocblas_auxiliary.cpp:569
but the return value from that function call is 0, meaning success. Go figure!!!
ROCm seems to behave better on the happy path. Tested and it works on a 10240x10240 f32 matrix, but you have to flatten it to a 1D vector for BLAS.
But with any off-by-one allocation, ROCm fails kind of silently, and badly, and requires a reboot.
I will probably need to add my own shim Rust functions that do all the sanity checks before calling ROCm, for any semblance of Rust safety; something like the sketch below.
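For example, a minimal sketch of such a shim, assuming a hand-written extern "C" binding (only rocblas_set_matrix is the real C entry point; the wrapper name, its checks and its error strings are illustrative and untested):

// Validate everything on the Rust side first, because rocBLAS may print an
// error to stderr and still return success.
use std::os::raw::{c_int, c_void};

extern "C" {
    // rocblas_status rocblas_set_matrix(rows, cols, elem_size, a, lda, b, ldb)
    fn rocblas_set_matrix(
        rows: c_int, cols: c_int, elem_size: c_int,
        a: *const c_void, lda: c_int,
        b: *mut c_void, ldb: c_int,
    ) -> c_int; // 0 == rocblas_status_success
}

fn set_matrix_f32(host: &[f32], rows: usize, cols: usize, lda: usize,
                  dev: *mut c_void, ldb: usize) -> Result<(), String> {
    if rows == 0 || cols == 0 {
        return Err("zero-sized matrix".into());
    }
    if lda < rows || ldb < rows {
        return Err("leading dimension smaller than the row count".into());
    }
    // Column-major layout: conservatively require lda * cols host elements.
    if host.len() < lda * cols {
        return Err(format!("host buffer too small: {} < {}", host.len(), lda * cols));
    }
    if dev.is_null() {
        return Err("null device pointer".into());
    }
    let status = unsafe {
        rocblas_set_matrix(rows as c_int, cols as c_int,
                           std::mem::size_of::<f32>() as c_int,
                           host.as_ptr() as *const c_void, lda as c_int,
                           dev, ldb as c_int)
    };
    if status == 0 { Ok(()) } else {
        Err(format!("rocblas_set_matrix returned status {status}"))
    }
}

The idea being that nothing reaches rocBLAS unless the dimensions, leading dimensions and buffer sizes already agree, since the library's return codes clearly cannot be trusted on their own.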
Test results:
FW16 AMD iGPU:
ROCM BLAS: 30000 x 30000 matrix of Complex f32 values.
cgemm + sync: 139.92 seconds
RAM used: about 20GBytes.
FW16 AMD CPU:
BLAS: 30000 x 30000 matrix of Complex f32 values.
cgemm (sync not needed): 216.11 seconds.
RAM used: about 20GBytes.
Both run together:
GPU crashes and resets itself, resulting in no answer from the iGPU cgemm + sync.
CPU completes.
I have also run the same test on a few other desktops.
Currently, my FW16 laptop, without a dGPU, is the fastest (by about 2x) computer I have in the house!!!
On the 7840HS, ROCm can use up to 33554432 KBytes (32GiB) of RAM when the VRAM is set to 2048MB.
This is separate from the VRAM, so if one allocates more VRAM, ROCm has less to use.
VRAM + ROCm RAM == a fixed value of about 34GB, i.e. roughly 35 bits of addressable space.
Note: my FW16 has 64GB of RAM fitted.
So, it's a bit like having a GPU with 34GB of RAM.
It is not possible to get the GPU to access all 64GB of the RAM.
The ROCm model with an APU is (see the sketch below):
Allocate the memory block on the GPU.
The CPU can directly read/write to that block; no copy from host to GPU and back again is needed.
Before reading back from the GPU, one needs to do a GPU sync to ensure it has finished its calculations before doing the read.
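Roughly, that flow looks like this in Rust. This is a sketch only: the extern "C" declarations are hand-written bindings to the real HIP and rocBLAS C functions, the Cf32 struct and the small 1024 test size are illustrative, error handling is reduced to asserts, and it assumes linking against libamdhip64 and librocblas.

// Sketch of the APU flow: allocate on the GPU, fill directly from the CPU,
// run CGEMM, sync, then read the result back from the CPU.
use std::os::raw::{c_int, c_void};

#[repr(C)]
#[derive(Clone, Copy)]
struct Cf32 { re: f32, im: f32 } // same memory layout as rocblas_float_complex

type RocblasHandle = *mut c_void; // opaque handle
const ROCBLAS_OP_NONE: c_int = 111; // rocblas_operation_none (see rocblas_types.h)

extern "C" {
    fn hipMalloc(ptr: *mut *mut c_void, size: usize) -> c_int;
    fn hipFree(ptr: *mut c_void) -> c_int;
    fn hipDeviceSynchronize() -> c_int;
    fn rocblas_create_handle(handle: *mut RocblasHandle) -> c_int;
    fn rocblas_destroy_handle(handle: RocblasHandle) -> c_int;
    fn rocblas_cgemm(handle: RocblasHandle, trans_a: c_int, trans_b: c_int,
                     m: c_int, n: c_int, k: c_int,
                     alpha: *const Cf32, a: *const Cf32, lda: c_int,
                     b: *const Cf32, ldb: c_int,
                     beta: *const Cf32, c: *mut Cf32, ldc: c_int) -> c_int;
}

fn main() {
    let n = 1024usize; // small size for the sketch; the runs above use 30000+
    let bytes = n * n * std::mem::size_of::<Cf32>();
    unsafe {
        let mut handle: RocblasHandle = std::ptr::null_mut();
        assert_eq!(rocblas_create_handle(&mut handle), 0);

        // 1. Allocate the blocks "on the GPU" (they land in GTT on the APU).
        let (mut a, mut b, mut c) = (std::ptr::null_mut(), std::ptr::null_mut(), std::ptr::null_mut());
        assert_eq!(hipMalloc(&mut a, bytes), 0);
        assert_eq!(hipMalloc(&mut b, bytes), 0);
        assert_eq!(hipMalloc(&mut c, bytes), 0);

        // 2. On the APU the CPU can fill them directly, no hipMemcpy needed
        //    (as described above; on a dGPU you would copy instead).
        std::slice::from_raw_parts_mut(a as *mut Cf32, n * n).fill(Cf32 { re: 1.0, im: 0.0 });
        std::slice::from_raw_parts_mut(b as *mut Cf32, n * n).fill(Cf32 { re: 0.5, im: 0.0 });

        // 3. C = alpha*A*B + beta*C, column-major, matrices flattened to 1-D.
        let (alpha, beta) = (Cf32 { re: 1.0, im: 0.0 }, Cf32 { re: 0.0, im: 0.0 });
        let nn = n as c_int;
        assert_eq!(rocblas_cgemm(handle, ROCBLAS_OP_NONE, ROCBLAS_OP_NONE,
                                 nn, nn, nn, &alpha, a as *const Cf32, nn,
                                 b as *const Cf32, nn, &beta, c as *mut Cf32, nn), 0);

        // 4. Sync before the CPU reads the result.
        assert_eq!(hipDeviceSynchronize(), 0);
        let c0 = *(c as *const Cf32);
        println!("C[0] = {} + {}i", c0.re, c0.im);

        hipFree(a); hipFree(b); hipFree(c);
        rocblas_destroy_handle(handle);
    }
}

The only GPU-specific step before the CPU touches C is the hipDeviceSynchronize() call, which is the sync mentioned above; everything else is ordinary column-major BLAS with the matrices flattened to 1-D.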
EDIT:
The 34GB RAM limit can be increased using a few simple configuration commands:
You can try increasing the GTT pool with something like:
/etc/modprobe.d/increase_amd_memory.conf
#Otherwise it's capped to only half the RAM
options amdgpu gttsize=90000 #in MB
options ttm pages_limit=22500000 #4k per page, 90GB total
options ttm page_pool_size=22500000
Note:
(gttsize * 1024) / 4.096 = ttm pages
So, if you wish to use 60GB RAM with ROCM:
options amdgpu gttsize=60000 #in MB
options ttm pages_limit=15000000 #4k per page, 60GB total
options ttm page_pool_size=15000000
Previously, with only 34GB GTT RAM, I could only do a 30000 x 30000 matrix.
Now, with 60GB GTT RAM, I can do:
40000 x 40000 matrix multiplication with complex f32 values takes:
Duration: 387.57099167s
50000 x 50000 matrix multiplication with complex f32 values takes:
Duration: 886.901997998s
With ROCm 6.3.3, the 50000 x 50000 matrix multiplication fails and the GPU crashes with:
HW Exception by GPU node-1 (Agent handle: 0x5573878d36b0) reason :GPU Hang
So, ROCm being its nice stable self again… not!
40000 x 40000 still works on ROCm 6.3.3:
Duration: 373.270316971s
For now I can get >4TFlops with big matmul matrices in BF16 (in fact BF16@FP32=>FP32).
The CPU has a peak of 2TFlops (BF16 too).
So I do not think we can get better performance on XDNA, but it may have better power saving.
The example Rust program I used to test the 50000 x 50000 32-bit complex float multiplication with rocBLAS CGEMM used to work on an older version of ROCm, 6.3.1, but does not now on 6.4.0.
It crashes with:
HW Exception by GPU node-1 (Agent handle: 0x56286fa66890) reason :GPU Hang
Aborted (core dumped)
The dmesg log has the following errors on ROCm 6.4.0, Linux kernel 6.14.0:
[ 9101.922835] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 9101.922842] amdgpu 0000:c1:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ 9101.922844] amdgpu 0000:c1:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 9101.922846] amdgpu 0000:c1:00.0: amdgpu: Failed to evict queue 1
[ 9101.922928] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 9101.922955] amdgpu 0000:c1:00.0: amdgpu: Failed to evict process queues
[ 9101.922957] amdgpu: Failed to quiesce KFD
[ 9101.923006] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[ 9101.925620] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[ 9105.114885] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=SUSPEND
[ 9105.114900] [drm:amdgpu_mes_suspend [amdgpu]] *ERROR* failed to suspend all gangs
[ 9105.115450] amdgpu 0000:c1:00.0: amdgpu: suspend of IP block <mes_v11_0> failed -110
[ 9108.408628] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 9108.408641] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 9108.972624] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[ 9108.972640] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 9108.972648] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B53
[ 9108.972655] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 9108.972660] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x1
[ 9108.972666] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x1
[ 9108.972670] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 9108.972675] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 9108.972680] amdgpu 0000:c1:00.0: amdgpu: RW: 0x1
[ 9108.972698] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[ 9108.972705] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 9111.697926] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 9111.697935] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 9114.984121] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 9114.984130] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 9114.986411] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 9115.017434] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 9115.017929] [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
[ 9115.018125] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 9115.020249] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 9115.026811] [drm] DMUB hardware initialized: version=0x08004500
[ 9115.379435] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 9115.379446] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 9115.379450] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 9115.379454] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 9115.379457] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 9115.379460] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 9115.379463] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 9115.379466] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 9115.379469] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 9115.379472] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 9115.379475] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 9115.379479] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 9115.379482] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 9115.382346] amdgpu 0000:c1:00.0: amdgpu: GPU reset(1) succeeded!