I have been playing with ROCm 6.3.1 on the FW16 without the dGPU, on Ubuntu 24.04.
If anyone sees this message:
rocBLAS error: Cannot read /opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103
List of available TensileLibrary Files :
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1151.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm-6.3.1/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
Aborted (core dumped)
This appears to work around the problem:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
(The iGPU reports as gfx1103, which is missing from the list above; the override makes the runtime treat it as gfx1100, which does have a TensileLibrary file.)
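The override can also be set from inside the Rust program itself, as long as it happens before the first rocBLAS/HIP call initializes the runtime. A minimal sketch (the call ordering here is my own assumption about when the HSA runtime reads the variable):

fn main() {
    // The HSA runtime reads this variable when it initializes, so it must be
    // set before the first rocBLAS/HIP call in the process.
    // (Note: on Rust edition 2024, std::env::set_var is unsafe and needs an
    // unsafe block; on earlier editions it is a safe call.)
    std::env::set_var("HSA_OVERRIDE_GFX_VERSION", "11.0.0");

    // ... create the rocBLAS handle and do the work afterwards ...
}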
If anyone is interested, I am playing with large matrix multiplication in the Rust programming language, calling out to the rocBLAS library. My matrix is about 90 GBytes, so I will also be doing batched or segmented multiplies.
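The segmentation itself is just blocked matrix multiplication. A minimal sketch of the loop structure in Rust (row-major square matrices assumed; gemm_tile here is a naive CPU stand-in for what would really be the rocBLAS call, not an actual binding):

// Multiply two square row-major matrices in t-sized tiles, so only a few
// tiles need to be resident at once.
fn tiled_matmul(a: &[f32], b: &[f32], c: &mut [f32], n: usize, t: usize) {
    assert!(t > 0);
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    assert_eq!(c.len(), n * n);
    for i0 in (0..n).step_by(t) {
        let mi = t.min(n - i0);
        for j0 in (0..n).step_by(t) {
            let mj = t.min(n - j0);
            for k0 in (0..n).step_by(t) {
                let mk = t.min(n - k0);
                // C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..]
                gemm_tile(a, b, c, n, i0, j0, k0, mi, mj, mk);
            }
        }
    }
}

// Naive CPU stand-in for a rocBLAS gemm on one tile.
fn gemm_tile(
    a: &[f32], b: &[f32], c: &mut [f32], n: usize,
    i0: usize, j0: usize, k0: usize,
    mi: usize, mj: usize, mk: usize,
) {
    for i in i0..i0 + mi {
        for j in j0..j0 + mj {
            let mut acc = c[i * n + j];
            for k in k0..k0 + mk {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}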
I will find out whether the iGPU is quicker than the CPU for this task on the FW16 AMD 7840HS.
My expectation is that they will be about as fast as each other, because the main problem is that the matrix does not fit in RAM, and even with a smaller, say 20 GB, matrix that did fit in RAM, the operation is probably memory-bandwidth constrained rather than compute constrained.
Nice one, and good luck. Is it too early to ask if a sparse matrix saves you from sending data that's eventually a NOP, or if you've used this 90 GByte dataset as an excuse to fit twin 48 GiB DDR5 DIMMs?
The iGPU may only have 12 RDNA3 Compute Units, but they are rated at 8.6 TFlops of single-precision fused multiply-add (FMA) at 2.8 GHz. If you get the drivers working for the 10 TFlop ML accelerator (Windows edition; for Linux, the amd/xdna-driver on GitHub, which has been submitted for Linux kernel 6.14), maybe that will laugh at the dimensions of your matrices and then weep that it's not got huge memory bandwidth!
I have not tried the Linux amd/xdna-driver.
Is there a web page detailing its capabilities and data bandwidths?
I am interested in matrix multiplication and other matrix ops with complex numbers.
Conclusion:
ROCm is a mess when attempted on the AMD 7840HS iGPU.
It causes general L2 protection faults and unrecoverable GPU problems that need a reboot to resolve.
It works a little with a small 3x3 matrix, but larger ones such as 10240x10240 fail badly.
So, no ROCm on the FW16 AMD.
The NPU is an ASIC made by the team that was Xilinx; there is more detail at the link to the Windows driver in my earlier reply. I haven't looked for a picture of its characteristics beyond the proclaimed '10 TFlops' throughput.
This is nothing to do with FW; they don't make the software.
The rocBLAS software is really bad quality.
For example, calling rocblas_set_matrix(…) with invalid parameters actually prints this out: hip error code: 'hipErrorInvalidValue':1 at /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocBLAS/library/src/rocblas_auxiliary.cpp:569
but the return value from that function call is 0, meaning success. Go figure!!!
ROCm seems to behave better on the happy path. Tested and works on a 10240x10240 f32 matrix, but you have to flatten it to a 1D vector for BLAS.
But with any off-by-one allocation, ROCm fails silently and badly and requires a reboot.
I will probably need to add my own shim Rust functions that do all the sanity checks before calling ROCm, for any semblance of Rust safety.
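For what it's worth, this is the shape of shim I have in mind: validate everything on the Rust side before handing pointers to the library, since rocBLAS may print an error yet still return success. A sketch, not an existing crate; the actual FFI call is left as a comment:

// Checked wrapper around a rocblas_set_matrix-style call. Column-major
// layout and i32 dimensions assumed, per BLAS convention.
fn checked_set_matrix(
    rows: i32,
    cols: i32,
    elem_size: i32,
    src: &[u8],
    lda: i32,
    ldb: i32,
) -> Result<(), String> {
    if rows <= 0 || cols <= 0 || elem_size <= 0 {
        return Err(format!("bad dimensions: {rows} x {cols}, elem_size {elem_size}"));
    }
    if lda < rows || ldb < rows {
        return Err(format!("leading dimension too small: lda {lda}, ldb {ldb}, rows {rows}"));
    }
    // Column-major: the source must hold at least lda * cols elements.
    let needed = lda as usize * cols as usize * elem_size as usize;
    if src.len() < needed {
        return Err(format!("source is {} bytes, need at least {needed}", src.len()));
    }
    // Only now would the unsafe FFI call happen, e.g.
    // unsafe { rocblas_set_matrix(rows, cols, elem_size, src.as_ptr().cast(), lda, dst_ptr, ldb) }
    Ok(())
}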
Test results:
FW16 AMD iGPU:
ROCM BLAS: 30000 x 30000 matrix of Complex f32 values.
cgemm + sync: 139.92 seconds
RAM used: about 20GBytes.
FW16 AMD CPU:
BLAS: 30000 x 30000 matrix of Complex f32 values.
cgemm (sync not needed): 216.11 seconds.
RAM used: about 20GBytes.
Both run together:
The GPU crashes and resets itself, so there is no answer from the iGPU cgemm + sync.
The CPU completes.
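Back of the envelope (assuming the usual 8 real flops per complex multiply-add): a complex f32 gemm costs about 8 * N^3 flops, so 8 * 30000^3 ≈ 2.16e14 flops. That works out to roughly 1.5 TFlops for the iGPU run (139.92 s) and roughly 1.0 TFlops for the CPU run (216.11 s). And three 30000 x 30000 complex f32 matrices are 3 * 30000^2 * 8 bytes ≈ 21.6 GB, which matches the "about 20 GBytes" of RAM observed.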
I have also run the same test on a few other desktops.
Currently, my FW16 laptop, without a dGPU, is the fastest (2x) computer I have in the house!!!
On the 7840HS, ROCm can use up to 33554432 KBytes (32 GiB) of RAM when VRAM is set to 2048 MB.
This is separate from the VRAM, so if one allocates more VRAM, ROCm has less RAM to use.
VRAM + ROCm RAM == a fixed value of 34 GB, i.e. 35-bit addressable.
Note: My FW16 has 64 GB RAM chips.
So, it's a bit like having a GPU with 34 GB of RAM.
It is not possible to get the GPU to access all of the 64 GB RAM.
The ROCm model with an APU is:
Allocate the memory block on the GPU.
The CPU can directly read/write that block; no copy from HOST to GPU and back again is needed.
Before reading back from the GPU, one needs to do a GPU sync to ensure it has finished its calculations.
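A sketch of that flow from Rust, following the APU model just described (the extern declarations are hand-rolled against the HIP C API rather than taken from a crate, and error handling is minimal):

use std::ffi::c_void;

// Hand-rolled bindings to the HIP runtime (libamdhip64). Signatures follow
// the HIP C API; a real project would use bindgen-generated bindings.
#[link(name = "amdhip64")]
extern "C" {
    fn hipMalloc(ptr: *mut *mut c_void, size: usize) -> i32;
    fn hipDeviceSynchronize() -> i32;
    fn hipFree(ptr: *mut c_void) -> i32;
}

fn main() {
    const N: usize = 1024;
    let bytes = N * N * std::mem::size_of::<f32>();
    let mut buf: *mut c_void = std::ptr::null_mut();

    unsafe {
        // 1. Allocate the block on the GPU; on an APU, per the model above,
        //    the CPU can read/write it directly (no host<->device copy).
        assert_eq!(hipMalloc(&mut buf, bytes), 0);

        // CPU writes straight into the allocation.
        let slice = std::slice::from_raw_parts_mut(buf as *mut f32, N * N);
        slice.fill(1.0);

        // ... enqueue the gemm on this buffer here ...

        // 2. Sync before the CPU reads results back, so the GPU has finished.
        assert_eq!(hipDeviceSynchronize(), 0);
        println!("first element: {}", slice[0]);

        assert_eq!(hipFree(buf), 0);
    }
}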
EDIT:
The 34GB RAM limit can be increased using a few simple configuration commands:
You can try increasing the GTT pool with something like:
/etc/modprobe.d/increase_amd_memory.conf
#Otherwise it's capped to only half the RAM
options amdgpu gttsize=90000 #in MB
options ttm pages_limit=22500000 #4k per page, 90GB total
options ttm page_pool_size=22500000
Note:
ttm pages = (gttsize in MB * 1024) / 4.096 (4096-byte pages)
So, if you wish to use 60 GB RAM with ROCm:
options amdgpu gttsize=60000 #in MB
options ttm pages_limit=15000000 #4k per page, 60GB total
options ttm page_pool_size=15000000
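The page counts follow mechanically from gttsize. A trivial sanity check of the formula above (plain Rust, nothing ROCm-specific):

fn main() {
    // ttm pages = (gttsize in MB * 1024) / 4.096, per the note above.
    for gttsize_mb in [90_000u64, 60_000] {
        let pages = (gttsize_mb as f64 * 1024.0 / 4.096) as u64;
        println!("gttsize={gttsize_mb} MB -> pages_limit={pages}");
    }
    // Prints 22500000 and 15000000, matching the two configs shown.
}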
Previously, with only 34GB GTT RAM, I could only do a 30000 x 30000 matrix.
Now, with 60GB GTT RAM, I can do:
40000 x 40000 matrix multiplication with complex f32 values takes:
Duration: 387.57099167s
50000 x 50000 matrix multiplication with complex f32 values takes:
Duration: 886.901997998s
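Back of the envelope again (same 8 * N^3 flop count): 8 * 40000^3 ≈ 5.12e14 flops / 387.57 s ≈ 1.3 TFlops, and 8 * 50000^3 = 1e15 flops / 886.90 s ≈ 1.1 TFlops, so throughput holds up fairly well as the matrices grow. Three 50000 x 50000 complex f32 matrices are 3 * 50000^2 * 8 bytes = 60 GB, which is exactly why the 60 GB GTT setting was needed.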