Request: verify dGPU support

Background: My friend and I are experimenting with hybrid LLM inference across the Ryzen AI Max 395 iGPU and a dGPU. Currently, we’re using a GMKtec EVO-X2 and a GeForce 4090 (connected via OCuLink). With Qwen3-235b in a hybrid Q8 & Q4 quantization, we get about 14 T/s, and about 200 T/s with batching, which shows great potential.
So we believe a dGPU is useful even though the Ryzen AI Max only provides PCIe 4.0 x4. We want to try the AMD Radeon Pro W7900 48 GB later to reduce cost and increase context length.

Before getting the W7900, I tried my 9700 XT first and found odd behavior: all the hardware information looks correct, but there is a hidden power limit of about 120 W for the 9700 XT on both Windows and Linux. Even with overclocking enabled, I can only reach about 130 W.
The problem does not occur with my 4090.
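
On Linux, the amdgpu driver exposes the board power cap through hwmon in sysfs (values in microwatts), so one way to confirm a limit like this is to read `power1_cap` and `power1_cap_max` directly. A minimal sketch, assuming the card appears under `/sys/class/drm/card*/device/hwmon/hwmon*`; which files exist varies by GPU generation and kernel, so missing ones are skipped:

```python
from pathlib import Path


def read_power_limits(hwmon_dir):
    """Return amdgpu power readings in watts from one hwmon directory.

    sysfs reports these values in microwatts. Files that do not exist on a
    given GPU or kernel are simply left out of the result.
    """
    names = ("power1_cap", "power1_cap_max", "power1_average", "power1_input")
    limits = {}
    for name in names:
        path = Path(hwmon_dir) / name
        if path.is_file():
            limits[name] = int(path.read_text().strip()) / 1_000_000
    return limits
```

Calling `read_power_limits` on each matching hwmon directory shows whether the driver itself reports a cap near 120 W, or whether the limit is enforced somewhere the driver does not report.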

I searched around, and it looks like a common problem on AMD + AMD (A + A) platforms, possibly related to SmartShift or a VBIOS conflict, but I can’t verify either on my current machine.

I’ve ordered two Framework Desktops. Before they ship, I would like the Framework team to help me verify dGPU support.

Thank you.

For the price of a Radeon Pro W7900 48G, you could have two Ryzen AI Max 395 machines…
Why not simply chain three AI Max machines over USB-C (or only two…)?

The key is to let the dGPU hold the KV cache, which can greatly accelerate inference.

Without a dGPU’s help, the iGPU performs similarly to an M4 Pro. You can check my friend’s test: Strix Halo (395) local LLM tests (本地运行LLM测试) | David Huang’s Blog

Our experiment aims to build an affordable, usable way for individuals or small teams to deploy llama4-400b, qwen3-235b, and ds-671b without quantizing too aggressively (to avoid degrading quality).
Although the W7900 is expensive, it’s much cheaper than a compute GPU, and its large memory helps support long contexts.
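
To make the link between VRAM and context length concrete, here is a rough sizing sketch for a KV cache under standard (grouped-query) attention. The layer count, KV head count, and head dimension in the usage note are hypothetical placeholders, not the dimensions of any model named in this thread, and the estimate ignores weights and compute buffers that also live on the dGPU:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes needed to hold K and V for one sequence across all layers."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def max_context(vram_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Largest context length whose KV cache fits in vram_bytes."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return vram_bytes // per_token
```

With hypothetical dimensions of 64 layers, 8 KV heads, and a head dimension of 128 at 16-bit precision, each token costs 256 KiB, so `max_context(24 * 2**30, 64, 8, 128)` gives 98304 tokens and `max_context(48 * 2**30, 64, 8, 128)` gives 196608. Models with other attention schemes (for example compressed KV caches) need a different formula.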

When I get the Framework Desktops, I want to build a cluster to test DeepSeek-671b.

For the APU and UMA, you do not need to change the GTT size :wink:

Did you test a native FP16/BF16 model?

That’s a Linux kernel issue.

Ubuntu 24.04 HWE uses kernel 6.11, on which ROCm can only see UMA memory and can’t use the pre-allocated graphics memory. Other kernels don’t have this issue, but they behave inconsistently, so the software stack is still troublesome.
I haven’t used the new “TheRock” because it hasn’t been released yet. It’s more complex to install and offers no significant performance gain.

In the article, he tests with an HP ZBook Ultra G1a 64GB.
The HP ZBook Ultra G1a, ASUS Flow Z13, and GMKtec EVO-X2 are the only three AI Max products currently available.

The deployment configuration depends on the model. We believe AI Max has potential for MoE models.
However, in practice, AI Max is slower than the M4 Max (although its GPU is much more powerful) because its 256 GB/s memory bandwidth is the bottleneck.
Hybrid iGPU + dGPU inference would be one way to overcome that, and that’s what we’re researching.
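
The bandwidth argument can be put into numbers with a simple upper bound: during token generation, every active weight has to be read from memory once per token, so tokens per second cannot exceed bandwidth divided by the bytes read per token. A minimal sketch, assuming roughly 1 byte per weight (about 8-bit quantization, ignoring scale overhead and KV-cache traffic) and taking the 30b-a3b naming to mean 30B total and 3B active parameters:

```python
def decode_upper_bound_tps(bandwidth_gb_s, active_params_b, bytes_per_weight):
    """Upper bound on tokens/s if each active weight is read once per token.

    bandwidth_gb_s   -- memory bandwidth in GB/s (10^9 bytes per second)
    active_params_b  -- parameters touched per token, in billions
    bytes_per_weight -- average storage per parameter (assumption)
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token
```

At 256 GB/s, `decode_upper_bound_tps(256, 30, 1.0)` is about 8.5 tokens/s for a dense 30B model, while `decode_upper_bound_tps(256, 3, 1.0)` is about 85 tokens/s when only 3B parameters are active per token. Real throughput will be lower, but the ratio illustrates why MoE models suit a machine with a lot of memory and modest bandwidth.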

Maybe the best approach is to work on llama.cpp (or use vLLM?) to implement tensor parallelism, as in Feature Request: Tensor Parallelism support · Issue #9086 · ggml-org/llama.cpp · GitHub (yes, it has been closed without any real action, and I don’t know whether it is possible in llama.cpp without big changes…)

My friend is a llama.cpp contributor, and we have discussed it.
In short, it’s not easy to modify, but we hope someone can do it, because the real bottleneck for AI Max is memory capacity, which tops out at 128 GB. You shouldn’t expect a meaningful performance increase from it, though.

Our method is similar to KTransformers. You can read up on KTransformers for details, but we don’t use it. Instead, we use llama.cpp (I forgot which PR made this possible).

You can’t get acceptable performance on large MoE models with the iGPU alone. However, if you quantize too aggressively (for example q2), quality suffers, and large models are significantly better than small ones (for example, 30b-a3b), so we built this solution.

I really can’t wait for my MAX to do some benchmarking/tuning…

If you have time, I am curious to see what you get with this benchmark: GitHub - adelj88/rocm_wmma_samples: Personal project to understand ROCm's WMMA intrinsics
And maybe some benchmarks with GitHub - Djip007/llama.cpp at feature/igpu (for BF16 on Mistral Small?), but maybe it is too tuned for the FW16…

# something like that...
./build_igpu/bin/llama-bench -ngl 999 --mmap 0 -r 3 -p 4096 -p 1 -p 1 -p 2 -p 4 -p 8 -p 16 -p 32 -p 64 -p 128 -p 256 -p 384 -p 512 -p 1024 -p 2048 -p 4096 -ctk bf16 -ctv bf16 -m Mistral-Small-24B-Instruct-2503-BF16.gguf 

(For now the QKV attention runs on the CPU; in fact, only mulmat with simple weights runs on the iGPU…)

Yes, I had a look too, and it may not even be possible… But it is interesting for more than multi-node: it could let multi-socket CPU systems work around the NUMA problem (and maybe bring gains in other cases).
For now I do not have the hardware to benchmark it…
Maybe vLLM can do this tensor parallelism, but I have never tested it.

Here you go.

I reinstalled Ubuntu Server 24.04 with the HWE kernel (6.11) and ROCm 6.4.1, reset the BIOS to defaults, and set it to performance mode.

jasl@jasl-gmk-x2:~/Workspaces/llama.cpp on feature/igpu$ HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build/igpu -DGGML_IGPU=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release -Dhip_DIR=/opt/rocm/lib/cmake/hip
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- HIP found
-- OpenMP_CXX_FLAGS -fopenmp
-- Including IGPU backend
-- Configuring done (0.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/jasl/Workspaces/llama.cpp/build/igpu

jasl@jasl-gmk-x2:~/Workspaces/llama.cpp on feature/igpu$ ./build/igpu/bin/llama-bench -ngl 999 --mmap 0 -r 3 -p 4096 -p 1 -p 1 -p 2 -p 4 -p 8 -p 16 -p 32 -p 64 -p 128 -p 256 -p 384 -p 512 -p 1024 -p 2048 -p 4096 -ctk bf16 -ctv bf16 -m ~/Mistral-Small-3.1-24B-Instruct-2503-BF16.gguf
ggml-igpu: backend[IGPU] create
ggml-igpu: device[IGPU<0>::0] added: AMD Radeon Graphics (gfx1151)
| model                          |       size |     params | backend    | ngl | type_k | type_v | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | ---: | ------------: | -------------------: |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |        pp4096 |        166.67 ± 0.31 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp1 |          3.38 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp1 |          3.39 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp2 |          6.36 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp4 |         12.40 ± 0.01 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp8 |         21.64 ± 0.02 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |          pp16 |         33.66 ± 0.04 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |          pp32 |         42.05 ± 0.06 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |          pp64 |         65.58 ± 0.33 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |         pp128 |        120.07 ± 0.79 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |         pp256 |        155.27 ± 0.47 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |         pp384 |        201.88 ± 0.55 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |         pp512 |        207.23 ± 2.23 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |        pp1024 |        201.68 ± 0.85 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |        pp2048 |        188.54 ± 0.34 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |        pp4096 |        166.53 ± 0.51 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |         tg128 |          3.39 ± 0.00 |

build: 209f72be (4637)
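
As a sanity check on the table above: for a dense model, each generated token reads roughly all the weights once, so the tg128 rate gives a rough estimate of the memory bandwidth actually achieved. A small sketch using the 43.91 GiB model size and 3.39 t/s from the table, compared against the 256 GB/s figure quoted earlier in the thread; KV-cache and activation traffic are ignored, so treat the result as approximate:

```python
GIB = 2**30


def effective_bandwidth_gb_s(model_size_gib, tokens_per_s):
    """GB/s of weight traffic implied by a dense model's decode speed."""
    return model_size_gib * GIB * tokens_per_s / 1e9


if __name__ == "__main__":
    bw = effective_bandwidth_gb_s(43.91, 3.39)
    print(f"{bw:.1f} GB/s, {bw / 256:.0%} of 256 GB/s")
```

This works out to roughly 160 GB/s, or a bit over 60% of the nominal 256 GB/s, which is consistent with token generation being limited by memory bandwidth and not by compute.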

rocm_wmma_samples can’t run to completion because ROCm 6.4.1 doesn’t support gfx1151 yet (rocBLAS ships no Tensile library for it):

rocBLAS error: Cannot read /opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1151
 List of available TensileLibrary Files :
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
[1]    16890 IOT instruction (core dumped)  ./hgemm/test

One test also fails:

[  FAILED  ] HGEMMTest/0.Size512, where TypeParam = KernelTypeWrapper<(kernel_type)0> (1037 ms)
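
The rocBLAS error above is useful on its own: it lists every architecture that ships a Tensile kernel library in this ROCm install. A small sketch that extracts that list from the error text, so you can check whether a target such as gfx1151 is covered; the regular expression assumes the `TensileLibrary_lazy_<arch>.dat` naming seen in the log:

```python
import re


def tensile_archs(log_text):
    """Return the sorted gfx targets named in TensileLibrary file paths."""
    pattern = r"TensileLibrary_lazy_(gfx[0-9a-f]+)\.dat"
    return sorted(set(re.findall(pattern, log_text)))


def has_arch(log_text, arch):
    """True if the log lists a Tensile library for the given gfx target."""
    return arch in tensile_archs(log_text)
```

Run against the error text above, `has_arch(log, "gfx1151")` is false while `has_arch(log, "gfx1100")` is true, which matches the crash in `./hgemm/test`.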

Bench output:

jasl@jasl-gmk-x2:~/Workspaces/rocm_wmma_samples/build on main$ ./hgemm/bench
2025-05-24T13:28:06+00:00
Running ./hgemm/bench
Run on (32 X 5185 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.53, 1.37, 3.41
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:1024,n:1024,k:1024}/manual_time                         1.06 ms         1.08 ms          670 TFLOPS=2.03403 bytes_per_second=5.54948Gi/s
{hgemm:kernel_type::shared,m:2048,n:2048,k:2048}/manual_time                         8.43 ms         8.45 ms           83 TFLOPS=2.03908 bytes_per_second=2.78172Gi/s
{hgemm:kernel_type::shared,m:4096,n:4096,k:4096}/manual_time                         68.4 ms         68.4 ms           10 TFLOPS=2.01074 bytes_per_second=1.37156Gi/s
{hgemm:kernel_type::shared,m:8192,n:8192,k:8192}/manual_time                          713 ms          713 ms            1 TFLOPS=1.54311 bytes_per_second=538.926Mi/s
{hgemm:kernel_type::wmma_naive,m:1024,n:1024,k:1024}/manual_time                     1.57 ms         1.60 ms          443 TFLOPS=1.36856 bytes_per_second=3.72741Gi/s
{hgemm:kernel_type::wmma_naive,m:2048,n:2048,k:2048}/manual_time                     6.01 ms         6.03 ms          117 TFLOPS=2.86303 bytes_per_second=3.90163Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:4096,k:4096}/manual_time                     49.0 ms         49.0 ms           14 TFLOPS=2.80638 bytes_per_second=1.91407Gi/s
{hgemm:kernel_type::wmma_naive,m:8192,n:8192,k:8192}/manual_time                      528 ms          529 ms            1 TFLOPS=2.08046 bytes_per_second=726.591Mi/s
{hgemm:kernel_type::wmma_shared,m:1024,n:1024,k:1024}/manual_time                   0.353 ms        0.379 ms         1981 TFLOPS=6.07873 bytes_per_second=16.5851Gi/s
{hgemm:kernel_type::wmma_shared,m:2048,n:2048,k:2048}/manual_time                    2.30 ms         2.32 ms          305 TFLOPS=7.48366 bytes_per_second=10.2095Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:4096,k:4096}/manual_time                    20.2 ms         20.3 ms           35 TFLOPS=6.79073 bytes_per_second=4.63207Gi/s
{hgemm:kernel_type::wmma_shared,m:8192,n:8192,k:8192}/manual_time                     220 ms          220 ms            3 TFLOPS=5.01002 bytes_per_second=1.70798Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:1024,n:1024,k:1024}/manual_time              0.269 ms        0.295 ms         2601 TFLOPS=7.97663 bytes_per_second=21.7633Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:2048,n:2048,k:2048}/manual_time               1.23 ms         1.26 ms          570 TFLOPS=13.9681 bytes_per_second=19.0545Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:4096,k:4096}/manual_time               10.5 ms         10.5 ms           67 TFLOPS=13.1256 bytes_per_second=8.95295Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:8192,n:8192,k:8192}/manual_time               89.8 ms         89.9 ms            8 TFLOPS=12.2387 bytes_per_second=4.17412Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:1024,n:1024,k:1024}/manual_time          0.259 ms        0.285 ms         2707 TFLOPS=8.30634 bytes_per_second=22.6626Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:2048,n:2048,k:2048}/manual_time           1.20 ms         1.22 ms          583 TFLOPS=14.3428 bytes_per_second=19.5654Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:4096,k:4096}/manual_time           10.2 ms         10.2 ms           69 TFLOPS=13.455 bytes_per_second=9.17769Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:8192,n:8192,k:8192}/manual_time           88.2 ms         88.2 ms            8 TFLOPS=12.4716 bytes_per_second=4.25356Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:1024,n:1024,k:1024}/manual_time          0.106 ms        0.132 ms         6592 TFLOPS=20.2275 bytes_per_second=55.1748Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:2048,n:2048,k:2048}/manual_time          0.592 ms        0.619 ms         1188 TFLOPS=29.0207 bytes_per_second=39.59Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:4096,k:4096}/manual_time           4.52 ms         4.54 ms          156 TFLOPS=30.4201 bytes_per_second=20.7493Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:8192,n:8192,k:8192}/manual_time           41.2 ms         41.2 ms           17 TFLOPS=26.7309 bytes_per_second=9.10732Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:1024,n:1024,k:1024}/manual_time      0.104 ms        0.130 ms         6713 TFLOPS=20.6082 bytes_per_second=56.2039Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:2048,n:2048,k:2048}/manual_time      0.595 ms        0.622 ms         1181 TFLOPS=28.893 bytes_per_second=39.416Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:4096,k:4096}/manual_time       4.18 ms         4.20 ms          168 TFLOPS=32.9175 bytes_per_second=22.4535Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:8192,n:8192,k:8192}/manual_time       50.2 ms         50.2 ms           10 TFLOPS=21.9824 bytes_per_second=7.46875Gi/s
{hgemm:kernel_type::wmma_prefetch,m:1024,n:1024,k:1024}/manual_time                 0.106 ms        0.132 ms         6585 TFLOPS=20.2081 bytes_per_second=55.1271Gi/s
{hgemm:kernel_type::wmma_prefetch,m:2048,n:2048,k:2048}/manual_time                 0.637 ms        0.664 ms         1100 TFLOPS=26.9905 bytes_per_second=36.8211Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:4096,k:4096}/manual_time                  4.27 ms         4.30 ms          164 TFLOPS=32.1968 bytes_per_second=21.9618Gi/s
{hgemm:kernel_type::wmma_prefetch,m:8192,n:8192,k:8192}/manual_time                  54.4 ms         54.4 ms           12 TFLOPS=20.2429 bytes_per_second=6.8908Gi/s
{hgemm:kernel_type::wmma_opt_1,m:1024,n:1024,k:1024}/manual_time                    0.095 ms        0.121 ms         7359 TFLOPS=22.5902 bytes_per_second=61.6273Gi/s
{hgemm:kernel_type::wmma_opt_1,m:2048,n:2048,k:2048}/manual_time                    0.555 ms        0.583 ms         1283 TFLOPS=30.9499 bytes_per_second=42.2188Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:4096,k:4096}/manual_time                     3.59 ms         3.62 ms          195 TFLOPS=38.2431 bytes_per_second=26.0856Gi/s
{hgemm:kernel_type::wmma_opt_1,m:8192,n:8192,k:8192}/manual_time                     59.4 ms         59.4 ms           10 TFLOPS=18.5368 bytes_per_second=6.31439Gi/s
{hgemm:kernel_type::wmma_opt_2,m:1024,n:1024,k:1024}/manual_time                    0.128 ms        0.154 ms         5449 TFLOPS=16.7178 bytes_per_second=45.613Gi/s
{hgemm:kernel_type::wmma_opt_2,m:2048,n:2048,k:2048}/manual_time                    0.501 ms        0.529 ms         1000 TFLOPS=34.2711 bytes_per_second=46.7504Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time                     3.32 ms         3.35 ms          211 TFLOPS=41.3831 bytes_per_second=28.2276Gi/s
{hgemm:kernel_type::wmma_opt_2,m:8192,n:8192,k:8192}/manual_time                     25.9 ms         26.0 ms           27 TFLOPS=42.379 bytes_per_second=14.4514Gi/s
{hgemm:kernel_type::wmma_opt_3,m:1024,n:1024,k:1024}/manual_time                    0.142 ms        0.168 ms         4920 TFLOPS=15.0917 bytes_per_second=41.1765Gi/s
{hgemm:kernel_type::wmma_opt_3,m:2048,n:2048,k:2048}/manual_time                    0.540 ms        0.568 ms         1312 TFLOPS=31.7925 bytes_per_second=43.3703Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:4096,k:4096}/manual_time                     3.54 ms         3.57 ms          198 TFLOPS=38.7907 bytes_per_second=26.4599Gi/s
{hgemm:kernel_type::wmma_opt_3,m:8192,n:8192,k:8192}/manual_time                     26.8 ms         26.8 ms           26 TFLOPS=41.0581 bytes_per_second=14.0031Gi/s
{hgemm:kernel_type::wmma_opt_4,m:1024,n:1024,k:1024}/manual_time                    0.127 ms        0.153 ms         5509 TFLOPS=16.897 bytes_per_second=46.1011Gi/s
{hgemm:kernel_type::wmma_opt_4,m:2048,n:2048,k:2048}/manual_time                    0.498 ms        0.525 ms         1409 TFLOPS=34.4888 bytes_per_second=47.047Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time                     3.30 ms         3.32 ms          214 TFLOPS=41.6838 bytes_per_second=28.4324Gi/s
{hgemm:kernel_type::wmma_opt_4,m:8192,n:8192,k:8192}/manual_time                     25.1 ms         25.1 ms           28 TFLOPS=43.7682 bytes_per_second=14.9275Gi/s

rocBLAS error: Cannot read /opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1151
 List of available TensileLibrary Files :
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
[1]    17119 IOT instruction (core dumped)  ./hgemm/bench
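
For reading the benchmark columns above: the TFLOPS and bytes_per_second counters appear to follow the usual GEMM accounting of 2·m·n·k floating-point operations and one pass over the A, B, and C matrices at 2 bytes per FP16 element. These formulas are inferred from the printed numbers (they reproduce them to within the rounding of the Time column), not taken from the benchmark source:

```python
def hgemm_tflops(m, n, k, seconds):
    """TFLOPS for C = A @ B, counting one multiply and one add per MAC."""
    return 2 * m * n * k / seconds / 1e12


def hgemm_gib_per_s(m, n, k, seconds, bytes_per_elem=2):
    """GiB/s assuming A, B, and C are each moved once (FP16 = 2 bytes)."""
    total_bytes = (m * k + k * n + m * n) * bytes_per_elem
    return total_bytes / seconds / 2**30


if __name__ == "__main__":
    # wmma_opt_4 at m = n = k = 8192 took 25.1 ms in the run above.
    print(hgemm_tflops(8192, 8192, 8192, 25.1e-3))
    print(hgemm_gib_per_s(8192, 8192, 8192, 25.1e-3))
```

For that row the sketch gives about 43.8 TFLOPS and about 14.9 GiB/s, against the reported 43.7682 and 14.9275.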

Thanks…
My latest bench gives (when it doesn’t crash…):

| model                          |       size |     params | backend    | ngl | type_k | type_v | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | ---: | --------------: | -------------------: |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |             pp1 |          1.73 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |             pp1 |          1.73 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |             pp2 |          3.42 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |             pp4 |          6.65 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |             pp8 |         12.58 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |            pp16 |         22.42 ± 0.02 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |            pp32 |         30.25 ± 0.03 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |            pp64 |         53.51 ± 0.03 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp128 |         69.58 ± 0.07 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp256 |         86.45 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp384 |         85.78 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           pp512 |         88.10 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |          pp1024 |         85.78 ± 0.00 |
| llama 13B BF16                 |  43.91 GiB |    23.57 B | IGPU       | 999 |   bf16 |   bf16 |    0 |           tg128 |          1.72 ± 0.00 |

Not that bad, but it needs more work/tuning.

I can’t build rocm_wmma_samples either (and I may need to retest…).
43 TFLOPS looks promising; I think I only get <15 with my kernel, so it looks like there is some room for improvement.

I’m glad to see people starting to unleash Strix Halo’s true power.
You can ping me here if you need to do some tests on AI Max.

I would not recommend buying the EVO-X2 because it has various issues, but it is currently the only AI Max desktop.

Thanks. If I have time I may give you some more to bench… but I need more time :wink:

I did test the latest hgemm … and it is interesting.
With the default config I get:

-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:1024,n:1024,k:1024}/manual_time                         3.48 ms         3.36 ms          201 TFLOPS=0.617884 bytes_per_second=1.68587Gi/s
{hgemm:kernel_type::shared,m:2048,n:2048,k:2048}/manual_time                         30.6 ms         29.4 ms           23 TFLOPS=0.561361 bytes_per_second=784.208Mi/s
{hgemm:kernel_type::shared,m:4096,n:4096,k:4096}/manual_time                          244 ms          243 ms            3 TFLOPS=0.563502 bytes_per_second=393.601Mi/s
{hgemm:kernel_type::shared,m:8192,n:8192,k:8192}/manual_time                         1960 ms         1943 ms            1 TFLOPS=0.560897 bytes_per_second=195.891Mi/s
{hgemm:kernel_type::wmma_naive,m:1024,n:1024,k:1024}/manual_time                     2.24 ms         2.26 ms          296 TFLOPS=0.957727 bytes_per_second=2.61292Gi/s
{hgemm:kernel_type::wmma_naive,m:2048,n:2048,k:2048}/manual_time                     13.8 ms         13.7 ms           50 TFLOPS=1.24787 bytes_per_second=1.70183Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:4096,k:4096}/manual_time                      175 ms          175 ms            4 TFLOPS=0.78327 bytes_per_second=547.107Mi/s
{hgemm:kernel_type::wmma_naive,m:8192,n:8192,k:8192}/manual_time                     1607 ms         1602 ms            1 TFLOPS=0.68412 bytes_per_second=238.926Mi/s
{hgemm:kernel_type::wmma_shared,m:1024,n:1024,k:1024}/manual_time                   0.933 ms        0.950 ms          558 TFLOPS=2.30206 bytes_per_second=6.27989Gi/s
{hgemm:kernel_type::wmma_shared,m:2048,n:2048,k:2048}/manual_time                    10.5 ms         10.5 ms           64 TFLOPS=1.64068 bytes_per_second=2.2365Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:4096,k:4096}/manual_time                    77.6 ms         77.4 ms            8 TFLOPS=1.77166 bytes_per_second=1.20845Gi/s
{hgemm:kernel_type::wmma_shared,m:8192,n:8192,k:8192}/manual_time                     716 ms          714 ms            1 TFLOPS=1.53489 bytes_per_second=536.056Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:1024,n:1024,k:1024}/manual_time               3.92 ms         3.93 ms          178 TFLOPS=0.548059 bytes_per_second=1.49521Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:2048,n:2048,k:2048}/manual_time               30.9 ms         30.9 ms           22 TFLOPS=0.555448 bytes_per_second=775.73Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:4096,k:4096}/manual_time                250 ms          250 ms            3 TFLOPS=0.549295 bytes_per_second=383.668Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:8192,n:8192,k:8192}/manual_time               2052 ms         2047 ms            1 TFLOPS=0.535741 bytes_per_second=187.105Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:1024,n:1024,k:1024}/manual_time           3.84 ms         3.84 ms          182 TFLOPS=0.559416 bytes_per_second=1.52621Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:2048,n:2048,k:2048}/manual_time           40.8 ms         40.7 ms           17 TFLOPS=0.420842 bytes_per_second=587.726Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:4096,k:4096}/manual_time            252 ms          252 ms            3 TFLOPS=0.544357 bytes_per_second=380.219Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:8192,n:8192,k:8192}/manual_time           2038 ms         1987 ms            1 TFLOPS=0.539572 bytes_per_second=188.443Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:1024,n:1024,k:1024}/manual_time           4.00 ms         3.98 ms          175 TFLOPS=0.536504 bytes_per_second=1.46324Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:2048,n:2048,k:2048}/manual_time           37.1 ms         36.9 ms           19 TFLOPS=0.462848 bytes_per_second=646.526Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:4096,k:4096}/manual_time            315 ms          314 ms            2 TFLOPS=0.435824 bytes_per_second=304.42Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:8192,n:8192,k:8192}/manual_time           2599 ms         2591 ms            1 TFLOPS=0.423027 bytes_per_second=147.74Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:1024,n:1024,k:1024}/manual_time       3.67 ms         3.68 ms          190 TFLOPS=0.604062 bytes_per_second=1.59762Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:2048,n:2048,k:2048}/manual_time       35.0 ms         34.5 ms           19 TFLOPS=0.491014 bytes_per_second=685.921Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:4096,k:4096}/manual_time        302 ms          301 ms            2 TFLOPS=0.455492 bytes_per_second=318.156Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:8192,n:8192,k:8192}/manual_time       2446 ms         2439 ms            1 TFLOPS=0.449461 bytes_per_second=156.973Mi/s
{hgemm:kernel_type::wmma_prefetch,m:1024,n:1024,k:1024}/manual_time                  2.20 ms         2.22 ms          314 TFLOPS=0.975214 bytes_per_second=2.65925Gi/s
{hgemm:kernel_type::wmma_prefetch,m:2048,n:2048,k:2048}/manual_time                  21.5 ms         21.4 ms           32 TFLOPS=0.800195 bytes_per_second=1.09158Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:4096,k:4096}/manual_time                   182 ms          182 ms            4 TFLOPS=0.757997 bytes_per_second=527.1Mi/s
{hgemm:kernel_type::wmma_prefetch,m:8192,n:8192,k:8192}/manual_time                  1623 ms         1618 ms            1 TFLOPS=0.677647 bytes_per_second=236.666Mi/s
{hgemm:kernel_type::wmma_opt_1,m:1024,n:1024,k:1024}/manual_time                     4.21 ms         4.21 ms          164 TFLOPS=0.517805 bytes_per_second=1.39336Gi/s
{hgemm:kernel_type::wmma_opt_1,m:2048,n:2048,k:2048}/manual_time                     41.1 ms         39.5 ms           17 TFLOPS=0.418247 bytes_per_second=584.271Mi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:4096,k:4096}/manual_time                      338 ms          335 ms            2 TFLOPS=0.406989 bytes_per_second=284.279Mi/s
{hgemm:kernel_type::wmma_opt_1,m:8192,n:8192,k:8192}/manual_time                     2715 ms         2608 ms            1 TFLOPS=0.40491 bytes_per_second=141.413Mi/s
{hgemm:kernel_type::wmma_opt_2,m:1024,n:1024,k:1024}/manual_time                     1.88 ms         1.89 ms          361 TFLOPS=1.14198 bytes_per_second=3.11308Gi/s
{hgemm:kernel_type::wmma_opt_2,m:2048,n:2048,k:2048}/manual_time                     16.7 ms         16.2 ms           42 TFLOPS=1.02704 bytes_per_second=1.40103Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time                      140 ms          140 ms            5 TFLOPS=0.979427 bytes_per_second=684.114Mi/s
{hgemm:kernel_type::wmma_opt_2,m:8192,n:8192,k:8192}/manual_time                     1139 ms         1094 ms            1 TFLOPS=0.965238 bytes_per_second=337.106Mi/s
{hgemm:kernel_type::wmma_opt_3,m:1024,n:1024,k:1024}/manual_time                     1.55 ms         1.56 ms          425 TFLOPS=1.39208 bytes_per_second=3.79245Gi/s
{hgemm:kernel_type::wmma_opt_3,m:2048,n:2048,k:2048}/manual_time                     13.8 ms         13.8 ms           50 TFLOPS=1.2434 bytes_per_second=1.69582Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:4096,k:4096}/manual_time                      119 ms          118 ms            6 TFLOPS=1.15431 bytes_per_second=806.265Mi/s
{hgemm:kernel_type::wmma_opt_3,m:8192,n:8192,k:8192}/manual_time                     1040 ms         1036 ms            1 TFLOPS=1.05743 bytes_per_second=369.304Mi/s
{hgemm:kernel_type::wmma_opt_4,m:1024,n:1024,k:1024}/manual_time                    0.496 ms        0.516 ms         1357 TFLOPS=4.33507 bytes_per_second=11.8025Gi/s
{hgemm:kernel_type::wmma_opt_4,m:2048,n:2048,k:2048}/manual_time                     3.49 ms         3.50 ms          189 TFLOPS=4.95832 bytes_per_second=6.71357Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time                     34.5 ms         34.4 ms           20 TFLOPS=3.98225 bytes_per_second=2.7157Gi/s
{hgemm:kernel_type::wmma_opt_4,m:8192,n:8192,k:8192}/manual_time                      310 ms          309 ms            2 TFLOPS=3.54481 bytes_per_second=1.209Gi/s
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time                       0.811 ms        0.831 ms          849 TFLOPS=2.64929 bytes_per_second=7.22589Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time                        6.35 ms         6.36 ms           91 TFLOPS=2.71682 bytes_per_second=3.68985Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time                        28.6 ms         28.5 ms           24 TFLOPS=4.81047 bytes_per_second=3.28086Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time                         227 ms          226 ms            3 TFLOPS=4.85445 bytes_per_second=1.65531Gi/s

As you can see, it is horrible :wink:

For this APU we need to change the config size to 4x2 / 2x4 (instead of 4x4 / 4x4). With that I get:

-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:1024,n:1024,k:1024}/manual_time                         3.47 ms         3.48 ms          200 TFLOPS=0.619061 bytes_per_second=1.68908Gi/s
{hgemm:kernel_type::shared,m:2048,n:2048,k:2048}/manual_time                         29.6 ms         29.5 ms           23 TFLOPS=0.58104 bytes_per_second=811.68Mi/s
{hgemm:kernel_type::shared,m:4096,n:4096,k:4096}/manual_time                          245 ms          244 ms            3 TFLOPS=0.561046 bytes_per_second=391.886Mi/s
{hgemm:kernel_type::shared,m:8192,n:8192,k:8192}/manual_time                         1947 ms         1931 ms            1 TFLOPS=0.564614 bytes_per_second=197.189Mi/s
{hgemm:kernel_type::wmma_naive,m:1024,n:1024,k:1024}/manual_time                     2.24 ms         2.26 ms          292 TFLOPS=0.957852 bytes_per_second=2.61325Gi/s
{hgemm:kernel_type::wmma_naive,m:2048,n:2048,k:2048}/manual_time                     13.4 ms         13.4 ms           52 TFLOPS=1.28136 bytes_per_second=1.7469Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:4096,k:4096}/manual_time                      169 ms          169 ms            4 TFLOPS=0.812453 bytes_per_second=567.486Mi/s
{hgemm:kernel_type::wmma_naive,m:8192,n:8192,k:8192}/manual_time                     1599 ms         1594 ms            1 TFLOPS=0.687432 bytes_per_second=240.083Mi/s
{hgemm:kernel_type::wmma_shared,m:1024,n:1024,k:1024}/manual_time                    1.01 ms         1.03 ms          535 TFLOPS=2.13061 bytes_per_second=5.81247Gi/s
{hgemm:kernel_type::wmma_shared,m:2048,n:2048,k:2048}/manual_time                    9.44 ms         9.43 ms           68 TFLOPS=1.82273 bytes_per_second=2.48359Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:4096,k:4096}/manual_time                    84.6 ms         84.4 ms            7 TFLOPS=1.62375 bytes_per_second=1.10759Gi/s
{hgemm:kernel_type::wmma_shared,m:8192,n:8192,k:8192}/manual_time                     893 ms          891 ms            1 TFLOPS=1.23072 bytes_per_second=429.824Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:1024,n:1024,k:1024}/manual_time              0.862 ms        0.880 ms          594 TFLOPS=2.49131 bytes_per_second=6.79626Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:2048,n:2048,k:2048}/manual_time               10.2 ms         10.2 ms           94 TFLOPS=1.68698 bytes_per_second=2.2998Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:4096,k:4096}/manual_time               82.2 ms         82.0 ms            9 TFLOPS=1.67288 bytes_per_second=1.14071Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:8192,n:8192,k:8192}/manual_time                747 ms          745 ms            1 TFLOPS=1.4718 bytes_per_second=514.019Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:1024,n:1024,k:1024}/manual_time          0.966 ms        0.983 ms          818 TFLOPS=2.22354 bytes_per_second=6.06639Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:2048,n:2048,k:2048}/manual_time           8.98 ms         8.96 ms           71 TFLOPS=1.91643 bytes_per_second=2.61072Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:4096,k:4096}/manual_time           68.2 ms         68.0 ms            8 TFLOPS=2.01666 bytes_per_second=1.37555Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:8192,n:8192,k:8192}/manual_time            741 ms          739 ms            1 TFLOPS=1.48344 bytes_per_second=518.086Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:1024,n:1024,k:1024}/manual_time          0.285 ms        0.306 ms         2239 TFLOPS=7.53191 bytes_per_second=20.5462Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:2048,n:2048,k:2048}/manual_time           2.81 ms         2.82 ms          244 TFLOPS=6.16142 bytes_per_second=8.34298Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:4096,k:4096}/manual_time           30.9 ms         30.9 ms           22 TFLOPS=4.44535 bytes_per_second=3.03155Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:8192,n:8192,k:8192}/manual_time            362 ms          361 ms            2 TFLOPS=3.03907 bytes_per_second=1.0365Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:1024,n:1024,k:1024}/manual_time      0.275 ms        0.296 ms         2341 TFLOPS=7.80504 bytes_per_second=21.2901Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:2048,n:2048,k:2048}/manual_time       2.60 ms         2.61 ms          247 TFLOPS=6.6617 bytes_per_second=9.03007Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:4096,k:4096}/manual_time       35.5 ms         35.4 ms           20 TFLOPS=3.8728 bytes_per_second=2.64106Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:8192,n:8192,k:8192}/manual_time        387 ms          386 ms            2 TFLOPS=2.84352 bytes_per_second=993.086Mi/s
{hgemm:kernel_type::wmma_prefetch,m:1024,n:1024,k:1024}/manual_time                 0.281 ms        0.301 ms         2296 TFLOPS=7.65088 bytes_per_second=20.8711Gi/s
{hgemm:kernel_type::wmma_prefetch,m:2048,n:2048,k:2048}/manual_time                  2.71 ms         2.72 ms          256 TFLOPS=6.38203 bytes_per_second=8.64167Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:4096,k:4096}/manual_time                  32.6 ms         32.5 ms           22 TFLOPS=4.23867 bytes_per_second=2.87918Gi/s
{hgemm:kernel_type::wmma_prefetch,m:8192,n:8192,k:8192}/manual_time                   384 ms          383 ms            2 TFLOPS=2.86278 bytes_per_second=999.649Mi/s
{hgemm:kernel_type::wmma_opt_1,m:1024,n:1024,k:1024}/manual_time                    0.262 ms        0.284 ms         2448 TFLOPS=8.18948 bytes_per_second=22.3256Gi/s
{hgemm:kernel_type::wmma_opt_1,m:2048,n:2048,k:2048}/manual_time                     2.48 ms         2.50 ms          259 TFLOPS=6.95749 bytes_per_second=9.43684Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:4096,k:4096}/manual_time                     32.0 ms         31.9 ms           21 TFLOPS=4.29709 bytes_per_second=2.92791Gi/s
{hgemm:kernel_type::wmma_opt_1,m:8192,n:8192,k:8192}/manual_time                      373 ms          372 ms            2 TFLOPS=2.94894 bytes_per_second=1.00576Gi/s
{hgemm:kernel_type::wmma_opt_2,m:1024,n:1024,k:1024}/manual_time                    0.224 ms        0.245 ms         2825 TFLOPS=9.6022 bytes_per_second=26.1952Gi/s
{hgemm:kernel_type::wmma_opt_2,m:2048,n:2048,k:2048}/manual_time                     1.66 ms         1.67 ms          342 TFLOPS=10.5321 bytes_per_second=14.1535Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time                     12.7 ms         12.7 ms           47 TFLOPS=10.7918 bytes_per_second=7.3593Gi/s
{hgemm:kernel_type::wmma_opt_2,m:8192,n:8192,k:8192}/manual_time                      146 ms          146 ms            6 TFLOPS=7.63096 bytes_per_second=2.56267Gi/s
{hgemm:kernel_type::wmma_opt_3,m:1024,n:1024,k:1024}/manual_time                    0.371 ms        0.391 ms         1846 TFLOPS=5.79339 bytes_per_second=15.8025Gi/s
{hgemm:kernel_type::wmma_opt_3,m:2048,n:2048,k:2048}/manual_time                     2.83 ms         2.83 ms          241 TFLOPS=6.12953 bytes_per_second=8.28346Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:4096,k:4096}/manual_time                     22.1 ms         22.1 ms           31 TFLOPS=6.21059 bytes_per_second=4.23623Gi/s
{hgemm:kernel_type::wmma_opt_3,m:8192,n:8192,k:8192}/manual_time                      209 ms          208 ms            3 TFLOPS=5.261 bytes_per_second=1.79432Gi/s
{hgemm:kernel_type::wmma_opt_4,m:1024,n:1024,k:1024}/manual_time                    0.240 ms        0.261 ms         2650 TFLOPS=8.94375 bytes_per_second=24.3992Gi/s
{hgemm:kernel_type::wmma_opt_4,m:2048,n:2048,k:2048}/manual_time                     1.84 ms         1.86 ms          318 TFLOPS=9.50211 bytes_per_second=12.7354Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time                     14.6 ms         14.6 ms           47 TFLOPS=9.43729 bytes_per_second=6.4159Gi/s
{hgemm:kernel_type::wmma_opt_4,m:8192,n:8192,k:8192}/manual_time                      142 ms          142 ms            5 TFLOPS=7.87731 bytes_per_second=2.63676Gi/s
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time                       0.809 ms        0.830 ms          861 TFLOPS=2.65381 bytes_per_second=7.2403Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time                        6.33 ms         6.33 ms           91 TFLOPS=2.72738 bytes_per_second=3.70353Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time                        28.2 ms         28.1 ms           25 TFLOPS=4.88062 bytes_per_second=3.32902Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time                         224 ms          224 ms            3 TFLOPS=4.89919 bytes_per_second=1.67091Gi/s
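
With this many rows it is easy to lose track of which kernel wins at each size. Below is a minimal sketch that pulls the kernel name, shape and `TFLOPS` counter out of the pasted rows and keeps the fastest kernel per shape. The regex is an assumption based only on the row layout seen in this thread (times reported in ms), and `best_per_shape` is a hypothetical helper, not part of the benchmark.

```python
import re

# Matches rows in the layout pasted in this thread, for example:
# {hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time  12.7 ms  12.7 ms  47 TFLOPS=10.7918 ...
# Assumption: times are printed in ms; rows using another unit are skipped.
ROW = re.compile(
    r"\{hgemm:kernel_type::(?P<kernel>\w+),m:(?P<m>\d+),n:(?P<n>\d+),k:(?P<k>\d+)\}"
    r"/manual_time\s+(?P<time>[\d.]+) ms.*?TFLOPS=(?P<tflops>[\d.]+)"
)


def best_per_shape(text):
    """Return {(m, n, k): (kernel, tflops)}, keeping the fastest kernel per shape."""
    best = {}
    for match in ROW.finditer(text):
        shape = (int(match["m"]), int(match["n"]), int(match["k"]))
        tflops = float(match["tflops"])
        if shape not in best or tflops > best[shape][1]:
            best[shape] = (match["kernel"], tflops)
    return best


# Three rows copied from the table above.
SAMPLE = """\
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time                     12.7 ms         12.7 ms           47 TFLOPS=10.7918 bytes_per_second=7.3593Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time                     14.6 ms         14.6 ms           47 TFLOPS=9.43729 bytes_per_second=6.4159Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time                        28.2 ms         28.1 ms           25 TFLOPS=4.88062 bytes_per_second=3.32902Gi/s
"""

for shape, (kernel, tflops) in best_per_shape(SAMPLE).items():
    print(shape, kernel, tflops)
```

Feeding it the whole pasted table instead of `SAMPLE` gives one winner per (m, n, k).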


-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:4096,n:128,k:16384}/manual_time                         37.9 ms         37.9 ms           19 TFLOPS=0.452836 bytes_per_second=3.42343Gi/s
{hgemm:kernel_type::shared,m:4096,n:256,k:16384}/manual_time                         75.4 ms         75.2 ms            9 TFLOPS=0.455428 bytes_per_second=1.78626Gi/s
{hgemm:kernel_type::shared,m:4096,n:512,k:16384}/manual_time                          151 ms          151 ms            5 TFLOPS=0.454019 bytes_per_second=977.804Mi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:128,k:16384}/manual_time                     36.9 ms         36.8 ms           18 TFLOPS=0.465934 bytes_per_second=3.52246Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:256,k:16384}/manual_time                     77.3 ms         77.1 ms            9 TFLOPS=0.444647 bytes_per_second=1.74387Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:512,k:16384}/manual_time                      150 ms          149 ms            5 TFLOPS=0.459304 bytes_per_second=989.148Mi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:128,k:16384}/manual_time                    13.7 ms         13.7 ms           54 TFLOPS=1.25153 bytes_per_second=9.45662Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:256,k:16384}/manual_time                    26.0 ms         25.9 ms           25 TFLOPS=1.32278 bytes_per_second=5.18818Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:512,k:16384}/manual_time                    52.5 ms         52.4 ms           12 TFLOPS=1.30833 bytes_per_second=2.75167Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:128,k:16384}/manual_time               12.9 ms         12.9 ms           53 TFLOPS=1.32921 bytes_per_second=10.0403Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:256,k:16384}/manual_time               20.6 ms         20.6 ms           34 TFLOPS=1.66657 bytes_per_second=6.53644Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:512,k:16384}/manual_time               44.7 ms         44.6 ms           16 TFLOPS=1.53647 bytes_per_second=3.23149Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:128,k:16384}/manual_time           13.0 ms         13.0 ms           53 TFLOPS=1.32251 bytes_per_second=9.99036Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:256,k:16384}/manual_time           20.2 ms         20.1 ms           34 TFLOPS=1.70177 bytes_per_second=6.67452Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:512,k:16384}/manual_time           43.4 ms         43.3 ms           16 TFLOPS=1.58234 bytes_per_second=3.32796Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:128,k:16384}/manual_time           3.02 ms         3.02 ms          226 TFLOPS=5.78647 bytes_per_second=43.067Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:256,k:16384}/manual_time           5.40 ms         5.40 ms          103 TFLOPS=6.41001 bytes_per_second=24.9471Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:512,k:16384}/manual_time           19.1 ms         19.0 ms           37 TFLOPS=3.60631 bytes_per_second=7.5767Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:128,k:16384}/manual_time       2.70 ms         2.71 ms          242 TFLOPS=6.49395 bytes_per_second=48.0907Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:256,k:16384}/manual_time       5.26 ms         5.26 ms          107 TFLOPS=6.587 bytes_per_second=25.6191Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:512,k:16384}/manual_time       18.5 ms         18.5 ms           37 TFLOPS=3.71419 bytes_per_second=7.7919Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:128,k:16384}/manual_time                  2.43 ms         2.44 ms          264 TFLOPS=7.2384 bytes_per_second=53.433Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:256,k:16384}/manual_time                  4.81 ms         4.81 ms          113 TFLOPS=7.21005 bytes_per_second=28.0302Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:512,k:16384}/manual_time                  13.9 ms         13.9 ms           48 TFLOPS=5.03227 bytes_per_second=10.4089Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:128,k:16384}/manual_time                     2.50 ms         2.50 ms          257 TFLOPS=7.02371 bytes_per_second=52.0044Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:256,k:16384}/manual_time                     4.61 ms         4.61 ms          114 TFLOPS=7.52881 bytes_per_second=29.2088Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:512,k:16384}/manual_time                     16.8 ms         16.7 ms           41 TFLOPS=4.12055 bytes_per_second=8.62215Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:128,k:16384}/manual_time                     2.12 ms         2.13 ms          297 TFLOPS=8.26598 bytes_per_second=61.2295Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:256,k:16384}/manual_time                     3.46 ms         3.47 ms          200 TFLOPS=10.0568 bytes_per_second=38.9209Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:512,k:16384}/manual_time                     6.85 ms         6.84 ms           88 TFLOPS=10.3186 bytes_per_second=21.1117Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:128,k:16384}/manual_time                     4.59 ms         4.48 ms          114 TFLOPS=3.77939 bytes_per_second=28.2773Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:256,k:16384}/manual_time                     7.70 ms         7.69 ms           81 TFLOPS=4.47885 bytes_per_second=17.497Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:512,k:16384}/manual_time                     13.3 ms         12.8 ms           50 TFLOPS=5.175 bytes_per_second=10.8775Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:128,k:16384}/manual_time                     3.27 ms         3.27 ms          206 TFLOPS=5.31326 bytes_per_second=39.7381Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:256,k:16384}/manual_time                     4.15 ms         4.00 ms          166 TFLOPS=8.36172 bytes_per_second=32.447Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:512,k:16384}/manual_time                     8.10 ms         7.87 ms           77 TFLOPS=8.50537 bytes_per_second=17.835Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:128,k:16384}/manual_time                        8.57 ms         8.56 ms           76 TFLOPS=2.00911 bytes_per_second=15.1509Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:256,k:16384}/manual_time                        16.6 ms         16.5 ms           43 TFLOPS=2.07607 bytes_per_second=8.14096Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:512,k:16384}/manual_time                        33.2 ms         33.1 ms           20 TFLOPS=2.07233 bytes_per_second=4.35838Gi/s

The peak perf is ~18 TFLOPS…

What is interesting is that, for now, my kernel only achieves ~5 TFLOPS, so it looks like I can do better.
The other thing is that I tuned it for my GPU, so it is probably not the best fit for the MAX GPU.

I need to look more closely, but I should be able to get gains with my kernel on the “old” Ryzen 7940HS… and a lot more on the MAX (x3 ???)

:crossed_fingers:

(Keep in mind that the bench computes A[fp16]@B[fp16]=C[fp16], whereas we need to compute trans(A[fp16/bf16])@B[fp32]=C[fp32]…)
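
To make the difference between the two operations concrete, here is a tiny pure-Python reference for the second form. It is only a sketch: the storage layout (A kept as k×m so that trans(A) is m×k) and the fp32 accumulation are assumptions, and `matmul_at_b` is a hypothetical name. Half and single precision are emulated by round-tripping through the `struct` module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul_at_b(a, b):
    """C = trans(A) @ B with A as fp16 (k x m), B as fp32 (k x n), C as fp32 (m x n).

    Assumption: the accumulator is fp32, emulated by rounding after every add.
    """
    k, m, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc = to_fp32(acc + to_fp16(a[p][i]) * to_fp32(b[p][j]))
            c[i][j] = acc
    return c


a = [[1.0, 2.0], [3.0, 4.0]]  # k = 2, m = 2
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the result is trans(A)
print(matmul_at_b(a, b))

# An fp16 output cannot hold every value an fp32 output can:
print(to_fp16(4097.0), to_fp32(4097.0))
```

The last line is the reason the output type matters: above 4096 an fp16 result can only move in steps of 4, while fp32 keeps the exact value.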

What do you get with rocminfo? Here is mine:

  Name:                    gfx1103                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon 780M                    
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5567(0x15bf)                       
  ASIC Revision:           7(0x7)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2799                               
  BDFID:                   49920                              
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 40                                 
  SDMA engine uCode::      21                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    28688300(0x1b5bfac) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    28688300(0x1b5bfac) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1103         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 

Curious to know what the differences are…
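
For comparing two dumps, a small script is less error-prone than reading them side by side. This is a minimal sketch, assuming each dump covers a single agent block in the `Key: value` layout shown above. `parse_rocminfo` and `diff_fields` are hypothetical helpers, and the file name in the usage note is a placeholder.

```python
def parse_rocminfo(text):
    """Collect 'Key: value' pairs from one rocminfo agent block (first occurrence wins)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip(" :")
        # Skip section headers such as 'Cache Info:' and lines without a colon.
        if sep and value and key not in fields:
            fields[key] = value
    return fields


def diff_fields(mine, theirs, keys):
    """Return {key: (mine, theirs)} for the selected keys whose values differ."""
    return {k: (mine.get(k), theirs.get(k)) for k in keys if mine.get(k) != theirs.get(k)}


# Excerpt of the 780M dump posted above.
SAMPLE = """\
  Name:                    gfx1103
  Marketing Name:          AMD Radeon 780M
  Max Clock Freq. (MHz):   2799
  Compute Unit:            12
  SIMDs per CU:            2
  Wavefront Size:          32(0x20)
"""

mine = parse_rocminfo(SAMPLE)
for key in ("Name", "Compute Unit", "Max Clock Freq. (MHz)"):
    print(f"{key}: {mine[key]}")
```

With a second dump saved to a file (say `other_rocminfo.txt`), `diff_fields(mine, parse_rocminfo(open("other_rocminfo.txt").read()), ["Name", "Compute Unit", "Max Clock Freq. (MHz)"])` lists only the fields that differ.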

I like benchmarks. This is with Fedora Rawhide:

6.15.0-0.rc5.250509g9c69f8884904.47.fc43.x86_64

And the latest nightly from the nightly-tarball release of ROCm/TheRock on GitHub (very easy to install: just untar it into /opt/rocm):

# therock-dist-linux-gfx1151-6.5.0rc20250524.tar.gz
$ hipcc --version
HIP version: 6.5.25206-9a10e7a3b
AMD clang version 19.0.0git (https://github.com/ROCm/llvm-project.git 575153a10ced9bac27ef3f4878c60b84d345ca79)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/lib/llvm/bin

hgemm:

~/rocm_wmma_samples/build (main)$ hgemm/bench
2025-05-24T23:49:06-07:00
Running hgemm/bench
Run on (32 X 1976.87 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.88, 0.44, 0.21
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:1024,n:1024,k:1024}/manual_time                         1.36 ms         1.38 ms          513 TFLOPS=1.57794 bytes_per_second=4.30424Gi/s
{hgemm:kernel_type::shared,m:2048,n:2048,k:2048}/manual_time                         10.5 ms         10.6 ms           60 TFLOPS=1.62961 bytes_per_second=2.22205Gi/s
{hgemm:kernel_type::shared,m:4096,n:4096,k:4096}/manual_time                         82.3 ms         82.1 ms            9 TFLOPS=1.66984 bytes_per_second=1.13893Gi/s
{hgemm:kernel_type::shared,m:8192,n:8192,k:8192}/manual_time                          740 ms          738 ms            1 TFLOPS=1.48592 bytes_per_second=518.953Mi/s
{hgemm:kernel_type::wmma_naive,m:1024,n:1024,k:1024}/manual_time                     1.64 ms         1.66 ms          428 TFLOPS=1.31435 bytes_per_second=3.57998Gi/s
{hgemm:kernel_type::wmma_naive,m:2048,n:2048,k:2048}/manual_time                     6.26 ms         6.28 ms          110 TFLOPS=2.7443 bytes_per_second=3.74117Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:4096,k:4096}/manual_time                     51.2 ms         51.1 ms           14 TFLOPS=2.68645 bytes_per_second=1.83229Gi/s
{hgemm:kernel_type::wmma_naive,m:8192,n:8192,k:8192}/manual_time                      561 ms          560 ms            1 TFLOPS=1.95871 bytes_per_second=684.071Mi/s
{hgemm:kernel_type::wmma_shared,m:1024,n:1024,k:1024}/manual_time                   0.371 ms        0.396 ms         1869 TFLOPS=5.78994 bytes_per_second=15.7969Gi/s
{hgemm:kernel_type::wmma_shared,m:2048,n:2048,k:2048}/manual_time                    2.41 ms         2.43 ms          290 TFLOPS=7.12085 bytes_per_second=9.71453Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:4096,k:4096}/manual_time                    20.9 ms         20.9 ms           33 TFLOPS=6.58286 bytes_per_second=4.49022Gi/s
{hgemm:kernel_type::wmma_shared,m:8192,n:8192,k:8192}/manual_time                     203 ms          202 ms            3 TFLOPS=5.4283 bytes_per_second=1.85138Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:1024,n:1024,k:1024}/manual_time              0.291 ms        0.316 ms         2409 TFLOPS=7.38357 bytes_per_second=20.145Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:2048,n:2048,k:2048}/manual_time               1.32 ms         1.34 ms          533 TFLOPS=13.04 bytes_per_second=17.7884Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:4096,k:4096}/manual_time               11.7 ms         11.7 ms           62 TFLOPS=11.7352 bytes_per_second=8.00412Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:8192,n:8192,k:8192}/manual_time               92.0 ms         91.9 ms            8 TFLOPS=11.945 bytes_per_second=4.07392Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:1024,n:1024,k:1024}/manual_time          0.286 ms        0.311 ms         2453 TFLOPS=7.52221 bytes_per_second=20.5224Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:2048,n:2048,k:2048}/manual_time           1.28 ms         1.30 ms          551 TFLOPS=13.4672 bytes_per_second=18.3709Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:4096,k:4096}/manual_time           11.0 ms         11.0 ms           62 TFLOPS=12.5114 bytes_per_second=8.53349Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:8192,n:8192,k:8192}/manual_time           90.2 ms         90.1 ms            8 TFLOPS=12.1848 bytes_per_second=4.1557Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:1024,n:1024,k:1024}/manual_time          0.115 ms        0.140 ms         6103 TFLOPS=18.7299 bytes_per_second=51.0904Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:2048,n:2048,k:2048}/manual_time          0.628 ms        0.653 ms         1123 TFLOPS=27.3497 bytes_per_second=37.3098Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:4096,k:4096}/manual_time           4.68 ms         4.70 ms          149 TFLOPS=29.3445 bytes_per_second=20.0157Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:8192,n:8192,k:8192}/manual_time           38.8 ms         38.7 ms           18 TFLOPS=28.3713 bytes_per_second=9.67599Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:1024,n:1024,k:1024}/manual_time      0.116 ms        0.142 ms         6041 TFLOPS=18.5036 bytes_per_second=50.4696Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:2048,n:2048,k:2048}/manual_time      0.631 ms        0.656 ms         1133 TFLOPS=27.2413 bytes_per_second=37.1623Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:4096,k:4096}/manual_time       4.40 ms         4.42 ms          159 TFLOPS=31.2033 bytes_per_second=21.2841Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:8192,n:8192,k:8192}/manual_time       33.5 ms         33.4 ms           21 TFLOPS=32.847 bytes_per_second=11.2025Gi/s
{hgemm:kernel_type::wmma_prefetch,m:1024,n:1024,k:1024}/manual_time                 0.114 ms        0.140 ms         6153 TFLOPS=18.8627 bytes_per_second=51.455Gi/s
{hgemm:kernel_type::wmma_prefetch,m:2048,n:2048,k:2048}/manual_time                 0.670 ms        0.696 ms         1048 TFLOPS=25.6291 bytes_per_second=34.9635Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:4096,k:4096}/manual_time                  4.50 ms         4.51 ms          157 TFLOPS=30.5757 bytes_per_second=20.8559Gi/s
{hgemm:kernel_type::wmma_prefetch,m:8192,n:8192,k:8192}/manual_time                  35.7 ms         35.7 ms           20 TFLOPS=30.7818 bytes_per_second=10.4956Gi/s
{hgemm:kernel_type::wmma_opt_1,m:1024,n:1024,k:1024}/manual_time                    0.103 ms        0.128 ms         6809 TFLOPS=20.8863 bytes_per_second=56.9768Gi/s
{hgemm:kernel_type::wmma_opt_1,m:2048,n:2048,k:2048}/manual_time                    0.574 ms        0.599 ms         1243 TFLOPS=29.9068 bytes_per_second=40.7969Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:4096,k:4096}/manual_time                     3.69 ms         3.70 ms          190 TFLOPS=37.2862 bytes_per_second=25.4283Gi/s
{hgemm:kernel_type::wmma_opt_1,m:8192,n:8192,k:8192}/manual_time                     34.9 ms         34.8 ms           21 TFLOPS=31.5538 bytes_per_second=10.7575Gi/s
{hgemm:kernel_type::wmma_opt_2,m:1024,n:1024,k:1024}/manual_time                    0.134 ms        0.159 ms         5241 TFLOPS=16.0747 bytes_per_second=43.8587Gi/s
{hgemm:kernel_type::wmma_opt_2,m:2048,n:2048,k:2048}/manual_time                    0.500 ms        0.525 ms         1402 TFLOPS=34.3686 bytes_per_second=46.8865Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time                     3.43 ms         3.46 ms          204 TFLOPS=40.0346 bytes_per_second=27.3066Gi/s
{hgemm:kernel_type::wmma_opt_2,m:8192,n:8192,k:8192}/manual_time                     25.6 ms         25.6 ms           27 TFLOPS=42.9338 bytes_per_second=14.6387Gi/s
{hgemm:kernel_type::wmma_opt_3,m:1024,n:1024,k:1024}/manual_time                    0.148 ms        0.173 ms         4732 TFLOPS=14.5142 bytes_per_second=39.6011Gi/s
{hgemm:kernel_type::wmma_opt_3,m:2048,n:2048,k:2048}/manual_time                    0.551 ms        0.575 ms         1275 TFLOPS=31.188 bytes_per_second=42.5474Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:4096,k:4096}/manual_time                     3.71 ms         3.73 ms          189 TFLOPS=37.0698 bytes_per_second=25.2854Gi/s
{hgemm:kernel_type::wmma_opt_3,m:8192,n:8192,k:8192}/manual_time                     27.3 ms         27.3 ms           25 TFLOPS=40.2996 bytes_per_second=13.7412Gi/s
{hgemm:kernel_type::wmma_opt_4,m:1024,n:1024,k:1024}/manual_time                    0.132 ms        0.158 ms         5301 TFLOPS=16.2574 bytes_per_second=44.3565Gi/s
{hgemm:kernel_type::wmma_opt_4,m:2048,n:2048,k:2048}/manual_time                    0.492 ms        0.517 ms         1425 TFLOPS=34.8919 bytes_per_second=47.6001Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time                     3.40 ms         3.42 ms          205 TFLOPS=40.3903 bytes_per_second=27.5505Gi/s
{hgemm:kernel_type::wmma_opt_4,m:8192,n:8192,k:8192}/manual_time                     25.2 ms         25.2 ms           28 TFLOPS=43.6517 bytes_per_second=14.8837Gi/s
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time                       0.352 ms        0.379 ms         1943 TFLOPS=6.10924 bytes_per_second=16.6634Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time                        2.83 ms         2.85 ms          250 TFLOPS=6.07286 bytes_per_second=8.27458Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time                        13.8 ms         13.8 ms           49 TFLOPS=9.98742 bytes_per_second=6.78644Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time                         102 ms          102 ms            6 TFLOPS=10.754 bytes_per_second=3.6613Gi/s
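
As a cross-check on the table, the `TFLOPS` counter looks consistent with the usual 2·m·n·k operation count divided by the measured time. That convention is inferred from the numbers, not read from the benchmark source, so the sketch below is only a sanity check.

```python
def gemm_tflops(m, n, k, time_ms):
    """TFLOPS for one GEMM, counting 2*m*n*k floating-point operations."""
    return 2 * m * n * k / (time_ms * 1e-3) / 1e12


# wmma_opt_4 at 8192^3: 25.2 ms in the table, reported as TFLOPS=43.6517.
# The time column is rounded, so a small mismatch is expected.
print(f"recomputed: {gemm_tflops(8192, 8192, 8192, 25.2):.2f} TFLOPS")

# Gap between the best hand-written kernel and rocblas at the same size,
# using the two TFLOPS values reported above.
print(f"wmma_opt_4 vs rocblas: {43.6517 / 10.754:.1f}x")
```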

Well those rocblas numbers are sort of shit aren’t they? Let’s force hipblaslt:

$ ROCBLAS_USE_HIPBLASLT=1 hgemm/bench --benchmark_filter=rocblas
2025-05-25T00:03:24-07:00
Running hgemm/bench
Run on (32 X 1995.01 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.25, 0.23, 0.24
------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time      0.109 ms        0.135 ms         6420 TFLOPS=19.6586 bytes_per_second=53.6028Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time      0.600 ms        0.625 ms         1125 TFLOPS=28.6657 bytes_per_second=39.0436Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time       6.61 ms         6.62 ms          104 TFLOPS=20.7887 bytes_per_second=14.1747Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time        146 ms          145 ms            5 TFLOPS=7.57158 bytes_per_second=2.57652Gi/s

Better up to 4096, but 8192 actually gets worse… here’s the kicker though: make sure the gfx110x libs are installed, then override the GFX version:

$ HSA_OVERRIDE_GFX_VERSION=11.0.0 hgemm/bench --benchmark_filter=rocblas
2025-05-25T00:08:41-07:00
Running hgemm/bench
Run on (32 X 2020.2 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.29, 0.32, 0.28
------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time      0.063 ms        0.087 ms        11305 TFLOPS=34.3626 bytes_per_second=93.6554Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time      0.445 ms        0.472 ms         1577 TFLOPS=38.599 bytes_per_second=52.6245Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time       3.11 ms         3.14 ms          221 TFLOPS=44.1693 bytes_per_second=30.1057Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time       43.5 ms         43.5 ms           16 TFLOPS=25.2763 bytes_per_second=8.61588Gi/s

$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ROCBLAS_USE_HIPBLASLT=1 hgemm/bench --benchmark_filter=rocblas
2025-05-25T00:09:43-07:00
Running hgemm/bench
Run on (32 X 2325.63 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.17, 0.29, 0.27
------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time      0.109 ms        0.135 ms         6246 TFLOPS=19.6989 bytes_per_second=53.7266Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time      0.600 ms        0.625 ms         1120 TFLOPS=28.6679 bytes_per_second=39.0454Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time       9.44 ms         9.44 ms           74 TFLOPS=14.5921 bytes_per_second=9.9338Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time        175 ms          174 ms            4 TFLOPS=6.29111 bytes_per_second=2.14501Gi/s

The best performance comes from using the gfx1100 rocblas (not hipblaslt) kernels.
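For anyone comparing these runs by hand, here is a minimal sketch of the arithmetic. It assumes the benchmark counts 2*m*n*k floating-point operations per GEMM, which appears consistent with the times and TFLOPS printed above. The function name and dictionary labels are only illustrative; the numbers are copied from the 4096 rows.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """TFLOPS for one m x n x k GEMM, assuming 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12

# TFLOPS reported above for kernel_type::rocblas at m = n = k = 4096.
reported_4096 = {
    "default": 9.98742,
    "ROCBLAS_USE_HIPBLASLT=1": 20.7887,
    "HSA_OVERRIDE_GFX_VERSION=11.0.0": 44.1693,
    "both overrides": 14.5921,
}

best = max(reported_4096, key=reported_4096.get)
speedup = reported_4096[best] / reported_4096["default"]
print(f"best at 4096: {best} ({speedup:.1f}x over default)")

# Cross-check the reported figure against the rounded 3.11 ms time.
print(f"recomputed: {gemm_tflops(4096, 4096, 4096, 3.11):.1f} TFLOPS")
```

The recomputed value differs slightly from the reported one only because the printed time is rounded.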

Note, I filed an issue on this with previous tests and have added these as additional data points: [Issue]: gfx1151 rocBLAS/hipBLAS performance regression vs gfx1100 code path · Issue #4748 · ROCm/ROCm · GitHub

OK, I added some more optimizations to my llama.cpp kernel:

(No FP8 for now, it is WIP and needs more time.) I managed to make tensor loading faster :crossed_fingers:, but it may not be ideal for the MAX. @jasl123, if you can test this new branch, that would be nice.

@lhl Nice bench.
It looks like the APU optimization is not the best for this one, and the dGPU kernel config looks better… Without a MAX it is hard to try tuning for it…

If someone wants to try some tuning, they can start by playing with this line:

You can test with:

            if (M%(4*4*BLOC_M0)==0) { sgemm_wmma<4,2,2,4,4,BLOC_K1>(A,B,C, M,N,K); return true; }
            if (M%(2*4*BLOC_M0)==0) { sgemm_wmma<2,2,2,4,4,BLOC_K1>(A,B,C, M,N,K); return true; }

and maybe even:

            if (M%(4*4*BLOC_M0)==0) { sgemm_wmma<4,4,2,4,4,BLOC_K1>(A,B,C, M,N,K); return true; }
            if (M%(2*4*BLOC_M0)==0) { sgemm_wmma<2,2,2,4,4,BLOC_K1>(A,B,C, M,N,K); return true; }
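To make the dispatch logic in those lines easier to reason about, here is a rough Python model of it. This is only a sketch: the value `BLOC_M0 = 16` is an assumption for illustration (the real constant lives in the fork), and the meaning of the individual template arguments is not stated in this thread, so they are kept as opaque strings.

```python
BLOC_M0 = 16  # assumed value, for illustration only

# (multiplier applied to BLOC_M0, kernel instantiation), in the order they are tried.
CANDIDATES = [
    (4 * 4, "sgemm_wmma<4,2,2,4,4,BLOC_K1>"),
    (2 * 4, "sgemm_wmma<2,2,2,4,4,BLOC_K1>"),
]

def pick_kernel(M, bloc_m0=BLOC_M0):
    """Return the first instantiation whose M tile divides M evenly, else None."""
    for multiplier, kernel in CANDIDATES:
        if M % (multiplier * bloc_m0) == 0:
            return kernel
    return None  # no candidate matched; the caller would fall back to another path

for M in (4096, 128, 100):
    print(M, pick_kernel(M))
```

The order matters: the larger tile is tried first, so the smaller one is only used when M is not a multiple of the larger tile.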

My friend says AMD guys are working on the NPU backend, and I heard the NPU can access application memory (via PASID, which I’m not familiar with).

There is GitHub - ypapadop-amd/ggml at hsa-backend :crossed_fingers:
(but there may be others…)
It looks like ROCm 6.5 is on the way to …
And rocminfo (with a recent kernel/ROCm) shows the NPU/XDNA config…

Testing w/ modularai/Llama-3.1-8B-Instruct-GGUF/resolve/main/llama-3.1-8b-instruct-bf16.gguf

  • mmap=0 is ever so slightly faster
  • fa=1 is surprisingly faster, except at a few of the bigger pp sizes

I have some deadlines that I’m procrastinating on, but will revisit in a bit.

interestingly, the latest rocm nightly build seems to break upstream llama.cpp build but worked fine for your fork…

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp1 | 9.63 ± 0.03 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp2 | 18.58 ± 0.05 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp4 | 36.54 ± 0.41 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp8 | 62.41 ± 0.11 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp16 | 92.48 ± 0.28 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp32 | 134.98 ± 0.24 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp48 | 196.51 ± 0.16 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp64 | 261.79 ± 0.25 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp128 | 426.55 ± 1.70 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp192 | 514.44 ± 3.65 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp256 | 592.03 ± 1.80 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp384 | 580.35 ± 6.26 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp512 | 593.53 ± 12.58 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp768 | 509.45 ± 10.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp999 | 478.47 ± 6.25 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | pp1024 | 485.04 ± 6.94 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | 0 | tg16 | 9.64 ± 0.01 |

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp1 | 9.62 ± 0.04 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp2 | 18.57 ± 0.07 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp4 | 36.82 ± 0.05 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp8 | 62.48 ± 0.15 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp16 | 92.58 ± 0.22 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp32 | 134.56 ± 0.17 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp48 | 197.95 ± 0.89 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp64 | 263.04 ± 0.71 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp128 | 423.13 ± 3.29 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp192 | 514.92 ± 2.64 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp256 | 587.19 ± 2.41 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp384 | 576.67 ± 4.49 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp512 | 570.91 ± 19.80 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp768 | 506.06 ± 7.66 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp999 | 477.09 ± 7.61 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | pp1024 | 474.48 ± 14.23 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 1 | tg16 | 9.64 ± 0.02 |

| model | size | params | backend | ngl | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp1 | 9.63 ± 0.02 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp2 | 17.68 ± 0.06 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp4 | 34.93 ± 0.07 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp8 | 58.97 ± 0.16 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp16 | 88.25 ± 0.09 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp32 | 126.39 ± 0.08 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp48 | 187.11 ± 0.37 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp64 | 247.09 ± 1.75 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp128 | 398.49 ± 1.52 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp192 | 462.41 ± 2.99 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp256 | 541.48 ± 2.79 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp384 | 545.33 ± 4.12 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp512 | 576.00 ± 5.29 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp768 | 534.57 ± 2.54 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp999 | 530.55 ± 3.93 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | pp1024 | 540.28 ± 2.24 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | 0 | tg16 | 9.60 ± 0.02 |
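
To put numbers on the fa=1 observation, here is a small sketch that compares a subset of the pp results from the tables above. It assumes that the table without an fa column was run with flash attention off (llama-bench normally hides columns left at their defaults); the variable and function names are just for illustration.

```python
# t/s copied from the tables above (subset of pp sizes), both runs with mmap=0.
fa_on = {"pp128": 426.55, "pp256": 592.03, "pp512": 593.53,
         "pp768": 509.45, "pp999": 478.47, "pp1024": 485.04}
# Assumed to be fa=0: this table has no fa column.
fa_off = {"pp128": 398.49, "pp256": 541.48, "pp512": 576.00,
          "pp768": 534.57, "pp999": 530.55, "pp1024": 540.28}

def relative_gain(on, off):
    """Percent change in t/s when going from the `off` run to the `on` run."""
    return {test: (on[test] / off[test] - 1.0) * 100.0 for test in on}

gains = relative_gain(fa_on, fa_off)
slower_with_fa = [test for test, gain in gains.items() if gain < 0]

for test, gain in gains.items():
    print(f"{test:>7}: {gain:+.1f}%")
print("fa=1 is slower at:", slower_with_fa)
```

On this subset, fa=1 wins up to pp512 and loses from pp768 upward, although the error bars on some of the larger runs are wide enough that more repetitions would help.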

Which parts did you get for your oculink setup? I will probably be looking to connect my own 4090 as well.

Also, what are the issues that make you recommend against the X2?

Last but not least, can we please get a Passmark CPU benchmark at full wattage? There are some around, but it’s always unclear whether they come from a reduced-power laptop or from one of these boxes at 120+ W.

> Which parts did you get for your oculink setup? I will probably be looking to connect my own 4090 as well.

I’m currently in China, so buying what I need from Taobao is easy. I use a Minisforum DEG1 as the eGPU dock.

In my experience, Nvidia GPUs generally work without any issues.
However, I ran into several issues with my 9070 XT.

The first one is that the eGPU randomly powers off, even while in the BIOS setup screen.
To dig into the problem, I bought three M.2 to Oculink adapters from different shops.


The right two have no issues with either Nvidia or AMD GPUs, but the left one has trouble with AMD GPUs.
By the way, the rightmost one is the best: I get 7.23 GB/s of bandwidth, which is very close to the roughly 8 GB/s that PCIe 4.0 x4 can offer.
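
For reference, a quick sketch of where the roughly 8 GB/s figure comes from. It uses the standard PCIe 4.0 numbers (16 GT/s per lane, 128b/130b encoding) and ignores packet and protocol overhead, so the real achievable ceiling is somewhat lower than this; the function name is just illustrative.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int,
                        payload_bits: int = 128, encoded_bits: int = 130) -> float:
    """Raw link bandwidth in GB/s after line encoding, before protocol overhead."""
    return gt_per_s * lanes * payload_bits / encoded_bits / 8

limit = pcie_bandwidth_gb_s(16.0, 4)   # PCIe 4.0 x4, as used by Oculink here
measured = 7.23                        # GB/s, from the post above
print(f"link limit: {limit:.2f} GB/s, measured: {measured / limit:.0%} of it")
```

So the best adapter is already within about ten percent of what the link can carry.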

With the 9070 XT there is also a mysterious problem. Although the GPU’s TBP is 303 W, it is pinned to 120 W in software, which appears to match the 8060S TDP, and there is no way to change it. The AMD software allows at most a 10% power limit increase.
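
If someone wants to check this on Linux, here is a minimal sketch that reads the power cap values the amdgpu driver exposes through hwmon (`power1_cap`, `power1_cap_max`, `power1_cap_min`, reported in microwatts). The glob pattern and function names are my own choices and not every card or driver exposes all three files, so treat it as a starting point rather than a definitive tool.

```python
from pathlib import Path

CAP_FILES = ("power1_cap", "power1_cap_max", "power1_cap_min")

def read_power_caps(hwmon_dir):
    """Return the power cap files found in one hwmon directory, converted to watts."""
    caps = {}
    for name in CAP_FILES:
        path = Path(hwmon_dir) / name
        if path.exists():
            caps[name] = int(path.read_text().strip()) / 1_000_000  # microwatts -> watts
    return caps

def find_gpu_hwmon(root="/sys/class/drm"):
    """List hwmon directories that hang off DRM card devices (empty if none exist)."""
    return sorted(Path(root).glob("card*/device/hwmon/hwmon*"))

for hwmon in find_gpu_hwmon():
    print(hwmon, read_power_caps(hwmon))
```

If `power1_cap_max` itself reads around 132 W on the 9070 XT, the limit would be coming from the driver or firmware rather than from the overclocking UI.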

That is why I am posting this to ask the Framework team to help confirm that this issue doesn’t occur on the upcoming Framework Desktop.

> Also, what are the issues that make you recommend against the X2?

  • Bad enclosure design
  • Only exposes two PCIe x4 links (there should be three)
  • Buggy BIOS (I’ve reported several issues)
  • Compatibility issues with Linux (I haven’t had any myself, but someone told me about random glitches on Arch Linux)
  • Noisy fans

AFAIK, the HP ZBook Ultra G1a is the best STXH (Strix Halo) laptop for now; everything works out of the box.
So I assume the upcoming Z2 Mini G1a will be great as well. I may buy one to serve as a home CI/CD server and sell my EVO-X2.

> can we please get a Passmark CPU benchmark at full wattage

I can probably do it tomorrow; I’ll post here once it’s done.