In my kernel I only compute the MatMul, and I only use pure HIP (no other ROCm libs …)
Yes, that's strange.
Can you test with BF16 (without FA) in the KV cache: -ctk bf16 -ctv bf16?
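For illustration, a pure-HIP matmul of that shape (a hand-written GEMM, no rocBLAS or hipBLASLt) might look like the minimal naive sketch below; the real kernel is presumably tiled and vectorized, so treat this only as a shape reference:

```cpp
// Minimal naive HIP GEMM sketch (illustrative only, not the actual kernel).
// C[M,N] = A[M,K] * B[K,N], row-major, one thread per output element.
#include <hip/hip_runtime.h>

__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Launch: 16x16 threads per block, grid covering the C matrix.
//   dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```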
Thanks for the info! Looking forward to that CPU benchmark.
Regarding the 3x PCIe x4, I think only the Framework exposes all three of them, no? As I understand it, the Z2 has 2x M.2 and then uses the other x4 for its I/O blocks. 10 GbE is available, but that x4 is otherwise still lost. No details yet on the Thermalrights.
One m.2 slot and one x8 slot would’ve made me so happy, but apparently the APU doesn’t support it.
I don't know why it is not supported. It looks like this CPU has 16 PCIe lanes, all usable, so it looks doable. (It seems 4 are used for additional devices … sound? network? …)
Maybe a basic motherboard with 1 NVMe + 1 PCIe x8 slot + 1 PCIe x4 slot and nothing else could be interesting too …
In some thread somewhere, one of the motherboard engineers (unverified) shared that the APU itself provides 4x x4; it's not 1x x16 that is then bifurcated further. So AMD simply doesn't provide that option for these chips.
I've not seen this officially confirmed anywhere else, though. But seeing as every device released or scheduled so far only exposes the lanes as blocks of x4 in various forms, I'm inclined to believe it until someone proves otherwise.
Here you go
I'm not sure this wattage reading is right, because I saw the whole PC draw less than 20 W (about 13 W most of the time) during the CPU test. It may be plausible if this was a single-core test (?), and I'm using a dGPU, so the iGPU is not in use.
And I don't think the Memory Mark rate looks right either.
AFAIK, the weakness of AI Max memory is high latency, and the CPU side's read performance is less than we thought (but still powerful), only about 120 GB/s (because each CCX reads memory at 32 B/cycle, which is about 60 GB/s at 2 GHz FCLK).
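To spell out the arithmetic behind that figure (assuming two CCXs, which is what Strix Halo has): 32 B/cycle × 2.0 GHz FCLK ≈ 64 GB/s per CCX, so roughly 128 GB/s with both CCXs reading, in line with the ~120 GB/s observed. The 256-bit LPDDR5X-8000 interface itself is good for 256/8 × 8000 MT/s = 256 GB/s, so the CPU side can only ever see about half of it.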
UPDATE: Just checked the top list, fascinating!
It's pre-allocated as three x4 channels in the SoC.
Considering the benchmark, I think that's a purely commercial decision.
If it had PCIe 4.0 x8, this would be the best-value CPU on the market.
Thanks for the benchmarks! I actually think your wattage reading is probably incorrect, because those results are in line with what one would expect, and they are definitely multithreaded.
It puts this CPU between the 9900x and the 9950x, which is also somewhat expected. I had hoped it would be closer to the 9950x, but here we are. Not a bad result in any case - my outgoing ThreadRipper 2950x (also 16-core) scores less than half!
The Memory Mark rate needs further review indeed. For comparison, my quad-channel TR hits 2059, while I've found a benchmark for the 9950x with 96 GB @ 6000 hitting 4256. I wonder if we'll see significantly different speeds (and particularly latencies) from different manufacturers… or are they all using the same RAM? One needs to keep in mind 128 GB is not attainable at these speeds on a 9950x (yet).
OK, so -ctk bf16 -ctv bf16 w/o FA is actually a bit faster than the fastest run:
(base) 130 lhl@cluster4:~/llama.cpp/llama.cpp-djip007-igpu-fp8 (feature/igpu_fq8)$ build/igpu/bin/llama-bench -mmp 0 -ctk bf16 -ctv bf16 -m llama-3.1-8b-instruct-bf16.gguf -p 1,2,4,8,16,32,48,64,128,192,256,384,512,768,999,1024 -n 16
ggml-igpu: backend[IGPU] create
ggml-igpu: device[IGPU<0>::0] added: AMD Radeon Graphics (gfx1151)
| model | size | params | backend | ngl | type_k | type_v | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | ---: | --------------: | -------------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp1 | 9.60 ± 0.02 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp2 | 17.68 ± 0.02 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp4 | 34.93 ± 0.06 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp8 | 58.94 ± 0.11 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp16 | 88.10 ± 0.13 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp32 | 124.53 ± 0.08 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp48 | 184.28 ± 0.47 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp64 | 245.85 ± 0.57 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp128 | 394.83 ± 0.56 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp192 | 455.77 ± 1.26 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp256 | 534.55 ± 1.31 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp384 | 563.79 ± 4.93 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp512 | 607.44 ± 3.72 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp768 | 563.83 ± 2.46 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp999 | 562.24 ± 3.79 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp1024 | 569.28 ± 4.39 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | tg16 | 9.59 ± 0.01 |
With -fa 1 it's a mess:
(base) lhl@cluster4:~/llama.cpp/llama.cpp-djip007-igpu-fp8 (feature/igpu_fq8)$ build/igpu/bin/llama-bench -mmp 0 -ctk bf16 -ctv bf16 -m llama-3.1-8b-instruct-bf16.gguf -p 1,2,4,8,16,32,48,64,128,192,256,384,512,768,999,1024 -n 16 -fa 1
ggml-igpu: backend[IGPU] create
ggml-igpu: device[IGPU<0>::0] added: AMD Radeon Graphics (gfx1151)
| model | size | params | backend | ngl | type_k | type_v | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp1 | 9.64 ± 0.03 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp2 | 18.47 ± 0.38 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp4 | 36.94 ± 0.09 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp8 | 62.55 ± 0.22 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp16 | 92.45 ± 0.17 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp32 | 134.28 ± 0.39 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp48 | 194.18 ± 0.90 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp64 | 253.63 ± 2.35 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp128 | 349.06 ± 10.39 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp192 | 386.08 ± 11.69 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp256 | 416.50 ± 14.13 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp384 | 298.40 ± 10.22 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp512 | 304.36 ± 10.75 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp768 | 220.98 ± 4.29 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp999 | 182.81 ± 1.93 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp1024 | 180.03 ± 2.89 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | tg16 | 9.63 ± 0.02 |
build: 0a097a84 (5432)
If you want to use a dGPU, for me the Ryzen Fire Range is better. It has the same 16 Zen 5 cores, can have 3D V-Cache, low power, and 24 PCIe 5.0 lanes (not 16 PCIe 4.0 …) …
PCIe x4 is enough for AI, even for gaming.
I want the four-channel memory, which could greatly accelerate things like compiling, just like Apple does.
And the efficiency is impressive for 24/7 operation.
Nice.
So I need to have a look at the QKV op for more speed. I'm not sure if I have to implement flash attention or simply add matmul for non-weight ops …
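For context, here is what "matmul for non-weight ops" would mean in practice: without flash attention, the attention block is just two activation-by-activation matmuls around a softmax (S = scale * Q*K^T, P = softmax(S), O = P*V), so an existing matmul kernel could cover it if extended to accept non-weight operands. A rough scalar sketch of the math (illustrative only, not ggml's actual implementation):

```cpp
// Naive single-head attention as two matmuls plus a softmax (sketch only).
// Shapes (row-major): Q[n,d], K[m,d], V[m,d], O[n,d]; caller sizes O to n*d.
#include <algorithm>
#include <cmath>
#include <vector>

void attention_naive(const std::vector<float>& Q, const std::vector<float>& K,
                     const std::vector<float>& V, std::vector<float>& O,
                     int n, int m, int d) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> S(m);  // one row of scores at a time
    for (int i = 0; i < n; ++i) {
        // matmul 1 (non-weight): scores of query i against all keys, scaled
        float maxv = -1e30f;
        for (int j = 0; j < m; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i*d + k] * K[j*d + k];
            S[j] = s * scale;
            maxv = std::max(maxv, S[j]);
        }
        // numerically stabilized softmax over the row
        float sum = 0.0f;
        for (int j = 0; j < m; ++j) { S[j] = std::exp(S[j] - maxv); sum += S[j]; }
        for (int j = 0; j < m; ++j) S[j] /= sum;
        // matmul 2 (non-weight): probability-weighted sum of values
        for (int k = 0; k < d; ++k) {
            float o = 0.0f;
            for (int j = 0; j < m; ++j) o += S[j] * V[j*d + k];
            O[i*d + k] = o;
        }
    }
}
```

Flash attention fuses these three steps into one pass so S and P never hit memory; the simpler alternative is just routing the two matmuls above through the existing kernel.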
Well … yes, it would be nice to have 4 RAM channels … But even if we get it, some benchmarks may show that only the GPU can use it at full speed; the CPU looks to have some other limit that gets only ~120 GB/s from RAM and not the full 256 GB/s …
I hope I am wrong and it is only some NUMA config. (For example, I can get ~80 GB/s with the 8-core 7940HS, so …)
Yeah, it's below our expectations … I posted this in my reply above.
AFAIK, the weakness of AI Max memory is high latency, and the CPU side's read performance is less than we thought (but still powerful), only about 120 GB/s (because each CCX reads memory at 32 B/cycle, which is about 60 GB/s at 2 GHz FCLK).
However, the memory-bandwidth-sensitive benchmarks are far better than the 9950X's.
David Huang also mentioned that compile speed (for llama.cpp) is blazing fast, even under 60 watts. Same here.
If we can bench llama.cpp with BF16 and CPU only … we may have a clearer idea of what we can get …
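(Something like `build/bin/llama-bench -m llama-3.1-8b-instruct-bf16.gguf -ngl 0 -t 16 -p 512 -n 32` should give that: `-ngl 0` keeps all layers on the CPU, and the other flags are just a guess at a reasonable run. Token generation there should land near memory bandwidth divided by model size, so roughly 120/15 ≈ 8 t/s if the CPU really tops out around 120 GB/s.)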
The usable bandwidth being about half the listed figure was to be expected; this is more or less true for all AMD CPUs except the ThreadRipper PRO and high-core-count EPYCs.
I’d think the high latency (~1.5x) would influence compile times more than bandwidth, but apparently not?
Now what I would really like to see is a 128 GB Halo vs a 128 GB+ 9950x, as there the 9950x is either using slower RAM or 4 sticks, but so far I've been unable to find any benchmarks on the latter.
I’d think the high latency (~1.5x) would influence compile times more than bandwidth, but apparently not?
Latency does not affect compile time, but it would affect gaming.
Compiling llama.cpp
cmake -B build -S . -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100" -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_HIP_GRAPHS=ON && time cmake --build build --config Release -j$(nproc)
STXH @ 60 W: 1m47s
9950X @ 120 W: 1m50s
STXH @ 40 W: 2m07s
A 60 W STXH can beat a 120 W 9950X.
You can also check Apple M Pro / Max chips as references.
Thanks for those results!
What is your definition of a bigger context? Just curious.
As Desktops ship, I am curious to hear about any integration with eGPUs and the benefits of these relative to clustering.
And, as an aside, whether FW might be considering an add-on for the Desktop to facilitate an eGPU (kind of like this https://frame.work/au/en/products/16-graphics-module-amd-radeon-rx-7700s?v=FRAKMB0003 although I'm emphasizing the "e", not the "d" or "i", in GPU).
I'm curious about this too; it's probably the only thing keeping me from buying one. If I can't put in a GPU, its future is e-waste sooner than it should be, as it can't even compete against my desktop.
It doesn't look like the AI Max+ 395 iGPU can even handle 2x 4K monitors and 2x 2K monitors at 144 fps for normal 2025 desktop usage.