In my kernel I only compute the MatMul, and I only use pure HIP (no other ROCm libs …)
Yes, that's strange.
Can you test with BF16 (without FA) in the KV cache: -ctk bf16 -ctv bf16?
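For illustration, a pure-HIP matmul of that shape (a hand-written GEMM, no rocBLAS or hipBLASLt) might look like the minimal naive sketch below; the real kernel is presumably tiled and vectorized, so treat this only as a shape reference:

```cpp
// Minimal naive HIP GEMM sketch (illustrative only, not the actual kernel).
// C[M,N] = A[M,K] * B[K,N], row-major, one thread per output element.
#include <hip/hip_runtime.h>

__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Launch: 16x16 threads per block, grid covering the C matrix.
//   dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```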
Thanks for the info! Looking forward to that CPU benchmark.
Regarding the 3x PCIe x4, I think only the Framework exposes all three of them, no? As I understand it, the Z2 has 2x M.2 and then uses the other x4 for its I/O blocks. 10 GbE is available, but that x4 is otherwise still lost. No details yet on the Thermalrights.
One m.2 slot and one x8 slot would’ve made me so happy, but apparently the APU doesn’t support it.
I don't know why it is not supported. It looks like this CPU has 16 PCIe lanes, all usable, so it looks doable. (It seems 4 are used for additional devices … sound? network? …)
Maybe a basic motherboard with 1 NVMe + 1 PCIe x8 slot + 1 PCIe x4 slot and nothing else could be interesting too …
In some thread somewhere, one of the motherboard engineers (unverified) shared that the APU itself provides 4x x4; it's not 1x x16 that is then bifurcated further. So AMD simply doesn't provide that option for these chips.
I've not seen this officially confirmed anywhere else, though. But seeing as every device released or scheduled so far only exposes the lanes as blocks of x4 in various forms, I'm inclined to believe it until someone proves otherwise.
Here you go
I'm not sure this wattage reading is right, because I saw the whole PC draw less than 20 W (about 13 W most of the time) during the CPU test. It may be plausible if this was a single-core test (?), and I'm using a dGPU, so the iGPU is not in use.
And I don't think the Memory Mark rate looks right either.
AFAIK, the weakness of AI Max memory is high latency, and the CPU side's read performance is less than we thought (but still powerful), only about 120 GB/s (because each CCX reads memory at 32 B/cycle, which is about 60 GB/s at 2 GHz FCLK).
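To spell out the arithmetic behind that figure (assuming two CCXs, which is what Strix Halo has): 32 B/cycle × 2.0 GHz FCLK ≈ 64 GB/s per CCX, so roughly 128 GB/s with both CCXs reading, in line with the ~120 GB/s observed. The 256-bit LPDDR5X-8000 interface itself is good for 256/8 × 8000 MT/s = 256 GB/s, so the CPU side can only ever see about half of it.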
UPDATE: Just checked the top list, fascinating!
It's pre-allocated as three x4 channels in the SoC.
Considering the benchmark, I think that's a purely commercial decision.
If it had PCIe 4.0 x8, this would be the best-value CPU on the market.
Thanks for the benchmarks! I actually think your wattage reading is probably incorrect, because those results are in line with what one would expect, and they are definitely multithreaded.
It puts this CPU between the 9900x and the 9950x, which is also somewhat expected. I had hoped it would be closer to the 9950x, but here we are. Not a bad result in any case - my outgoing ThreadRipper 2950x (also 16-core) scores less than half!
The Memory Mark rate needs further review indeed. For comparison, my quad-channel TR hits 2059, while I've found a benchmark for the 9950x with 96 GB @ 6000 hitting 4256. I wonder if we'll see significantly different speeds (and particularly latencies) from different manufacturers… or are they all using the same RAM? One needs to keep in mind 128 GB is not attainable at these speeds on a 9950x (yet).
OK, so -ctk bf16 -ctv bf16 w/o FA is actually a bit faster than the fastest run:
(base) 130 lhl@cluster4:~/llama.cpp/llama.cpp-djip007-igpu-fp8 (feature/igpu_fq8)$ build/igpu/bin/llama-bench -mmp 0 -ctk bf16 -ctv bf16 -m llama-3.1-8b-instruct-bf16.gguf -p 1,2,4,8,16,32,48,64,128,192,256,384,512,768,999,1024 -n 16
ggml-igpu: backend[IGPU] create
ggml-igpu: device[IGPU<0>::0] added: AMD Radeon Graphics (gfx1151)
| model | size | params | backend | ngl | type_k | type_v | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | ---: | --------------: | -------------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp1 | 9.60 ± 0.02 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp2 | 17.68 ± 0.02 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp4 | 34.93 ± 0.06 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp8 | 58.94 ± 0.11 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp16 | 88.10 ± 0.13 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp32 | 124.53 ± 0.08 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp48 | 184.28 ± 0.47 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp64 | 245.85 ± 0.57 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp128 | 394.83 ± 0.56 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp192 | 455.77 ± 1.26 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp256 | 534.55 ± 1.31 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp384 | 563.79 ± 4.93 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp512 | 607.44 ± 3.72 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp768 | 563.83 ± 2.46 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp999 | 562.24 ± 3.79 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | pp1024 | 569.28 ± 4.39 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 0 | tg16 | 9.59 ± 0.01 |
With -fa 1 it's a mess:
(base) lhl@cluster4:~/llama.cpp/llama.cpp-djip007-igpu-fp8 (feature/igpu_fq8)$ build/igpu/bin/llama-bench -mmp 0 -ctk bf16 -ctv bf16 -m llama-3.1-8b-instruct-bf16.gguf -p 1,2,4,8,16,32,48,64,128,192,256,384,512,768,999,1024 -n 16 -fa 1
ggml-igpu: backend[IGPU] create
ggml-igpu: device[IGPU<0>::0] added: AMD Radeon Graphics (gfx1151)
| model | size | params | backend | ngl | type_k | type_v | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp1 | 9.64 ± 0.03 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp2 | 18.47 ± 0.38 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp4 | 36.94 ± 0.09 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp8 | 62.55 ± 0.22 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp16 | 92.45 ± 0.17 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp32 | 134.28 ± 0.39 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp48 | 194.18 ± 0.90 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp64 | 253.63 ± 2.35 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp128 | 349.06 ± 10.39 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp192 | 386.08 ± 11.69 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp256 | 416.50 ± 14.13 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp384 | 298.40 ± 10.22 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp512 | 304.36 ± 10.75 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp768 | 220.98 ± 4.29 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp999 | 182.81 ± 1.93 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | pp1024 | 180.03 ± 2.89 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | IGPU | 99 | bf16 | bf16 | 1 | 0 | tg16 | 9.63 ± 0.02 |
build: 0a097a84 (5432)
If you want to use a dGPU, for me the Ryzen Fire Range is better. It has the same 16 Zen 5 cores, can have 3D V-Cache, low power, and 24 PCIe 5.0 lanes (not 16 PCIe 4.0 …) …
PCIe x4 is enough for AI, even for gaming.
I want the four-channel memory, which could greatly accelerate things like compiling, just like Apple does.
And the efficiency is impressive for 24/7 operation.
Nice.
So I need to have a look at the QKV op for more speed. I'm not sure if I have to implement flash attention or simply add matmul for non-weight ops …
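For context, here is what "matmul for non-weight ops" would mean in practice: without flash attention, the attention block is just two activation-by-activation matmuls around a softmax (S = scale * Q*K^T, P = softmax(S), O = P*V), so an existing matmul kernel could cover it if extended to accept non-weight operands. A rough scalar sketch of the math (illustrative only, not ggml's actual implementation):

```cpp
// Naive single-head attention as two matmuls plus a softmax (sketch only).
// Shapes (row-major): Q[n,d], K[m,d], V[m,d], O[n,d]; caller sizes O to n*d.
#include <algorithm>
#include <cmath>
#include <vector>

void attention_naive(const std::vector<float>& Q, const std::vector<float>& K,
                     const std::vector<float>& V, std::vector<float>& O,
                     int n, int m, int d) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> S(m);  // one row of scores at a time
    for (int i = 0; i < n; ++i) {
        // matmul 1 (non-weight): scores of query i against all keys, scaled
        float maxv = -1e30f;
        for (int j = 0; j < m; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i*d + k] * K[j*d + k];
            S[j] = s * scale;
            maxv = std::max(maxv, S[j]);
        }
        // numerically stabilized softmax over the row
        float sum = 0.0f;
        for (int j = 0; j < m; ++j) { S[j] = std::exp(S[j] - maxv); sum += S[j]; }
        for (int j = 0; j < m; ++j) S[j] /= sum;
        // matmul 2 (non-weight): probability-weighted sum of values
        for (int k = 0; k < d; ++k) {
            float o = 0.0f;
            for (int j = 0; j < m; ++j) o += S[j] * V[j*d + k];
            O[i*d + k] = o;
        }
    }
}
```

Flash attention fuses these three steps into one pass so S and P never hit memory; the simpler alternative is just routing the two matmuls above through the existing kernel.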
Well … yes, it would be nice to have 4 RAM channels … But even if we get it, some benchmarks may show that only the GPU can use it at full speed; the CPU looks to have some other limit that gets only ~120 GB/s from RAM and not the full 256 GB/s …
I hope I am wrong and it is only some NUMA config. (For example, I can get ~80 GB/s with the 8-core 7940HS, so …)
Yeah, it's below our expectations … I posted this in my reply above.
AFAIK, the weakness of AI Max memory is high latency, and the CPU side's read performance is less than we thought (but still powerful), only about 120 GB/s (because each CCX reads memory at 32 B/cycle, which is about 60 GB/s at 2 GHz FCLK).
However, the memory-bandwidth-sensitive benchmarks are far better than the 9950X's.
David Huang also mentioned that compile speed (for llama.cpp) is blazing fast, even under 60 watts. Same here.
If we can bench llama.cpp with BF16 and CPU only … we may have a clearer idea of what we can get …
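(Something like `build/bin/llama-bench -m llama-3.1-8b-instruct-bf16.gguf -ngl 0 -t 16 -p 512 -n 32` should give that: `-ngl 0` keeps all layers on the CPU, and the other flags are just a guess at a reasonable run. Token generation there should land near memory bandwidth divided by model size, so roughly 120/15 ≈ 8 t/s if the CPU really tops out around 120 GB/s.)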
The usable bandwidth being about half the listed figure was to be expected; this is more or less true for all AMD CPUs except the ThreadRipper PRO and high-core-count EPYCs.
I’d think the high latency (~1.5x) would influence compile times more than bandwidth, but apparently not?
Now what I would really like to see is a 128 GB Halo vs a 128 GB+ 9950x, as there the 9950x is either using slower RAM or 4 sticks, but so far I've been unable to find any benchmarks on the latter.
I’d think the high latency (~1.5x) would influence compile times more than bandwidth, but apparently not?
Latency does not affect compile time, but it would affect gaming.
Compiling llama.cpp
cmake -B build -S . -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100" -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_HIP_GRAPHS=ON && time cmake --build build --config Release -j$(nproc)
STXH @ 60 W: 1m47s
9950X @ 120 W: 1m50s
STXH @ 40 W: 2m07s
A 60 W STXH can beat a 120 W 9950X.
You can also check Apple M Pro / Max chips as references.
Thanks for those results!
What is your definition of a bigger context? Just curious.
As Desktops ship, I am curious to hear about any integration with eGPUs and the benefits of these relative to clustering.
And, as an aside, whether FW might be considering an add-on for the Desktop to facilitate an eGPU (kind of like this https://frame.work/au/en/products/16-graphics-module-amd-radeon-rx-7700s?v=FRAKMB0003 although I'm emphasizing the "e", not the "d" or "i", in GPU).
I'm curious about this too; it's probably the only thing keeping me from buying one. If I can't put in a GPU, its future is e-waste sooner than it should be, as it can't even compete against my desktop.
It doesn't look like the AI Max+ 395 iGPU can even handle 2x 4K monitors and 2x 2K monitors at 144 fps for normal 2025 desktop usage.