Now that we have some real-world tests on the Strix Halo in inference engines like llama.cpp thanks to the release of the Asus Z13, can the Framework team give us some hard numbers on the Desktop?
I understand that support for this chipset, and ROCm support in general, is still developing. However, now that we have numbers for a tablet, I'd love to see some numbers for what that giant heatsink and higher TDP are capable of.
AI inference is the entire reason I'm interested in this device.
Mobile CPUs are very dependent on the manufacturer's configuration, to the point that they cannot really be compared device-to-device. The Z13 has a TDP of 40-80W. Framework Desktop is, to my knowledge, the only device using this chip that allows the full 120-240W power draw. Performance will likely be very different.
Tokens/s is mostly proportional to memory bandwidth, and Strix Halo has about 3x the bandwidth of current Framework AMD boards (128-bit x 5200 MT/s vs 256-bit x 8000 MT/s). That also matches the Qwen numbers from the screenshot above.
Prompt processing is said to be compute-bound; compared to the 7940 and similar chips it's 12 CU vs 40 CU, with each CU probably a bit faster at the same watts per CU. I'd expect prompt processing to be 3-5 times faster.
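A quick back-of-envelope in Python for both points, using the figures above (peak theoretical numbers; sustained bandwidth and real-world scaling will be lower):

```python
# Token generation: roughly bandwidth-bound.
# Peak bandwidth = bus width in bytes * transfer rate (MT/s).
def peak_bw_gbs(bus_bits: int, mts: int) -> float:
    return (bus_bits / 8) * mts / 1000

current = peak_bw_gbs(128, 5200)   # current Framework AMD boards -> ~83 GB/s
strix   = peak_bw_gbs(256, 8000)   # Strix Halo                   -> ~256 GB/s
print(f"token-gen ratio ~{strix / current:.1f}x")   # ~3.1x

# Prompt processing: roughly compute-bound, so scale by CU count
# times an assumed 1.0-1.5x per-CU speedup.
cu_ratio = 40 / 12
print(f"prompt-processing ~{cu_ratio:.1f}x to ~{cu_ratio * 1.5:.1f}x")   # ~3.3x to ~5.0x
```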
I feel like that was self-evident from my post lol, I know it will be different. We can assume they're going to be at least as fast as the numbers I posted above, almost certainly faster.
What I want are actual numbers, which we know they have, as they've spoken multiple times about running a 70B at "conversational speed".
My suspicion (and the reason I haven't pre-ordered) is that the limited bandwidth is going to render the capacity pointless for inference - yes, you'll be able to load 70B Q4 models in memory, with a ton of context even, but the numbers I've seen from Apple's Max chips with more than 2x the bandwidth make me think 256GB/s is going to be unusably slow for >40GB models.
I'd love to be proven wrong, but I guess nobody has a 128GB machine in the wild to test with yet.
The number I'm looking for is a 70B model (40 GiB in size) using a 24 GiB GPU to hold more than half the model. It seems that model speed is constrained by the slowest component, and that leaves 16 GiB running from system RAM on the motherboard. 256 GiB/s divided by 16 GiB of spillover gives 16 tokens/sec or so. That isn't too bad in my book.
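A minimal sketch of that estimate, with the same assumed figures (a hypothetical 24 GiB GPU, peak rather than sustained bandwidth):

```python
# Per token, the weights left in system RAM must be streamed once, so
# RAM bandwidth / spillover size gives a rough upper bound on tokens/s.
model_gib = 40                     # ~70B at Q4
vram_gib  = 24                     # hypothetical discrete GPU
spill_gib = model_gib - vram_gib   # 16 GiB left in system RAM
ram_bw    = 256                    # GiB/s, Strix Halo peak

print(f"~{ram_bw / spill_gib:.0f} tok/s upper bound")   # ~16 tok/s
```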
For now with llama.cpp (and there's no choice about this for Q4_K_M), tensors that are not offloaded are copied to the GPU at runtime; in that case the speed for this part is that of the PCIe bus (x8/x4?).
With 24GB, using 20GB for the weights is already tight (you need some more for the KV cache / compute tensors), so no more than half of the layers will be completely offloaded.
And then, for each token, you need the time spent on the GPU plus the time needed for the non-offloaded part; they can't run in parallel.
256GiB/s is a MAX when using it as a GPU; don't add a discrete GPU expecting that case to be faster (the link to the GPU is only PCIe x4 …)
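To illustrate the serial-timing point, a small sketch with made-up figures (the VRAM bandwidth and the 20/20 split are purely illustrative assumptions):

```python
# Each token needs the offloaded layers (read from VRAM) *and* the
# non-offloaded layers (read from system RAM); the two times add up.
gpu_gib, gpu_bw = 20, 1000   # weights in VRAM, assumed VRAM bandwidth (GiB/s)
ram_gib, ram_bw = 20, 256    # weights left in system RAM, Strix Halo peak

t_per_token = gpu_gib / gpu_bw + ram_gib / ram_bw
print(f"~{1 / t_per_token:.0f} tok/s")   # ~10 tok/s, below the ~16 tok/s estimate above
```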
I was initially quite excited about inference on the new FW desktop - but after all the struggles I've had with ROCm on the FW16 (complicated by NixOS) I thought I would wait and see. I am very encouraged by the work going on here.
You can use the Vulkan backend (LM Studio / llama.cpp); it's simpler to use than ROCm (ROCm should look better with the upcoming Fedora 42). With kernel >6.10 all of the GTT (i.e. 1/2 of the RAM by default) can be used, so there's no problem running Q8/Q6 for 8B/12B models with 32GB of RAM, or 24B with 64GB …
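To sanity-check those sizes, a small sketch (the bytes-per-parameter figures are rough approximations, not exact GGUF file sizes):

```python
# By default amdgpu exposes roughly half of system RAM as GTT, which the
# Vulkan backend can allocate from.
BYTES_PER_PARAM = {"Q8_0": 1.06, "Q6_K": 0.82}   # approximate

def fits(params_b: float, quant: str, ram_gb: int, headroom_gb: float = 2.0) -> bool:
    gtt_gb = ram_gb / 2   # default GTT limit
    return params_b * BYTES_PER_PARAM[quant] + headroom_gb < gtt_gb

print(fits(12, "Q6_K", 32))   # True: 12B Q6 on a 32 GB machine
print(fits(24, "Q8_0", 64))   # True: 24B Q8 on a 64 GB machine
```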
But even with the Vulkan backend I get some crashes with big models from time to time.
I'm working on a special backend for llama.cpp with no limit other than RAM, but for now only with BF16/FP16 weights; I need more time to add FP8. I have some hope it will deliver good performance.