LLM Performance

Now that we have some real-world tests on the Strix Halo in inference engines like llama.cpp thanks to the release of the Asus Z13, can the Framework team give us some hard numbers on the Desktop?

I understand that support for this chipset, and ROCm support in general, is still developing. However, now that we have numbers for a tablet, I’d love to see some numbers for what that giant heatsink and higher TDP are capable of.

AI inference is the entire reason I’m interested in this device

2 Likes

Some numbers from the review done on the Z13

Mobile CPUs are very dependent on the manufacturer’s configuration, to the point that they cannot really be compared device-to-device. The Z13 has a TDP of 40-80W. The Framework Desktop is, to my knowledge, the only device using this chip that allows the full 120-240W power draw. Performance will likely be very different.

Tokens/s is mostly proportional to memory bandwidth, and Strix Halo has about 3x the bandwidth of current Framework AMD boards (128-bit x 5200 MT/s vs 256-bit x 8000 MT/s). That also matches the Qwen numbers from the screenshot above.
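As a rough sanity check, the usual bandwidth-bound estimate (assuming each generated token has to stream all active weights from memory; a 70B Q4 is about 40 GB):

$$\text{tokens/s} \approx \frac{\text{memory bandwidth}}{\text{model size}} = \frac{256\ \text{GB/s}}{40\ \text{GB}} \approx 6\ \text{tokens/s}$$

Real numbers land somewhat below that ceiling once overhead is included.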

Prompt processing is said to be compute-bound. Compared to the 7940HS and similar chips it’s 12 CUs vs 40 CUs, with each CU probably a bit faster and roughly the same watts per CU. I’d expect prompt processing to be 3-5 times faster.

1 Like

I feel like that was self-evident from my post lol, I know it will be different. We can assume they’re going to be at least as fast as the numbers I posted above, almost certainly faster.

What I want to know is actual numbers, which we know they have, as they’ve spoken multiple times about running a 70B model at “conversational speed”.

1 Like

My suspicion (and the reason I haven’t pre-ordered) is that the limited bandwidth is going to render the capacity pointless for inference - yes, you’ll be able to load 70B Q4 models in memory, with a ton of context even, but the numbers I’ve seen from Apple’s Max chips with more than 2x the bandwidth make me think 256GB/s is going to be unusably slow for >40GB models.

I’d love to be proven wrong, but I guess nobody has a 128GB machine in the wild to test with yet.

The number I’m looking for is a 70B model (40 GiB in size) using a 24 GiB GPU for more than half the model. It seems that model speed is constrained by the slowest component, so that leaves 16 GiB running from main memory. 256 GB/s divided by the 16 GiB spillover gives 16 tokens/s or so. That isn’t too bad in my book.

If you want some numbers:
On a FW16 with only the 7940HS (no dGPU) and 64GB of RAM I get:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.94 GiB | 8.03 B | Vulkan | 999 | pp512 | 221.17 |
| llama 8B Q4_K - Medium | 4.94 GiB | 8.03 B | Vulkan | 999 | tg16 | 15.91 |
| llama 8B Q4_K - Medium | 4.94 GiB | 8.03 B | CPU | - | pp512 | 87.73 |
| llama 8B Q4_K - Medium | 4.94 GiB | 8.03 B | CPU | - | tg16 | 11.96 |
| llama 70B Q4_K - Medium | 40.32 GiB | 70.55 B | CPU | - | pp512 | 9.16 |
| llama 70B Q4_K - Medium | 40.32 GiB | 70.55 B | CPU | - | tg16 | 1.36 |
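For anyone who wants to reproduce this on their own hardware: the table is llama-bench output. Something along these lines should give comparable rows (a sketch; the model paths are placeholders, and the CPU rows probably came from a CPU-only build rather than -ngl 0):

```bash
# Vulkan rows: offload everything to the iGPU, 512-token prompt, 16-token generation
./llama-bench -m models/llama-8b-Q4_K_M.gguf -ngl 999 -p 512 -n 16

# CPU rows: keep all layers on the CPU (or use a CPU-only build)
./llama-bench -m models/llama-8b-Q4_K_M.gguf -ngl 0 -p 512 -n 16
./llama-bench -m models/llama-70b-Q4_K_M.gguf -ngl 0 -p 512 -n 16
```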

So what we can hope for on the AI Max (roughly 3x the FW16 numbers above, in line with the ~3x memory bandwidth):

  • 8B should reach ~45 tg/s (not the 36 seen with LM Studio), with some optimisation.
  • 70B: ~4.5 tg/s (humans read at about 5 words/s…)

For prompt processing I think you can get >70 tokens/s.

1 Like

For now with llama.cpp (no choice for Q4_K_M), tensors that are not offloaded are copied to the GPU at runtime, and in that case the speed for that part is whatever the PCIe bus gives you (x8/x4?).
With 24GB, using 20GB for the weights is tight (you need some more for the KV cache / compute tensors), so no more than half of the layers will be fully offloaded.
And then for each token you need the GPU time plus the time for the non-offloaded part; they can’t run in parallel :wink:
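To make the split concrete, a partial offload looks roughly like this (a sketch; the model path is a placeholder, and on a 24GB card you’d tune -ngl down until it stops running out of memory):

```bash
# -ngl sets how many layers go to the GPU (a 70B model has ~80 layers);
# the rest stays in system RAM and is handled on the CPU side each token,
# so the slower part ends up dominating the per-token time.
./llama-cli -m models/llama-70b-Q4_K_M.gguf -ngl 40 -c 4096 -p "Hello"
```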

256 GB/s is what the Max gives you as a GPU; don’t add a dGPU thinking it will be faster for that case :wink: (the link to the dGPU is only PCIe x4…)

New LLM benchmark database… can’t wait for AI Max results. For now I’ve reported some with the FW16 CPU :wink:

1 Like

I was initially quite excited about inference on the new FW Desktop - but after all the struggles I’ve had with ROCm on the FW16 (complicated by NixOS) I thought I would wait and see. I am very encouraged by the work going on here.

You can use the Vulkan backend (LM Studio/llama.cpp); it’s simpler to use than ROCm (ROCm should look better with the upcoming Fedora 42). With kernel >6.10 all of GTT (i.e. 1/2 of the RAM by default) can be used, so no problem running Q8/Q6 for 8B/12B models with 32GB of RAM, or 24B with 64GB…

But even with the Vulkan backend I get some crashes with big models from time to time.

I’m working on a special backend for llama.cpp with no limit other than RAM, but for now only with BF16/FP16 weights; I need more time to add FP8. I have some hope it will give good perf :crossed_fingers:

2 Likes

Please benchmark LLMs that take more VRAM (shared RAM) than what you can get on a discrete graphics card. Not at Q4 and not a distilled version of DeepSeek R1.

Use LM Studio on Ryzen AI or llama.cpp Vulkan.
Try QwQ 32B Q8 with various context windows and give us tokens per second.
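Something like this would be a useful way to report it (a sketch; the GGUF filename is a placeholder, and I’m assuming llama-bench’s comma-separated lists to sweep prompt sizes as a stand-in for different context fills):

```bash
# QwQ 32B at Q8_0, fully offloaded, with several prompt lengths and a
# 128-token generation run for tokens/s.
./llama-bench -m models/QwQ-32B-Q8_0.gguf -ngl 999 -p 512,2048,8192 -n 128
```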

AMD had a slide with <16GB models, which is really pointless because anyone buying a 128GB Strix Halo isn’t doing so for <16GB models they could just run heaps faster on a 16, 24 or 32GB dGPU.
And Q4 loses too much in quantization, unless maybe it’s using QAT like
google/gemma-3-27b-it-qat-q4_0-gguf

GTT by default takes 50% of RAM.
You can adjust that up to 100% if you wish via config, but I normally do 80-90% so other bits, like the OS, have some RAM.
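For anyone who hasn’t done it before, this is the sort of config I mean (a sketch; I’m assuming the amdgpu.gttsize module parameter, which takes a value in MiB; check your kernel’s module docs before copying this):

```bash
# See what GTT size the driver currently reports
sudo dmesg | grep -i "amdgpu.*GTT"

# Raise GTT to roughly 80% of 64GiB (value in MiB), then regenerate the
# initramfs and reboot for it to take effect.
echo "options amdgpu gttsize=52428" | sudo tee /etc/modprobe.d/amdgpu-gtt.conf
```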

Well, I do 14B models (~8-9GB) fairly well.

Not super speed, but good enough.

For anything beyond that, I use iGPU-only, because dGPU+GTT or dGPU+iGPU is pretty bad.

Yes, I know (that’s why I wrote “by default”).

What’s more, it is possible to allocate with UMA with no limit other than RAM (not sure what is used for now in llama.cpp with the Vulkan backend; for the “CUDA” (HIP) backend you need to build with the -DGGML_HIP_UMA option).
And as I said, my test backend uses IHM and allocates everything in RAM without limit. (The catch is that the more RAM I use (or the bigger the model is), the more crashes I get… I haven’t tested with my latest update (FC42 + kernel 6.14.2 for now…))
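For reference, the build with that option looks roughly like this (a sketch: GGML_HIP_UMA is the option mentioned above; the GGML_HIP flag name and the gfx target vary with the llama.cpp version and the APU, so treat them as placeholders):

```bash
# Build llama.cpp's HIP ("CUDA") backend with UMA allocation, so weights can
# live in ordinary system RAM instead of the carved-out VRAM/GTT pool.
cmake -B build -DGGML_HIP=ON -DGGML_HIP_UMA=ON -DAMDGPU_TARGETS=gfx1103
cmake --build build --config Release -j
```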

Same. On the latest mainline kernels and ollama, loading >30GB of weights is nearly guaranteed to crash (either a driver crash or a reboot). The drivers provided by AMD for 6.11 work about 10 times better, but still crash from time to time. Sometimes memory allocation starts failing until the next boot.

I hope FW and AMD will figure it out for the AI platforms, otherwise it makes the AI part pointless.

1 Like

Absolutely. The reliability to support that use case has to be there, otherwise it’s just moot.

Yep, announcements are one thing, real-world performance is another.

I’m very keen to see Max+ 395 products ship to testers and for them to report, and to hear from AMD in June at their AI event.

There is one difference with the Max: AMD lists it (System requirements (Windows) — HIP SDK installation (Windows)), at least for Windows, as supported by ROCm… This is not the case for the 7840HS/7940HS; it can be enabled so it works sometimes, but AMD does not prioritize solving the “bugs”.

And maybe we need a more updated BIOS on the Framework 16, too…