Now that we have some real-world tests on the Strix Halo in inference engines like llama.cpp thanks to the release of the Asus Z13, can the Framework team give us some hard numbers on the Desktop?
I understand that support for this chipset, and ROCm support in general, is still developing. However, now that we have numbers for a tablet, I'd love to see some numbers for what that giant heatsink and higher TDP are capable of.
AI inference is the entire reason I'm interested in this device.
Mobile CPUs are very dependent on the manufacturer's configuration, to the point that they cannot really be compared device-to-device. The Z13 has a TDP of 40-80W. Framework Desktop is, to my knowledge, the only device using this chip that allows the full 120-240W power draw. Performance will likely be very different.
Tokens/s is mostly proportional to memory bandwidth, and Strix Halo has about 3x the bandwidth of current Framework AMD boards (128-bit x 5200 MT/s vs 256-bit x 8000 MT/s). That also matches the Qwen numbers from the screenshot above.
Prompt processing is said to be compute-bound; compared to the 7940 and similar chips it's 12 CU vs 40 CU, with each CU probably a bit faster at the same watts per CU. I'd expect prompt processing to be 3-5 times faster.
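A quick back-of-envelope in Python for both points, using the figures above (peak theoretical numbers; sustained bandwidth and real-world scaling will be lower):

```python
# Token generation: roughly bandwidth-bound.
# Peak bandwidth = bus width in bytes * transfer rate (MT/s).
def peak_bw_gbs(bus_bits: int, mts: int) -> float:
    return (bus_bits / 8) * mts / 1000

current = peak_bw_gbs(128, 5200)   # current Framework AMD boards -> ~83 GB/s
strix   = peak_bw_gbs(256, 8000)   # Strix Halo                   -> ~256 GB/s
print(f"token-gen ratio ~{strix / current:.1f}x")   # ~3.1x

# Prompt processing: roughly compute-bound, so scale by CU count
# times an assumed 1.0-1.5x per-CU speedup.
cu_ratio = 40 / 12
print(f"prompt-processing ~{cu_ratio:.1f}x to ~{cu_ratio * 1.5:.1f}x")   # ~3.3x to ~5.0x
```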
I feel like that was self-evident from my post lol, I know it will be different. We can assume they're going to be at least as fast as the numbers I posted above, almost certainly faster.
What I want are actual numbers, which we know they have, as they've spoken multiple times about running a 70B at "conversational speed".
My suspicion (and the reason I haven't pre-ordered) is that the limited bandwidth is going to render the capacity pointless for inference - yes, you'll be able to load 70B Q4 models in memory, with a ton of context even, but the numbers I've seen from Apple's Max chips with more than 2x the bandwidth make me think 256GB/s is going to be unusably slow for >40GB models.
I'd love to be proven wrong, but I guess nobody has a 128GB machine in the wild to test with yet.
The number I'm looking for is a 70B model (40 GiB in size) using a 24 GiB GPU to hold more than half the model. It seems that model speed is constrained by the slowest component, and that leaves 16 GiB running from system RAM on the motherboard. 256 GiB/s divided by 16 GiB of spillover gives 16 tokens/sec or so. That isn't too bad in my book.
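A minimal sketch of that estimate, with the same assumed figures (a hypothetical 24 GiB GPU, peak rather than sustained bandwidth):

```python
# Per token, the weights left in system RAM must be streamed once, so
# RAM bandwidth / spillover size gives a rough upper bound on tokens/s.
model_gib = 40                     # ~70B at Q4
vram_gib  = 24                     # hypothetical discrete GPU
spill_gib = model_gib - vram_gib   # 16 GiB left in system RAM
ram_bw    = 256                    # GiB/s, Strix Halo peak

print(f"~{ram_bw / spill_gib:.0f} tok/s upper bound")   # ~16 tok/s
```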
For now with llama.cpp (and there's no choice about this for Q4_K_M), tensors that are not offloaded are copied to the GPU at runtime; in that case the speed for this part is that of the PCIe bus (x8/x4?).
With 24GB, using 20GB for the weights is already tight (you need some more for the KV cache / compute tensors), so no more than half of the layers will be completely offloaded.
And then, for each token, you need the time spent on the GPU plus the time needed for the non-offloaded part; they can't run in parallel.
256GiB/s is a MAX when using it as a GPU; don't add a discrete GPU expecting that case to be faster (the link to the GPU is only PCIe x4 …)
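To illustrate the serial-timing point, a small sketch with made-up figures (the VRAM bandwidth and the 20/20 split are purely illustrative assumptions):

```python
# Each token needs the offloaded layers (read from VRAM) *and* the
# non-offloaded layers (read from system RAM); the two times add up.
gpu_gib, gpu_bw = 20, 1000   # weights in VRAM, assumed VRAM bandwidth (GiB/s)
ram_gib, ram_bw = 20, 256    # weights left in system RAM, Strix Halo peak

t_per_token = gpu_gib / gpu_bw + ram_gib / ram_bw
print(f"~{1 / t_per_token:.0f} tok/s")   # ~10 tok/s, below the ~16 tok/s estimate above
```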
I was initially quite excited about inference on the new FW desktop - but after all the struggles I've had with ROCm on the FW16 (complicated by NixOS) I thought I would wait and see. I am very encouraged by the work going on here.
You can use the Vulkan backend (LM Studio / llama.cpp); it's simpler to use than ROCm (ROCm should look better with the upcoming Fedora 42). With kernel >6.10 all of the GTT (i.e. 1/2 of the RAM by default) can be used, so there's no problem running Q8/Q6 for 8B/12B models with 32GB of RAM, or 24B with 64GB …
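To sanity-check those sizes, a small sketch (the bytes-per-parameter figures are rough approximations, not exact GGUF file sizes):

```python
# By default amdgpu exposes roughly half of system RAM as GTT, which the
# Vulkan backend can allocate from.
BYTES_PER_PARAM = {"Q8_0": 1.06, "Q6_K": 0.82}   # approximate

def fits(params_b: float, quant: str, ram_gb: int, headroom_gb: float = 2.0) -> bool:
    gtt_gb = ram_gb / 2   # default GTT limit
    return params_b * BYTES_PER_PARAM[quant] + headroom_gb < gtt_gb

print(fits(12, "Q6_K", 32))   # True: 12B Q6 on a 32 GB machine
print(fits(24, "Q8_0", 64))   # True: 24B Q8 on a 64 GB machine
```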
But even with the Vulkan backend I get some crashes with big models from time to time.
I'm working on a special backend for llama.cpp with no limit other than RAM, but for now only with BF16/FP16 weights; I need more time to add FP8. I have some hope it will deliver good performance.