TBH I think I’m going to cancel my pre-order until some genuine numbers are released, rather than speculate over device X doing Y tokens a second. I’m sure Framework could get something working, even if they preface it with ‘hey, this is just a basic test and not all features are fully working’ – just give us something.
What should we test? Any ideas or suggestions?
For me – ollama run --verbose on a few popular models of different sizes, to check tokens/s and prompt eval rate. I’d like to compare 1:1 to my 7940 machine; a comparison to the recent 13" AI platforms would also be interesting.
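Roughly something like this, as a sketch (the model tag is just an example – any of the popular sizes would do):

```
# --verbose makes ollama print timing stats after each response
ollama run llama3.1:8b --verbose
# the summary it prints includes lines like:
#   prompt eval rate:   ... tokens/s   <- prompt processing speed
#   eval rate:          ... tokens/s   <- generation speed, the headline tokens/s figure
```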
The biggest thing in my opinion is probably models above 40GB in size - 70B q4 and up, maybe some 32B with 130k context.
This chip is being marketed hard as a device for running inference with models that won’t easily fit in consumer GPU VRAM, but basically nobody seems to be testing it with models above 13B parameters.
It would be nice to see if it’s actually fast enough to provide usable performance for this use case.
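As a concrete sketch of that kind of test (the tags and the context value below are illustrative – check the ollama library for the exact names):

```
# ~40GB of q4 weights, i.e. the 70B-class case mentioned above (tag is an example)
ollama run llama3.3:70b --verbose

# 32B with long context: start the model, then raise num_ctx in the REPL
# before sending the prompt (131072 is just a 128k-class value)
ollama run qwen2.5:32b --verbose
#   /set parameter num_ctx 131072
```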
That page now says:
2025-05-30 UPDATE: I am now able to reveal that all my Strix Halo testing has been done on pre-release Framework Desktop systems. Per the published specs page, it is able to boost to 140W and sustain at 120W. I won’t be going deep into any hardware/system benchmarks (will leave it for others) but in my llama-bench runs it does not appear to thermal throttle.
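(For anyone who hasn’t used it, llama-bench ships with llama.cpp; a rough sketch of that kind of sustained run – the model path and values are placeholders, not his actual setup:)

```
# llama-bench reports prompt processing (pp) and token generation (tg) rates in tokens/s;
# -ngl 99 offloads all layers to the GPU, the model path is a placeholder
./llama-bench -m models/llama-3.3-70b-instruct-q4_k_m.gguf -ngl 99 -p 512 -n 128
# a run like this keeps the chip loaded long enough that throttling would show up
# as the tg numbers sagging across repetitions
```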
Someone finally tested and put out results with large models for this SiP, although with a different manufacturer’s product:
He appears to be using LM Studio with the Vulkan backend and getting about 5 tps with Llama 70B q4 (a 40GB model). A little better than I expected, actually, but below the threshold I personally would consider usable (for me that’s around 8 tps; obviously ‘usable’ is subjective).
32B looks fine-ish though.