Framework Desktop for Local AI

Hi all, I am interested in getting a capable PC for local AI. The Framework Desktop seems to be a great choice for it

However, I have been looking at the forum, and it seems there are quite a few compatibility issues. But I also see that things seem to have improved in the latest releases: Linux + ROCm: January 2026 Stable Configurations Update

Do you think the Framework is worth getting for local AI? (I will most likely use it only for this purpose, as I have a laptop as a daily driver.)

Or should I build my own PC around an RTX 3090?

Or should I wait for the AMD Ryzen AI Halo?

Or should I get a Mac mini M4 Pro 64GB?

Which is the most performant?

And what do you think overall?

Thanks!

I use 3 Strix Halo machines (one of which is a Framework Desktop) in a cluster for local AI. Whether it’s worth getting over the other options depends on which models you want to run. I was aiming for large LLMs like GLM (currently testing GLM 5 performance to see if I can replace the 4.6 I’m mostly using), and for me it works great. It’s fast, the thermals are decent, and I’ve yet to have an issue (I’m using Fedora 43, so your mileage may vary).


With the FD, and Strix Halo as an architecture, what you get is:

  • A large(r) memory pool than a PC
  • That is FAR faster than you can get on a traditional PC, and can be used for either CPU or GPU compute. To get a PC with 100GB of GPU VRAM you’ll spend $10,000+ USD.

But:

  • AMD AGESA code has always been “bleeding edge” and has always had teething pains. Strix Halo isn’t the newest anymore, so its bugs are more ironed out than they were.
  • AMD ROCm is in very active development. Getting it to run and work is a project. It isn’t a set-it-and-forget-it solution, yet.
  • CUDA is simply more mature…but you lose out on the pros above.

A Mac Mini would have nearly half the memory bandwidth (120GB/s vs. 200GB/s), can’t be configured with nearly as much memory at the top end, and has much less CPU compute, while costing more. Does that matter to your application? IDK. BUT: with the FD you do get up to 128GB of memory, but are you running models that can use that pool of memory and still get acceptable token rates?

Whereas a Mac Studio M4 would have double the memory bandwidth (500+GB/s) of the Framework Desktop, but getting the same amount of RAM would cost about 50% more, because Apple’s pricing on memory and storage has always been extremely high, since their memory and drives are fast.


Thanks for the replies!

My main use case would be local inference for development.

Any data on tokens/second for models like the latest Qwen 3.5 27B on the Framework?

@entropy4936 what are your other two Strix Halo machines?

Both are Minisforum MS-S1 MAX. At the time I got them, they were actually cheaper than the Framework Desktop (because of the launch promo deal), so I went with them, cancelling the Desktop preorder I had back then.


I realize this is late, but maybe it’ll be useful. These are llama-bench results; I get nearly identical tps with a test prompt.

Qwen 3.5 27B runs fairly slow:

$ llama-bench -m ./unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q4_K_XL.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 ?B Q4_K - Medium        |  16.40 GiB |    26.90 B | ROCm       |  99 |           pp512 |        299.82 ± 4.47 |
| qwen35 ?B Q4_K - Medium        |  16.40 GiB |    26.90 B | ROCm       |  99 |           tg128 |         10.60 ± 0.01 |

$ llama-bench -m ./unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q6_K_XL.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 ?B Q6_K                 |  23.90 GiB |    26.90 B | ROCm       |  99 |           pp512 |        250.13 ± 3.50 |
| qwen35 ?B Q6_K                 |  23.90 GiB |    26.90 B | ROCm       |  99 |           tg128 |          7.70 ± 0.00 |

$ llama-bench -m ./unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q8_K_XL.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 ?B Q8_0                 |  33.08 GiB |    26.90 B | ROCm       |  99 |           pp512 |        288.82 ± 4.64 |
| qwen35 ?B Q8_0                 |  33.08 GiB |    26.90 B | ROCm       |  99 |           tg128 |          5.94 ± 0.00 |

Qwen 3.5 35B-A3B runs much better:

$ llama-bench -m ./unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf 
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe ?B Q6_K              |  29.86 GiB |    34.66 B | ROCm       |  99 |           pp512 |        775.93 ± 3.43 |
| qwen35moe ?B Q6_K              |  29.86 GiB |    34.66 B | ROCm       |  99 |           tg128 |         36.91 ± 0.06 |

$ llama-bench -m ./unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf 
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe ?B Q8_0              |  45.33 GiB |    34.66 B | ROCm       |  99 |           pp512 |        609.43 ± 3.95 |
| qwen35moe ?B Q8_0              |  45.33 GiB |    34.66 B | ROCm       |  99 |           tg128 |         25.21 ± 0.01 |
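Token generation on these machines is largely memory-bandwidth bound, which is why the dense 27B is so much slower than the 35B-A3B MoE (which only reads its ~3B active parameters per token). A rough back-of-envelope sketch, using the ~200GB/s bandwidth figure quoted earlier in the thread (an assumption, not a measurement on this hardware):

```python
# Rough sanity check: for a dense model, every weight is read once per
# generated token, so tg t/s is roughly bounded by bandwidth / model size.
GIB = 1024 ** 3

def est_tg(bandwidth_gb_s, weights_gib):
    """Rough estimate of dense-model tokens/s from memory bandwidth."""
    return bandwidth_gb_s * 1e9 / (weights_gib * GIB)

# Dense 27B at Q4_K (16.40 GiB of weights), assuming ~200 GB/s:
print(round(est_tg(200, 16.40), 1))  # ~11.4 t/s, close to the measured 10.60
```

The estimate lands near the measured tg128 numbers for the dense model; for the MoE the same logic applies to the active parameters only, which is why it generates several times faster despite being a bigger download.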

Thanks, that’s really useful

What are tg128 and pp512?


pp512 is the prompt-processing rate with a 512-token input: how many prompt tokens are processed per second.

tg128 is the token-generation rate over a 128-token output: how many tokens are generated per second.

Those are the defaults when running llama-bench.
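Since both are rates in tokens/s, they convert directly into latency: per-token generation latency is the inverse of the tg rate, and time-to-first-token is roughly prompt length divided by the pp rate. A small sketch using numbers from the tables above (the 2000-token prompt is just an example):

```python
def ms_per_token(tg_tps):
    # per-token generation latency in milliseconds
    return 1000.0 / tg_tps

def ttft_seconds(prompt_tokens, pp_tps):
    # rough time-to-first-token: prompt length / prompt-processing rate
    return prompt_tokens / pp_tps

# Qwen 3.5 35B-A3B Q6_K above: tg128 = 36.91 t/s, pp512 = 775.93 t/s
print(round(ms_per_token(36.91), 1))        # 27.1 ms per generated token
print(round(ttft_seconds(2000, 775.93), 1)) # 2.6 s to process a 2000-token prompt
```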

Looks like I should give Qwen 3.5 35B a go. Tripling token generation is a great improvement.

I will come back on the weekend with some Win11 results.
