A bit more of a followup, since I was curious why there were reports of a big performance (speed) difference between unsloth/gpt-oss-120b-GGUF (61 GiB) and ggml-org/gpt-oss-120b-GGUF (60 GiB). The ggml-org model runs significantly faster:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 684.08 ± 5.42 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 41.52 ± 0.03 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-120b-F16.gguf
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 642.59 ± 28.32 |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 29.72 ± 0.03 |
These both claim to be MXFP4 quants, and they’re very close in size, so what’s going on?
The easiest way to spot the difference is using the new HF file info viewer:
There are a couple of interesting differences (some might be artifacts of the file viewer; others, like the tokenizer, may cause inference quality differences that need to be looked into), but the most important part is the Tensors view. There you can see that the ggml-org model uses Q8_0 for the embedding layer weights vs F16 for Unsloth (the embedding layer runs on the CPU). The attention and output tensors show a similar difference.
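If you'd rather check this locally instead of trusting the file viewer, the gguf Python package (llama.cpp's gguf-py, on PyPI as `gguf`) can read the tensor metadata straight from the file. This is just a minimal sketch, and the file name is a placeholder for whichever GGUF you have on disk:

```python
# pip install gguf   (this is llama.cpp's gguf-py package)
from gguf import GGUFReader

# Placeholder path: point at whichever GGUF you downloaded.
reader = GGUFReader("gpt-oss-120b-F16.gguf")

for t in reader.tensors:
    # Only show the embedding, attention, and output tensors (standard GGUF
    # naming: token_embd.weight, blk.N.attn_*.weight, output.weight); the
    # expert tensors are skipped to keep the output short.
    if not any(k in t.name for k in ("token_embd", "attn", "output")):
        continue
    print(f"{t.name:40} {t.tensor_type.name:8} {t.n_bytes / 2**20:10.1f} MiB")
```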
Life’s too short to download a lot of 120B models I’m never going to use, but I was curious enough to grab a few of the 20Bs just to do a sanity check. There’s definitely a speed hit from leaving those layers unquantized, so how much faster do you actually get by quantizing them? (Note: the standard deviation on pp512 was huge for some reason, but the results were all in the same ballpark, so I’m just focusing on tg128.)
| model | size | test | t/s |
|---|---|---|---|
| ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
| unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
| unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
| unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
| unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
| unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
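If you want to see which tensors each of these variants actually touches (rather than going by the file size and name alone), totaling the bytes per storage type with the same gguf package is a quick check. Again just a sketch, with a placeholder file name:

```python
from collections import defaultdict
from gguf import GGUFReader

def bytes_per_type(path: str) -> dict[str, float]:
    """Sum the on-disk tensor bytes for each storage type, in GiB."""
    totals: dict[str, float] = defaultdict(float)
    for t in GGUFReader(path).tensors:
        totals[t.tensor_type.name] += t.n_bytes / 2**30
    return dict(totals)

# Placeholder file name: use whichever of the 20B variants you grabbed.
for type_name, gib in sorted(bytes_per_type("gpt-oss-20b-Q8_0.gguf").items()):
    print(f"{type_name:8} {gib:6.2f} GiB")
```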
So this is interesting: about as expected, except that the ggml-org MXFP4 and the unsloth Q8_0 should be basically the exact same quants. What’s going on? I reran them in reverse order with more repetitions, spaced a few minutes apart.
This brought the numbers much closer, good enough for me:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-20b-Q8_0.gguf -r 20
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1411.20 ± 62.65 |
| gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.68 ± 0.01 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -r 20
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1429.01 ± 12.75 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.96 ± 0.01 |
Now, as for how these different quants compare in quality? That needs to be tested empirically by someone who actually cares about using these models.
Truthfully, I’m not very satisfied by the common PPL and KLD numbers used to profile quant quality, as I feel like they only very roughly represent actual downstream task performance. I’m much more of a functional eval guy. For my Shisa V2 405B model, for example, I ran JA MT-Bench as a proxy for my use cases, and surprisingly, IQ3_M, Q4_K_M, and Q8_0 (and my preferred W8A8-INT8 for production serving) were all pretty close: shisa-ai/shisa-v2-llama3.1-405b-GGUF · Hugging Face
Of course, this says nothing about the actual quality differences for a tiny-parameter MoE; I just feel like it might be a useful (primarily methodological) data point for anyone who does want to embark on quality testing.