Will the AI Max+ 395 (128GB) be able to run gpt-oss-120b?

A bit more of a follow-up, since I was curious why there were reports of a big performance (speed) difference between unsloth/gpt-oss-120b-GGUF (61 GiB) and ggml-org/gpt-oss-120b-GGUF (60 GiB). The ggml-org model runs significantly faster:

```
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf
| model                | size      | params   | backend | ngl | fa | test  | t/s            |
| -------------------- | --------- | -------- | ------- | --- | -- | ----- | -------------- |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 1  | pp512 | 684.08 ± 5.42  |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 1  | tg128 | 41.52 ± 0.03   |

❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-120b-F16.gguf
| model                | size      | params   | backend | ngl | fa | test  | t/s            |
| -------------------- | --------- | -------- | ------- | --- | -- | ----- | -------------- |
| gpt-oss ?B F16       | 60.87 GiB | 116.83 B | ROCm    | 99  | 1  | pp512 | 642.59 ± 28.32 |
| gpt-oss ?B F16       | 60.87 GiB | 116.83 B | ROCm    | 99  | 1  | tg128 | 29.72 ± 0.03   |
```

These both claim to be MXFP4 quants, and they’re very close in size, so what’s going on?

The easiest way to spot the difference is to use the new HF file info viewer:

There are a couple of interesting differences (some might be artifacts of the file viewer; others, like the tokenizer, may cause inference-quality differences that need to be looked into), but the most important part is the Tensors section. There you can see that the ggml-org model uses Q8_0 for the embedding layer weights vs F16 for Unsloth (this layer runs on the CPU). The attention and output tensors show a similar difference.
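
If you'd rather check this locally than rely on the HF viewer, the gguf-py package that ships with llama.cpp can dump the per-tensor types directly. A minimal sketch, assuming `pip install gguf` and the file names from the benchmarks above (for split GGUFs, each shard only lists its own tensors):

```python
# Compare per-tensor quantization types between the two downloads.
# File names are the ones used in the benchmarks above; only the first
# shard of the split ggml-org file is inspected here.
from gguf import GGUFReader

def tensor_types(path):
    reader = GGUFReader(path)
    return {t.name: t.tensor_type.name for t in reader.tensors}

ggml_org = tensor_types("ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf")
unsloth = tensor_types("gpt-oss-120b-F16.gguf")

# Print tensors present in both files whose types differ (e.g. Q8_0 vs F16)
for name in sorted(ggml_org.keys() & unsloth.keys()):
    if ggml_org[name] != unsloth[name]:
        print(f"{name:45s} ggml-org={ggml_org[name]:6s} unsloth={unsloth[name]}")
```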

Life’s too short to download a lot of 120B models I’m never going to use, but I was curious enough to grab a few of the 20B versions just to do a sanity check. There’s definitely a speed hit, so how much speed do you actually gain by quantizing those layers? (Note: the standard deviation on pp512 was huge for some reason, but all the results were in the same ballpark, so I’m just focusing on tg128.)

| model | size | test | t/s |
| --- | ---: | --- | ---: |
| ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
| unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
| unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
| unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
| unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
| unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
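
For what it's worth, a Q8_0 requant along the lines of the unsloth one above can also be produced locally from the F16 file with llama.cpp's llama-quantize. A rough sketch via subprocess, assuming the same llama.cpp build as in the benchmarks; the output name is a placeholder, and exactly how the already-MXFP4 expert tensors get handled (kept vs. requantized) may depend on the llama.cpp version:

```python
# Requantize the unsloth F16 GGUF to Q8_0 so the F16 embedding/attention/
# output tensors are shrunk. Paths follow the ones used above; the output
# file name is a placeholder.
import os
import subprocess

LLAMA_QUANTIZE = os.path.expanduser("~/llama.cpp-lhl/build/bin/llama-quantize")

subprocess.run(
    [
        LLAMA_QUANTIZE,
        "--allow-requantize",             # in case already-quantized source tensors need requantizing
        "gpt-oss-20b-F16.gguf",           # source: unsloth F16 variant
        "gpt-oss-20b-Q8_0-requant.gguf",  # output (placeholder name)
        "Q8_0",                           # target quantization type
    ],
    check=True,
)
```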

This is interesting: about as expected, except that the ggml-org MXFP4 and the unsloth Q8_0 should be basically the exact same quants, so why the gap? I reran them in reverse order with more repetitions, spaced a few minutes apart.

This brought the numbers much closer, good enough for me:

```
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-20b-Q8_0.gguf -r 20
| model                | size      | params  | backend | ngl | fa | test  | t/s             |
| -------------------- | --------- | ------- | ------- | --- | -- | ----- | --------------- |
| gpt-oss ?B Q8_0      | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | pp512 | 1411.20 ± 62.65 |
| gpt-oss ?B Q8_0      | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | tg128 | 61.68 ± 0.01    |

❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -r 20
| model                | size      | params  | backend | ngl | fa | test  | t/s             |
| -------------------- | --------- | ------- | ------- | --- | -- | ----- | --------------- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | pp512 | 1429.01 ± 12.75 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | tg128 | 61.96 ± 0.01    |
```
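
If anyone wants to repeat this kind of back-to-back comparison, it's easy to take run order and warm-up out of the equation by scripting it. A minimal sketch, assuming the same llama.cpp build and GGUF files as above:

```python
# Run the same llama-bench invocation over both files with a cool-down in
# between, so ordering and thermals don't skew the comparison.
import os
import subprocess
import time

LLAMA_BENCH = os.path.expanduser("~/llama.cpp-lhl/build/bin/llama-bench")
MODELS = [
    "gpt-oss-20b-Q8_0.gguf",
    "ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf",
]

for path in MODELS:
    subprocess.run([LLAMA_BENCH, "-fa", "1", "-r", "20", "-m", path], check=True)
    time.sleep(180)  # a few minutes between runs, as in the rerun above
```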

As for how these different quants compare in output quality? That needs to be empirically tested by someone who actually cares about using these models.

Truthfully, I’m not very satisfied by the common PPL and KLD numbers used to profile quant quality, as I feel they only very roughly represent actual downstream task performance. I’m much more of a functional-eval guy. For my Shisa V2 405B model, for example, I ran JA MT-Bench as a proxy for my use cases and, surprisingly, IQ3_M, Q4_K_M, and Q8_0 (and my preferred W8A8-INT8 for production serving) were all pretty close: shisa-ai/shisa-v2-llama3.1-405b-GGUF · Hugging Face

Of course, that says nothing about the actual quality differences for a tiny-parameter MoE; I just feel it might be a useful (primarily methodological) data point for anyone who does want to embark on quality testing.
