A bit more of a followup, since I was curious why there were reports of a big performance (speed) difference between unsloth/gpt-oss-120b-GGUF (61 GiB) and ggml-org/gpt-oss-120b-GGUF (60 GiB). The ggml-org model runs significantly faster:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 684.08 ± 5.42 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 41.52 ± 0.03 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-120b-F16.gguf
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 642.59 ± 28.32 |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 29.72 ± 0.03 |
These both claim to be MXFP4 quants, and they’re very close in size, so what’s going on?
The easiest way to spot the difference is using the new HF file info viewer:
There are a couple of interesting differences (some might be artifacts of the file viewer; others, like the tokenizer, may cause inference quality differences that need to be looked into), but the most important part is the Tensors view. There you can see that the ggml-org model uses Q8_0 for the embedding layer weights vs F16 for Unsloth (the embedding layer runs on the CPU). The attention and output tensors show a similar difference.
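If you'd rather check this locally instead of trusting the file viewer, the gguf Python package (llama.cpp's gguf-py, on PyPI as `gguf`) can read the tensor metadata straight from the file. This is just a minimal sketch, and the file name is a placeholder for whichever GGUF you have on disk:

```python
# pip install gguf   (this is llama.cpp's gguf-py package)
from gguf import GGUFReader

# Placeholder path: point at whichever GGUF you downloaded.
reader = GGUFReader("gpt-oss-120b-F16.gguf")

for t in reader.tensors:
    # Only show the embedding, attention, and output tensors (standard GGUF
    # naming: token_embd.weight, blk.N.attn_*.weight, output.weight); the
    # expert tensors are skipped to keep the output short.
    if not any(k in t.name for k in ("token_embd", "attn", "output")):
        continue
    print(f"{t.name:40} {t.tensor_type.name:8} {t.n_bytes / 2**20:10.1f} MiB")
```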
Life’s too short to download a lot of 120B models I’m never going to use, but I was curious enough to grab a few of the 20Bs just to do a sanity check. There’s definitely a speed hit from leaving those layers unquantized, so how much faster do you actually get by quantizing them? (Note: the standard deviation on pp512 was huge for some reason, but the results were all in the same ballpark, so I’m just focusing on tg128.)
| model | size | test | t/s |
|---|---|---|---|
| ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
| unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
| unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
| unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
| unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
| unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
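If you want to see which tensors each of these variants actually touches (rather than going by the file size and name alone), totaling the bytes per storage type with the same gguf package is a quick check. Again just a sketch, with a placeholder file name:

```python
from collections import defaultdict
from gguf import GGUFReader

def bytes_per_type(path: str) -> dict[str, float]:
    """Sum the on-disk tensor bytes for each storage type, in GiB."""
    totals: dict[str, float] = defaultdict(float)
    for t in GGUFReader(path).tensors:
        totals[t.tensor_type.name] += t.n_bytes / 2**30
    return dict(totals)

# Placeholder file name: use whichever of the 20B variants you grabbed.
for type_name, gib in sorted(bytes_per_type("gpt-oss-20b-Q8_0.gguf").items()):
    print(f"{type_name:8} {gib:6.2f} GiB")
```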
So this is interesting: about as expected, except that the ggml-org MXFP4 and the unsloth Q8_0 should be basically the exact same quants. What’s going on? I reran them in reverse order with more repetitions, spaced a few minutes apart.
This brought the numbers much closer, good enough for me:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-20b-Q8_0.gguf -r 20
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1411.20 ± 62.65 |
| gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.68 ± 0.01 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -r 20
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1429.01 ± 12.75 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.96 ± 0.01 |
Now, as for how these different quants compare in quality? That needs to be tested empirically by someone who actually cares about using these models.
Truthfully, I’m not very satisfied by the common PPL and KLD numbers used to profile quant quality, as I feel like they only very roughly represent actual downstream task performance. I’m much more of a functional eval guy. For my Shisa V2 405B model, for example, I ran JA MT-Bench as a proxy for my use cases, and surprisingly, IQ3_M, Q4_K_M, and Q8_0 (and my preferred W8A8-INT8 for production serving) were all pretty close: shisa-ai/shisa-v2-llama3.1-405b-GGUF · Hugging Face
Of course, this says nothing about the actual quality differences for a tiny-parameter MoE; I just feel like it might be a useful (primarily methodological) data point for anyone who does want to embark on quality testing.