the gpt-oss-120b model is also misnamed since it’s 117B parameters
117 vs. 120 is still baker's-dozen or rounding-error territory in my book.
Being off by a factor of 4 is way past the line though.
For those who have had a chance to test gpt-oss-120b on the AI Max+ 395 (128 GB), I’d greatly appreciate it if you could share the largest context size you’ve successfully run, and how different context lengths have influenced the speeds.
I'm particularly interested in this for coding use cases, where large context windows can make a big difference.
I asked 5pro to look up why it’s named F16:
Short version: they renamed it “F16” for compatibility/visibility, but it’s still the native MXFP4‑MOE build.
On the Unsloth model page they explicitly say: "This is the MXFP4_MOE quant, now called F16, with our fixes."
In Unsloth's own discussion, a team member explains why: "We named it F16 so it can appear on the HF repo page but yes, it's mostly the same." (They'd briefly published it as `...-MXFP4.gguf`; the SHA256 matched after the rename.)
What's going on under the hood:
GPT-OSS was trained with native MXFP4 precision for the MoE expert matmuls. So even the "F16" GGUF isn't a classic full-FP16 dump; it's the native format (MXFP4 for the MoE) plus other parts in higher precision. That's why the so-called "F16" file is ~65.4 GB for 120B, way smaller than a true FP16 of all 117B weights would be.
Early on, many front-ends and the HF UI didn't yet special-case the new MXFP4/MOE tensor type. Some tools even key logic off the filename; one user noted that renaming to `...f16.gguf` made their frontend stop erroring. Hence the pragmatic "F16" label. llama.cpp added support for native MXFP4 across backends around Aug 5, 2025, which is why you'll also see the ggml/llama.cpp announcement and new quant types referenced.
So the "F16" tag on Unsloth's GPT-OSS GGUFs is a compatibility alias, not a statement that the MoE experts are stored as pure FP16. If you want the "native" OpenAI-style build, use their "F16" file; just interpret "F16" here as "native MXFP4-MOE + Unsloth chat-template/precision fixes," not "fully unquantized FP16 everywhere."
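To put numbers on "way smaller" (my own back-of-envelope, not part of the quoted answer), using the 116.83B parameter count that llama-bench reports further down:

```
# If all ~117B weights were true FP16 (~2 bytes/param):
python3 -c "print(116.83e9 * 2 / 1e9, 'GB')"         # ~234 GB
# If everything were MXFP4 (~4.25 bits/weight: 4-bit values + a shared scale per 32 elements):
python3 -c "print(116.83e9 * 4.25 / 8 / 1e9, 'GB')"  # ~62 GB -- in the ballpark of the actual
                                                      #  ~60-65 GB files, since the MoE experts
                                                      #  dominate the parameter count
```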
I wonder when we'll actually see a Medusa RDNA 3.5 (or 4.0?) APU motherboard with DDR6 ship. 2027 or 2028?
It's like DeWalt and their 'MAX 20V' batteries that are actually 18V nominal; '120b' just sounds more marketing-y than '117b'.
A bit more of a follow-up, since I was curious why there were reports of a big performance (speed) difference between unsloth/gpt-oss-120b-GGUF (61 GiB) and ggml-org/gpt-oss-120b-GGUF (60 GiB). The ggml-org model runs significantly faster:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 684.08 ± 5.42 |
gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 41.52 ± 0.03 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-120b-F16.gguf
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 642.59 ± 28.32 |
gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 29.72 ± 0.03 |
These both claim to be MXFP4 quants, and they’re very close in size, so what’s going on?
The easiest way to spot the difference is using the new HF file info viewer:
There are a couple of interesting differences (some might be artifacts of the file viewer; some, like the tokenizer, may cause inference-quality differences that need to be looked into), but the most important part is looking at the Tensors. Here you can see that the ggml-org model uses Q8_0 for the embedding-layer weights vs F16 for Unsloth (this runs on CPU). It's a similar story for the attention and output layers.
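If you'd rather check locally instead of in the HF viewer, the `gguf` Python package (from llama.cpp's gguf-py) ships a `gguf-dump` CLI that lists every tensor with its type. A rough sketch, using the local filenames from the benchmarks above and an illustrative grep pattern:

```
# Sketch: compare per-tensor dtypes of the two 120B GGUFs locally.
pip install gguf
gguf-dump gpt-oss-120b-F16.gguf > unsloth.txt
gguf-dump ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf > ggml-org.txt
# The embedding / output / attention tensors are where the two builds differ:
grep -E 'token_embd|output|attn_' unsloth.txt ggml-org.txt
```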
Life's too short to download a lot of 120B models I'm never going to use, but I was curious enough to grab a few of the 20Bs just to do a sanity check. There's definitely a speed hit, so how much faster do you actually get by quantizing those layers? (Note: the standard deviation on the pp512 numbers was huge for some reason, but they were all in the same ballpark, so I'm just focusing on tg128.)
model | size | test | t/s |
---|---|---|---|
ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
So this is interesting: about as expected, except that the ggml-org and the Unsloth Q8_0 should basically be the exact same quants. What's going on? I reran them in reverse order with more repetitions, spaced a few minutes apart. This brought the numbers much closer, which is good enough for me:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-20b-Q8_0.gguf -r 20
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1411.20 ± 62.65 |
gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.68 ± 0.01 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -r 20
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1429.01 ± 12.75 |
gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.96 ± 0.01 |
Now, as for how these different quants perform quality-wise? That needs to be tested empirically by someone who actually cares about using these models.
Truthfully, I'm not very satisfied with the common PPL and KLD numbers used to profile quant quality, as I feel like they only very roughly represent actual downstream task performance. I'm much more of a functional-eval guy. For my Shisa V2 405B model, for example, I ran JA MT-Bench as a proxy for my use cases and, surprisingly, IQ3_M, Q4_K_M, and Q8_0 (and my preferred W8A8-INT8 for production serving) were all pretty close: shisa-ai/shisa-v2-llama3.1-405b-GGUF · Hugging Face
Of course, this says nothing about the actual quality differences for a tiny-parameter MoE; I just feel like it might be a useful (primarily methodological) data point for anyone who does want to embark on quality testing.
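For anyone who does want to start with the standard PPL/KLD numbers anyway, llama.cpp's `llama-perplexity` tool can compute KL divergence against a reference model's saved logits. This is only a sketch from memory: the test-text file is a placeholder and the exact flags are worth confirming with `llama-perplexity --help`.

```
# Sketch: two-pass KLD comparison (flags as I remember them; verify with --help).
# 1) Save reference logits from the baseline build:
./build/bin/llama-perplexity -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf \
    -f wiki.test.raw --kl-divergence-base gpt-oss-20b.kld
# 2) Score a quant against those logits:
./build/bin/llama-perplexity -m gpt-oss-20b-Q4_K_M.gguf \
    -f wiki.test.raw --kl-divergence-base gpt-oss-20b.kld --kl-divergence
```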
Thanks for all the responses!
With a little experimenting, I got the 120B model working just fine on my Arch Linux installation on the AI Max+ 395.
Here’s how I got llama.cpp running on Arch Linux and a vanilla kernel:
- Install `vulkan-radeon` and the `llama.cpp-vulkan-git` AUR package, plus `amdgpu_top` for troubleshooting (a command sketch follows after these steps).
- Download a model (I was lazy and decided to reuse the one that I had already downloaded via LM Studio, `bartowski/openai_gpt-oss-120b-GGUF`).
- Run the shell command line:
llama-server \
-m path/to/openai_gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
-c 16384 -t 16 -ngl 99
and point the browser to port 8080.
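For reference, a minimal sketch of the package installs from the first step, assuming the `paru` AUR helper (swap in `yay` or plain `makepkg` if you prefer):

```
# Sketch: Arch package setup for the Vulkan backend (paru assumed for the AUR).
sudo pacman -S --needed vulkan-radeon   # Mesa RADV Vulkan driver
paru -S llama.cpp-vulkan-git            # llama.cpp built with the Vulkan backend (AUR)
paru -S amdgpu_top                      # GPU / GTT usage monitoring for troubleshooting
```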
Works even with the `linux-lts` kernel (currently 6.12.43).
If I interpret the output of `amdgpu_top` correctly, then upstream llama.cpp allocates much smaller chunks than the fork does, so raising `ttm.pages_limit` might not even be needed here.
I couldn’t get it to work with more than 16K tokens of context yet (8K with AMDVLK), but it’s more than enough for normal inference!
Inference speed is really snappy, definitely in the ballpark of 20–30 tokens/s that people have stated.
- Set the `ttm.pages_limit` kernel command-line parameter to a high enough value (e.g. `25165824` for 96 GiB); see the sketch below.
- There's also `amdgpu.gttsize`. The kernel docs say that it's deprecated though, and it doesn't even do anything if you have already set `ttm.pages_limit`.
- The details are in `drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c` in the kernel sources.
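A minimal sketch of what that looks like on a GRUB-based Arch install (96 GiB / 4 KiB pages = 25165824; adjust the value and the bootloader steps for your setup):

```
# Sketch: raise the GTT page limit via the kernel command line (GRUB assumed).
# 96 GiB / 4 KiB page size = 25165824 pages.
# In /etc/default/grub, append the parameter to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet ttm.pages_limit=25165824"
sudo grub-mkconfig -o /boot/grub/grub.cfg        # regenerate the config, then reboot
grep -o 'ttm.pages_limit=[0-9]*' /proc/cmdline   # verify after reboot
```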
Awesome that you got it working!
- `amdgpu.gttsize` is going away and `ttm` is the "right way" to set things now, but there may be legacy userland stuff that still references it, so for now it doesn't hurt to add it.
- `amd_iommu=off` benchmarks to ~6% better memory reads, but only about 2% better tg performance. Of course, if you need virtualization/passthrough, it's a moot point.
- The `accelerator-performance` profile with `tuned` sets EPP to 0 (performance), the governor to performance, and locks the CPU to low C-states for lower latency. It also gives an MBW read bump but, notably, increases pp perf by about 5%.
- `llama-server` has prompt caching, but only if you submit `cache_prompt=true` with your request. I don't know if most inference front-ends do that. This will actually be your biggest performance boost in real-world usage, especially since the Vulkan backends typically tend to have lower pp. (Example request below.)
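For example, hitting llama-server's /completion endpoint directly with the prompt cache explicitly enabled looks something like this (port and prompt are just placeholders):

```
# Sketch: explicitly enable llama-server's prompt cache for a request.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize the following code...", "n_predict": 256, "cache_prompt": true}'
```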
BTW, for those looking for the simplest way on Linux or Windows to try out llama.cpp with the ROCm backend, I can recommend https://lemonade-server.ai/ - while Vulkan rarely crashes, and some models work fine (like gpt-oss and most dense models), I've noticed a lot of "GGGG" output errors in Vulkan recently when testing some of the new Chinese MoE architectures.