the gpt-oss-120b model is also misnamed since it’s 117B parameters
117 vs. 120 is still baker's-dozen or rounding-error territory in my book.
Being off by a factor of 4 is way past the line though.
For those who have had a chance to test gpt-oss-120b on the AI Max+ 395 (128 GB), I’d greatly appreciate it if you could share the largest context size you’ve successfully run, and how different context lengths have influenced the speeds.
I'm particularly interested in this for coding use cases, where large context windows can make a big difference.
I asked 5pro to look up why it’s named F16:
Short version: they renamed it “F16” for compatibility/visibility, but it’s still the native MXFP4‑MOE build.
On the Unsloth model page they explicitly say: "This is the MXFP4_MOE quant, now called F16, with our fixes."
In Unsloth's own discussion, a team member explains why: "We named it F16 so it can appear on the HF repo page but yes, it's mostly the same." (They'd briefly published it as `...-MXFP4.gguf`; the SHA256 matched after the rename.)
What's going on under the hood:
GPT-OSS was trained with native MXFP4 precision for the MoE expert matmuls. So even the "F16" GGUF isn't a classic full-FP16 dump; it's the native format (MXFP4 for the MoE) plus other parts in higher precision. That's why the so-called "F16" file is ~65.4 GB for 120B, way smaller than a true FP16 of all 117B weights would be.
Early on, many front-ends and the HF UI didn't yet special-case the new MXFP4/MOE tensor type. Some tools even key logic off the filename; one user noted that renaming to `...f16.gguf` made their frontend stop erroring. Hence the pragmatic "F16" label. llama.cpp added support for native MXFP4 across backends around Aug 5, 2025, which is why you'll also see the ggml/llama.cpp announcement and new quant types referenced.
So the "F16" tag on Unsloth's GPT-OSS GGUFs is a compatibility alias, not a statement that the MoE experts are stored as pure FP16. If you want the "native" OpenAI-style build, use their "F16" file; just interpret "F16" here as "native MXFP4-MOE + Unsloth chat-template/precision fixes," not "fully unquantized FP16 everywhere."
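To put numbers on "way smaller" (my own back-of-envelope, not part of the quoted answer), using the 116.83B parameter count that llama-bench reports further down:

```
# If all ~117B weights were true FP16 (~2 bytes/param):
python3 -c "print(116.83e9 * 2 / 1e9, 'GB')"         # ~234 GB
# If everything were MXFP4 (~4.25 bits/weight: 4-bit values + a shared scale per 32 elements):
python3 -c "print(116.83e9 * 4.25 / 8 / 1e9, 'GB')"  # ~62 GB -- in the ballpark of the actual
                                                      #  ~60-65 GB files, since the MoE experts
                                                      #  dominate the parameter count
```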
I wonder when we'll actually see a Medusa RDNA 3.5 (or 4.0?) APU motherboard with DDR6 ship. 2027 or 2028?
It's like DeWalt and their 'MAX 20V' batteries that are actually 18V nominal; '120b' just sounds more marketing-y than '117b'.
A bit more of a follow-up, since I was curious why there were reports of a big performance (speed) difference between unsloth/gpt-oss-120b-GGUF (61 GiB) and ggml-org/gpt-oss-120b-GGUF (60 GiB). The ggml-org model runs significantly faster:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 684.08 ± 5.42 |
gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 41.52 ± 0.03 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-120b-F16.gguf
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 642.59 ± 28.32 |
gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 29.72 ± 0.03 |
These both claim to be MXFP4 quants, and they’re very close in size, so what’s going on?
The easiest way to spot the difference is using the new HF file info viewer:
There are a couple of interesting differences (some might be artifacts of the file viewer; some, like the tokenizer, may cause inference-quality differences that need to be looked into), but the most important part is looking at the Tensors. Here you can see that the ggml-org model uses Q8_0 for the embedding-layer weights vs F16 for Unsloth (this runs on CPU). It's a similar story for the attention and output layers.
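If you'd rather check locally instead of in the HF viewer, the `gguf` Python package (from llama.cpp's gguf-py) ships a `gguf-dump` CLI that lists every tensor with its type. A rough sketch, using the local filenames from the benchmarks above and an illustrative grep pattern:

```
# Sketch: compare per-tensor dtypes of the two 120B GGUFs locally.
pip install gguf
gguf-dump gpt-oss-120b-F16.gguf > unsloth.txt
gguf-dump ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf > ggml-org.txt
# The embedding / output / attention tensors are where the two builds differ:
grep -E 'token_embd|output|attn_' unsloth.txt ggml-org.txt
```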
Life's too short to download a lot of 120B models I'm never going to use, but I was curious enough to grab a few of the 20Bs just to do a sanity check. There's definitely a speed hit, so how much faster do you actually get by quantizing those layers? (Note: the standard deviation on the pp512 numbers was huge for some reason, but they were all in the same ballpark, so I'm just focusing on tg128.)
model | size | test | t/s |
---|---|---|---|
ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
So this is interesting: about as expected, except that the ggml-org and the Unsloth Q8_0 should basically be the exact same quants. What's going on? I reran them in reverse order with more repetitions, spaced a few minutes apart. This brought the numbers much closer, which is good enough for me:
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-20b-Q8_0.gguf -r 20
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1411.20 ± 62.65 |
gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.68 ± 0.01 |
❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -r 20
model | size | params | backend | ngl | fa | test | t/s |
---|---|---|---|---|---|---|---|
gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 1429.01 ± 12.75 |
gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 61.96 ± 0.01 |
Now, as for how these different quants perform quality-wise? That needs to be tested empirically by someone who actually cares about using these models.
Truthfully, I'm not very satisfied with the common PPL and KLD numbers used to profile quant quality, as I feel like they only very roughly represent actual downstream task performance. I'm much more of a functional-eval guy. For my Shisa V2 405B model, for example, I ran JA MT-Bench as a proxy for my use cases and, surprisingly, IQ3_M, Q4_K_M, and Q8_0 (and my preferred W8A8-INT8 for production serving) were all pretty close: shisa-ai/shisa-v2-llama3.1-405b-GGUF · Hugging Face
Of course, this says nothing about the actual quality differences for a tiny-parameter MoE; I just feel like it might be a useful (primarily methodological) data point for anyone who does want to embark on quality testing.
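For anyone who does want to start with the standard PPL/KLD numbers anyway, llama.cpp's `llama-perplexity` tool can compute KL divergence against a reference model's saved logits. This is only a sketch from memory: the test-text file is a placeholder and the exact flags are worth confirming with `llama-perplexity --help`.

```
# Sketch: two-pass KLD comparison (flags as I remember them; verify with --help).
# 1) Save reference logits from the baseline build:
./build/bin/llama-perplexity -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf \
    -f wiki.test.raw --kl-divergence-base gpt-oss-20b.kld
# 2) Score a quant against those logits:
./build/bin/llama-perplexity -m gpt-oss-20b-Q4_K_M.gguf \
    -f wiki.test.raw --kl-divergence-base gpt-oss-20b.kld --kl-divergence
```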
Thanks for all the responses!
With a little experimenting, I got the 120B model working just fine on my Arch Linux installation on the AI Max+ 395.
Here’s how I got llama.cpp running on Arch Linux and a vanilla kernel:
- Install `vulkan-radeon` and the `llama.cpp-vulkan-git` AUR package, plus `amdgpu_top` for troubleshooting (a command sketch follows after these steps).
- Download a model (I was lazy and decided to reuse the one that I had already downloaded via LM Studio, `bartowski/openai_gpt-oss-120b-GGUF`).
- Run the shell command line:
llama-server \
-m path/to/openai_gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
-c 16384 -t 16 -ngl 99
and point the browser to port 8080.
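For reference, a minimal sketch of the package installs from the first step, assuming the `paru` AUR helper (swap in `yay` or plain `makepkg` if you prefer):

```
# Sketch: Arch package setup for the Vulkan backend (paru assumed for the AUR).
sudo pacman -S --needed vulkan-radeon   # Mesa RADV Vulkan driver
paru -S llama.cpp-vulkan-git            # llama.cpp built with the Vulkan backend (AUR)
paru -S amdgpu_top                      # GPU / GTT usage monitoring for troubleshooting
```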
Works even with the `linux-lts` kernel (currently 6.12.43).
If I interpret the output of `amdgpu_top` correctly, then upstream llama.cpp allocates much smaller chunks than the fork does, so raising `ttm.pages_limit` might not even be needed here.
I couldn’t get it to work with more than 16K tokens of context yet (8K with AMDVLK), but it’s more than enough for normal inference!
Inference speed is really snappy, definitely in the ballpark of 20–30 tokens/s that people have stated.
- Set the `ttm.pages_limit` kernel command-line parameter to a high enough value (e.g. `25165824` for 96 GiB); see the sketch below.
- There's also `amdgpu.gttsize`. The kernel docs say that it's deprecated though, and it doesn't even do anything if you have already set `ttm.pages_limit`.
- The details are in `drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c` in the kernel sources.
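A minimal sketch of what that looks like on a GRUB-based Arch install (96 GiB / 4 KiB pages = 25165824; adjust the value and the bootloader steps for your setup):

```
# Sketch: raise the GTT page limit via the kernel command line (GRUB assumed).
# 96 GiB / 4 KiB page size = 25165824 pages.
# In /etc/default/grub, append the parameter to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet ttm.pages_limit=25165824"
sudo grub-mkconfig -o /boot/grub/grub.cfg        # regenerate the config, then reboot
grep -o 'ttm.pages_limit=[0-9]*' /proc/cmdline   # verify after reboot
```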
Awesome that you got it working!
- `amdgpu.gttsize` is going away and `ttm` is the "right way" to set things now, but there may be legacy userland stuff that still references it, so for now it doesn't hurt to add it.
- `amd_iommu=off` benchmarks to ~6% better memory reads, but only about 2% better tg performance. Of course, if you need virtualization/passthrough, it's a moot point.
- The `accelerator-performance` profile with `tuned` sets EPP to 0 (performance), the governor to performance, and locks the CPU to low C-states for lower latency. It also gives an MBW read bump but, notably, increases pp perf by about 5%.
- `llama-server` has prompt caching, but only if you submit `cache_prompt=true` with your request. I don't know if most inference front-ends do that. This will actually be your biggest performance boost in real-world usage, especially since the Vulkan backends typically tend to have lower pp. (Example request below.)
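For example, hitting llama-server's /completion endpoint directly with the prompt cache explicitly enabled looks something like this (port and prompt are just placeholders):

```
# Sketch: explicitly enable llama-server's prompt cache for a request.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize the following code...", "n_predict": 256, "cache_prompt": true}'
```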
BTW, for those looking for the simplest way on Linux or Windows to try out llama.cpp with the ROCm backend, I can recommend https://lemonade-server.ai/ - while Vulkan rarely crashes, and some models work fine (like gpt-oss and most dense models), I've noticed a lot of "GGGG" output errors in Vulkan recently when testing some of the new Chinese MoE architectures.