Will the AI Max+ 395 (128GB) be able to run gpt-oss-120b?

the gpt-oss-120b model is also misnamed since it’s 117B parameters

117 vs. 120 is still baker's-dozen or rounding-error territory in my books.
Being off by a factor of 4 is way past the line though.

For those who have had a chance to test gpt-oss-120b on the AI Max+ 395 (128 GB), I’d greatly appreciate it if you could share the largest context size you’ve successfully run, and how different context lengths have influenced the speeds.
I’m particularly interested in this for coding use cases, where large context windows can make a big difference.
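If you do get a chance to test, a rough llama-bench sketch like the one below (the model filename is just an example) is one way to see how speed falls off with context: vary the prompt length for pp scaling, and use -pg for a combined long-prompt-plus-generation run.

    # pp at several prompt lengths, then a 16K-prompt + 128-token combined run
    llama-bench -fa 1 -m gpt-oss-120b-F16.gguf \
      -p 512,4096,16384 -n 128 \
      -pg 16384,128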

1 Like

I asked 5pro to look up why it’s named F16:

Short version: they renamed it “F16” for compatibility/visibility, but it’s still the native MXFP4‑MOE build.

  • On the Unsloth model page they explicitly say: “This is the MXFP4_MOE quant, now called F16, with our fixes.” Hugging Face

  • In Unsloth’s own discussion, a team member explains why: “We named it F16 so it can appear on the HF repo page but yes, it’s mostly the same.” (they’d briefly published it as ...-MXFP4.gguf; the SHA256 matched after the rename). Hugging Face

What’s going on under the hood

  • GPT‑OSS was trained with native MXFP4 precision for the MoE expert matmuls. So even the “F16” GGUF isn’t a classic full‑FP16 dump; it’s the native format (MXFP4 for MoE) plus other parts in higher precision. That’s why the so‑called “F16” file is ~65.4 GB for 120B—way smaller than a true FP16 of all 117B weights would be. Hugging Face+1

  • Early on, many front‑ends and the HF UI didn’t yet special‑case the new MXFP4/MOE tensor type. Some tools even key logic off the filename; one user noted renaming to ...f16.gguf made their frontend stop erroring. Hence the pragmatic “F16” label. Hugging Face

  • llama.cpp added support for native MXFP4 across backends around Aug 5, 2025, which is why you’ll also see the ggml/llama.cpp announcement and new quant types referenced. GitHub

So the “F16” tag on Unsloth’s GPT‑OSS GGUFs is a compatibility alias, not a statement that the MoE experts are stored as pure FP16. If you want the “native” OpenAI‑style build, use their “F16” file—just interpret “F16” here as “native MXFP4‑MOE + Unsloth chat‑template/precision fixes,” not “fully unquantized FP16 everywhere.” Hugging Face+1
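A quick back-of-the-envelope check (using the ~117B parameter count) shows why the ~65 GB “F16” file can’t be a full FP16 dump:

    # full FP16 for every weight: 117e9 params * 2 bytes ≈ 234 GB
    # actual "F16" GGUF:          ≈ 65.4 GB  ≈ 4.5 bits/weight on average
    # which lines up with MXFP4 (~4.25 bits/weight) for the MoE experts
    # plus embeddings/attention/output kept in higher precision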

3 Likes

I wonder when we’ll actually see a Medusa RDNA 3.5 (or 4.0?) APU motherboard with DDR6 ship. 2027 or 2028?

It’s like DeWalt and their ‘MAX 20V’ batteries that are actually 18V nominal; 120b just sounds more marketing-y than 117b :smiley:

2 Likes

A bit more of a follow-up, since I was curious why there were reports of a big performance (speed) difference between unsloth/gpt-oss-120b-GGUF (61 GiB) and ggml-org/gpt-oss-120b-GGUF (60 GiB). The ggml-org model runs significantly faster:

❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf

| model                | size      | params   | backend | ngl | fa | test  | t/s            |
| -------------------- | --------- | -------- | ------- | --- | -- | ----- | -------------- |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 1  | pp512 | 684.08 ± 5.42  |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 1  | tg128 | 41.52 ± 0.03   |

❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-120b-F16.gguf

| model          | size      | params   | backend | ngl | fa | test  | t/s            |
| -------------- | --------- | -------- | ------- | --- | -- | ----- | -------------- |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm    | 99  | 1  | pp512 | 642.59 ± 28.32 |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm    | 99  | 1  | tg128 | 29.72 ± 0.03   |

These both claim to be MXFP4 quants, and they’re very close in size, so what’s going on?

The easiest way to spot the difference is using the new HF file info viewer:

There are a couple of interesting differences (some might be artifacts of the file viewer; others, like the tokenizer, may cause inference-quality differences that need to be looked into), but the most important part is the Tensors listing. There you can see that the ggml-org model uses Q8_0 for the embedding layer weights vs F16 for Unsloth (this layer runs on the CPU). A similar difference shows up in the attention and output layers.
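If you’d rather check this locally than through the HF viewer, the gguf Python package (llama.cpp’s gguf-py) ships a gguf-dump script that lists every tensor with its type; the filenames here are just the ones from the benchmarks above:

    pip install gguf
    # compare embedding / attention / output tensor types between the two builds
    gguf-dump gpt-oss-120b-F16.gguf | grep -E 'token_embd|attn|output'
    gguf-dump ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
      | grep -E 'token_embd|attn|output'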

Life’s too short to download a lot of 120B models I’m never going to use, but I was curious enough to grab a few of the 20b’s just to do a sanity check. There’s definitely a speed hit, so how much faster do you actually get by quantizing those layers? (Note: the standard deviation on pp512 was huge for some reason, but all runs were in the same ballpark, so I’m just focusing on tg128.)

| model                          | size      | test  | t/s          |
| ------------------------------ | --------- | ----- | ------------ |
| ggml-org gpt-oss-20b MXFP4     | 11.27 GiB | tg128 | 62.15 ± 0.01 |
| unsloth gpt-oss-20b F16        | 12.83 GiB | tg128 | 42.93 ± 0.00 |
| unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
| unsloth gpt-oss-20b Q8_0       | 11.27 GiB | tg128 | 59.06 ± 0.00 |
| unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
| unsloth gpt-oss-20b Q4_K_M     | 10.81 GiB | tg128 | 62.21 ± 0.01 |

So this is interesting: about as expected, except that the ggml-org MXFP4 and the Unsloth Q8_0 should be basically the exact same quants. What’s going on? I reran them in reverse order with more repetitions, spaced a few minutes apart.

This brought the numbers much closer, good enough for me:

❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m gpt-oss-20b-Q8_0.gguf -r 20

| model           | size      | params  | backend | ngl | fa | test  | t/s             |
| --------------- | --------- | ------- | ------- | --- | -- | ----- | --------------- |
| gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | pp512 | 1411.20 ± 62.65 |
| gpt-oss ?B Q8_0 | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | tg128 | 61.68 ± 0.01    |

❯ ~/llama.cpp-lhl/build/bin/llama-bench -fa 1 -m ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -r 20

| model                | size      | params  | backend | ngl | fa | test  | t/s             |
| -------------------- | --------- | ------- | ------- | --- | -- | ----- | --------------- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | pp512 | 1429.01 ± 12.75 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 1  | tg128 | 61.96 ± 0.01    |

Now, as to the quality these different quants actually deliver? That needs to be tested empirically by someone who actually cares about using these models.

Truthfully, I’m not very satisfied with the common PPL and KLD numbers used to profile quant quality, as I feel they only very roughly represent actual downstream task performance. I’m much more of a functional-eval guy. For my Shisa V2 405B model, for example, I ran JA MT-Bench as a proxy for my use cases, and surprisingly, IQ3_M, Q4_K_M, and Q8_0 (and my preferred W8A8-INT8 for production serving) were all pretty close: shisa-ai/shisa-v2-llama3.1-405b-GGUF · Hugging Face

Of course, this says nothing about the actual quality differences for a tiny-parameter MoE; I just feel it might be a useful (primarily methodological) data point for anyone who does want to embark on quality testing.
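For anyone who does want to start with the usual numbers anyway, the standard PPL/KLD workflow with llama.cpp’s llama-perplexity looks roughly like this (file names are placeholders):

    # 1) save reference logits from the highest-precision GGUF you have
    llama-perplexity -m gpt-oss-120b-F16.gguf -f calibration.txt \
      --kl-divergence-base gpt-oss-120b-f16.logits
    # 2) score a quant against that baseline (reports PPL and KLD)
    llama-perplexity -m gpt-oss-120b-some-quant.gguf -f calibration.txt \
      --kl-divergence-base gpt-oss-120b-f16.logits --kl-divergence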

5 Likes

Thanks for all the responses!

With a little experimenting, I got the 120B model working just fine on my Arch Linux installation on the AI Max+ 395.

Getting llama.cpp working

Here’s how I got llama.cpp running on Arch Linux and a vanilla kernel:

  1. Install vulkan-radeon and llama.cpp-vulkan-git (from the AUR).
    Also amdgpu_top for troubleshooting.

  2. Download a model (I was lazy and decided to reuse the one I had already downloaded via LM Studio, bartowski/openai_gpt-oss-120b-GGUF; see the download example after this list).

  3. Run the shell command line:

    llama-server \
      -m path/to/openai_gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
      -c 16384 -t 16 -ngl 99
    

    and point the browser to port 8080.
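If you don’t already have a copy lying around from LM Studio for step 2, one way to fetch just the Q4_K_M shards is huggingface-cli; the include pattern and target directory here are assumptions based on the file name above:

    # download only the Q4_K_M split files from the repo mentioned in step 2
    huggingface-cli download bartowski/openai_gpt-oss-120b-GGUF \
      --include "*Q4_K_M*" --local-dir ./models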

Works even with the linux-lts kernel (currently 6.12.43).

If I interpret the output of amdgpu_top correctly, then upstream Llama.cpp allocates much smaller chunks than the fork does, so raising ttm.pages_limit might not even be needed here.

I haven’t been able to get it to work with more than 16K tokens of context yet (only 8K with AMDVLK), but that’s more than enough for normal inference!

Inference speed is really snappy, definitely in the ballpark of the 20–30 tokens/s that people have stated.

Lessons learned

  • This whole LLM field can get incredibly complex and overwhelming if you’re trying to accomplish a specific thing and it’s not working.
  • There seem to be two major (semi-)competing APIs: Vulkan and ROCm. Vulkan seems to be simple, slow, and good enough (for me), so I decided to go that route and ignore ROCm altogether.
  • LM Studio, the universally praised chat engine and frontend, doesn’t seem to support Strix Halo on Linux yet. It doesn’t recognize the GPU at all. Is this yet another Vulkan vs. ROCm thing? I don’t know.
  • Llama.cpp is the only engine that worked essentially out of the box, with only very little tweaking.
  • For performance reasons, some people have suggested specialized forks, e.g. ik-llama.cpp. However, that fork hit weird assertions for me and crashed reproducibly. The fork got gpt-oss support only one week ago though, so I fully expect it to get better.
  • The ik-llama fork also tried to allocate all the memory for the tensors in one fell swoop, which then hit some kind of cap and crashed. I was able to fix this by setting the ttm.pages_limit kernel command-line parameter to a high enough value (e.g. 25165824 for 96 GiB; see the worked-out note after this list).
  • People have been suggesting setting amdgpu.gttsize. The kernel docs say it’s deprecated though, and it doesn’t even do anything if you have already set ttm.pages_limit.
  • Some even suggest turning off the IOMMU via another kernel parameter, but I decided not to.
  • LLMs are surprisingly skilled at troubleshooting LLMs.
    Once I started to hit walls, I bounced my findings and problems off ChatGPT 5 and I think that it greatly helped, even though on occasion it still hallucinates configuration settings that don’t exist.
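The note referenced above: ttm.pages_limit is counted in 4 KiB pages, so the value is just the desired GTT cap divided by the page size. Worked out for the 96 GiB figure from the bullet above:

    # 96 GiB / 4 KiB pages = 96 * 1024 * 1024 * 1024 / 4096 = 25165824
    # appended to the kernel command line:
    ttm.pages_limit=25165824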

See also

7 Likes

Awesome that you got it working!

  • amdgpu.gttsize is going away and ttm is the “right way” to set things now, but there may be legacy userland stuff that still references it, so for now it doesn’t hurt to add it
  • amd_iommu=off benchmarks at about 6% better memory reads, but only about 2% better tg performance. Of course, if you need virtualization/passthrough, it’s a moot point
  • Setting the accelerator-performance profile with tuned (this sets EPP to 0 (performance), the governor to performance, and locks the CPU to low C-states for lower latency) also gives an MBW read bump and, notably, increases pp perf by about 5%.
  • If an older kernel works for you, great, stick with it. If you run into problems, the latest Linux kernel and linux-firmware may be required to fix them
  • For long context, llama-server has prompt caching, but only if you submit cache_prompt=true with your request (see the example after this list). I don’t know whether most inference front-ends do that. This will actually be your biggest performance boost in real-world usage, especially since the Vulkan backends typically have lower pp.
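The example referenced in the prompt-caching bullet: a minimal sketch of calling llama-server’s /completion endpoint directly with cache_prompt enabled (port, prompt, and n_predict are placeholders), so repeated requests that share the same long prefix skip re-running pp on it:

    curl -s http://localhost:8080/completion \
      -H 'Content-Type: application/json' \
      -d '{
            "prompt": "<your long coding context + question>",
            "n_predict": 256,
            "cache_prompt": true
          }'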

BTW, for those looking for the simplest way to try out llama.cpp with the ROCm backend on Linux or Windows, I can recommend https://lemonade-server.ai/ - while Vulkan rarely crashes and some models work fine with it (gpt-oss, most dense models), I’ve noticed a lot of “GGGG” output errors in Vulkan recently when testing some of the new Chinese MoE architectures.

6 Likes