Will the AI Max+ 395 (128GB) be able to run gpt-oss-120b?

OpenAI just dropped a pair of open models. However, I’m failing to understand what the system requirements for e.g. gpt-oss-120b are. The documentation mentions an H100 in the description, but I don’t really know what that means, how to compare it to the AI Max+ 395 (128GB), or how much memory 120 billion parameters are going to require.

@catastrophic has written about a model with a similar parameter count, Llama 4 Scout 17B (109B total vs. OpenAI’s 117B):

But maybe I have misunderstood or overlooked something, and I figure that gpt-oss-120b might require space for another 18 billion params compared to the Llama 4 one.
So what I’d like to understand is:

  1. Can I expect gpt-oss-120b to run somewhat decently on the 128GB Desktop?
  2. How much memory does the model likely require during inference?
1 Like

It’s a complex question. I’m downloading it right now. Speed-wise, it all depends; you could have use cases where you’re fine with 1 t/s. What I can tell right now is that the model is released with what they call MXFP4 quantization, so the model is 63.39 GB on disk; it would fit on the Framework Desktop 128GB version for sure. It uses a mixture of experts with 5.1B active parameters, so it should be quite fast for its size.

1 Like

I have done some easy testing with lm studio. RTX 3090 with 128GB DDR4 3600 Mhz, Ryzen 5600X.

GPT OSS 120B MXFP4 (63.39GB), 8-9 t/s

Llama 4 scout Q6_K (90.51GB) 3-3.5 t/s

1 Like

Thanks! According to this comparison, the 3090 and the Ryzen should be roughly in the same ballpark. So your data is a good estimate, I guess.

gpt-oss-120b does have more parameters; however, OpenAI officially provides it in a format where most of the parameters take up 4.25 bits each (the extra 0.25 comes from a shared scale factor for every block of 32 parameters).

By contrast, Llama 4 stores each parameter in 16 bits; however, in the message you quoted, Framework is using a “Q6” quant (which means the weights have been quantized/compressed down to slightly over 6 bits per parameter).

So the smaller size of each parameter means that the overall memory requirements for gpt-oss-120b are smaller than for Llama 4 Scout Q6, despite the higher total parameter count.
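To sanity-check those numbers, here is a back-of-envelope sketch. The bits-per-weight figures are nominal rates for MXFP4 and Q6_K; the actual files run slightly larger because some tensors (embeddings, norms) stay in higher precision.

```python
def approx_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Rough weight-only footprint in GB (ignores KV cache and activations)."""
    return n_params * bits_per_param / 8 / 1e9

# MXFP4: 4-bit weights plus one shared scale per block of 32 -> ~4.25 bits/weight
gpt_oss_gb = approx_weight_gb(117e9, 4.25)
# Q6_K: roughly 6.5625 bits/weight
scout_q6_gb = approx_weight_gb(109e9, 6.5625)

print(f"gpt-oss-120b @ MXFP4: ~{gpt_oss_gb:.1f} GB")  # ~62.2 GB, close to the 63.39 GB file
print(f"Scout @ Q6_K: ~{scout_q6_gb:.1f} GB")         # ~89.4 GB, close to the 90.51 GB file
```

Both estimates land within a couple of GB of the reported on-disk sizes, which is why the “bigger” model ends up needing less memory.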

Yes. It should be quite fast.

It is a MoE (Mixture of Experts), which means that even though it has a lot of parameters, it only activates and processes the most relevant ones for each token (but it can change which experts are active very frequently, which is why they all need to be in RAM for good performance). As a result, it actually only processes about 5.1 billion parameters per token, which should make it quite speedy.

The model itself is slightly over 60 GB, although there is also memory needed for context length and other overhead. OpenAI advertises it as well suited for use cases where you have 80 GB VRAM available or more.
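One way to see why the MoE design helps: token generation is largely bound by how many bytes must be streamed from memory per token, which for a MoE is roughly just the active parameters. A rough upper-bound sketch (the bandwidth figure is the nominal spec for the AI Max+ 395 class; real throughput lands well below the ceiling due to attention/KV traffic and other overheads):

```python
def tg_ceiling_tps(bandwidth_gbps: float, active_params: float, bits_per_param: float) -> float:
    """Upper bound on tokens/s if generation were purely limited by streaming
    the active weights from memory once per token."""
    bytes_per_token = active_params * bits_per_param / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

# ~5.1B active params at ~4.25 bits/weight, ~256 GB/s memory bandwidth
print(f"{tg_ceiling_tps(256, 5.1e9, 4.25):.0f} t/s ceiling")  # ~94 t/s
```

Measured results later in this thread are around 30-33 t/s, i.e. roughly a third of this theoretical ceiling, which is in the normal range for real-world inference.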

2 Likes

Thank you for these insights @Kyle_Reis, much appreciated!

@Claudia the model works fine on Ryzen AI 300 with 128GB RAM; according to amdgpu_top, it allocates 55.5GB of GTT. It’d probably run on a 64GB Linux machine, leaving about 8GB for the OS and everything else. At least with short context windows.

Edit: actually, 64GB may not be enough, as LM Studio is using another 12GB of RAM.

1 Like

Thanks!

Out of curiosity, how come it’s not affected by this problem?

It is affected, but (TL;DR: it works OK-ish today with the LM Studio / llama.cpp Vulkan backend):

  • LM Studio has an option to use llama.cpp’s Vulkan backend. It implements portable offload to GPUs and is supposedly not as performant as ROCm.
  • ollama can be forced to use ROCm with export HSA_OVERRIDE_GFX_VERSION="11.0.1", which tells ROCm to use a codepath for a different GPU that is supported. YMMV, but it tends to work to some extent. I do get GPU resets / system freezes with this, though, when trying to load larger models (over 10GB), but it’s kind of random. Some people report that this GPU faking works fine for them; it might be me not using the AMD-provided amdgpu kernel module and going with the mainline kernel. Also, sometimes ollama decides that there’s not enough VRAM and loads on the CPU anyway; that seems to depend on whether it’s using llama.cpp or their own implementation in Go.
  • If all that fails, it’s possible to do CPU-only compute. Prompt processing is like 5-10x slower, since it’s compute-bound. Token generation, though, is about the same speed, since it’s limited by RAM bandwidth rather than compute. This is a well-known secret: you’d expect a powerful integrated GPU or eventual NPU support to make LM Studio run much faster, but no, it’s still bottlenecked on RAM bandwidth. Compute does matter for prompt processing, and AFAIK also for image generation with Stable Diffusion and the like, and also if you manage to get batching going and have a use case for batch processing.
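For reference, the ROCm override from the second bullet is just an environment variable set before launching ollama (the model tag here is illustrative; adjust for your install):

```shell
# Tell ROCm to take the code path for a supported gfx target.
# Unsupported combinations can crash or silently fall back to CPU - YMMV.
export HSA_OVERRIDE_GFX_VERSION="11.0.1"
ollama run gpt-oss:120b
```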
1 Like

The internal test at AMD reported ~30 tps for the 120B model. I saw a few posts on X reporting similar results. So… :crossed_fingers:

2 Likes

That matches up nicely with the 11 tps I’m getting on the AI 370; the Max+ 395 has 3x the bandwidth.

2 Likes

AMD’s definition of “supported” is a little different from what you might think.
Supported hardware is fully tested on every official release and with all the libraries.

It does not mean that other hardware cannot work.
llama.cpp can work with HIP alone, without any extra library; it can use rocBLAS/hipBLAS for some floating-point compute, but that is not used (anymore?) for quantized models.
It is open source, and most of the support already exists in the code, “only” not tested/enabled by AMD. For example, Fedora has over the last few months done some (hard) work to build ROCm for the distribution, and enables more hardware than AMD does:

With Fedora 42, gfx1103, gfx1150, gfx1151, and gfx1152 are enabled.

Now, to be fair, on the FW16 (without dGPU) I can build and use a HIP backend for llama.cpp, but with large models it has some “random” crashes (the same happens with the Vulkan backend). I don’t know if it’s something with the FW16 (BIOS?) or the gfx1103. There are also many crashes with hardware video decoding, plus artifacts…

Using HSA_OVERRIDE_GFX_VERSION="11.0.1" has some problems (in my view). It may work, but there are real differences between these GPUs; for example, the gfx1103 differs only in its L3 cache size, yet because of that the block sizes needed for good performance are really different.

Something around 30-35 tps for the Max 395 makes sense. Waiting for the payment and shipping notice (Batch 3)….

Here is my result from the Ryzen AI 370 HX:
openai/gpt-oss-120b (“MXFP4”, 63.39 GB): 13.0 to 14.5 tps

Software: LM Studio 0.3.22b1 on Lubuntu 24.04.x (Linux) with the Vulkan backend
Hardware: AceMagic F3A (AMD Ryzen AI 370 HX) with 96 GB of RAM (via BIOS: 48 GB assigned to the iGPU; additional ~24 GB are available as GTT shared between the CPU and the iGPU… and the 120B model requires most of that GTT memory on top of the assigned VRAM).

1 Like

GPT-oss models running under Vulkan: Benchmark Framework Desktop Mainboard and 4-node cluster · Issue #21 · geerlingguy/ollama-benchmark · GitHub

20b: 45 t/s single node - 97W
120b: 33 t/s single node - 98W
120b: 24 t/s clustered (4x) - 138W

7 Likes

Here’s my current pp512/tg128 llama.cpp results for 120B:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

| model          | size      | params   | backend    | ngl | fa | test  | t/s           |
| -------------- | --------- | -------- | ---------- | --- | -- | ----- | ------------- |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | Vulkan,RPC | 99  | 1  | pp512 | 448.94 ± 3.75 |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | Vulkan,RPC | 99  | 1  | tg128 | 33.06 ± 0.01  |

build: 6c7e9a54 (6118)

Basically the same speed as my sweeps from when gpt-oss-120b dropped, so those should still be current: strix-halo-testing/llm-bench/gpt-oss-120b-F16 at main · lhl/strix-halo-testing · GitHub

Note: I believe @geerlingguy was using Mesa RADV for his testing, which is why his pp512 is so much lower. AMDVLK’s prompt processing is 2X faster:

❯ AMD_VULKAN_ICD=RADV build/bin/llama-bench -m /models/gguf/gpt-oss-120b-F16.gguf -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model          | size      | params   | backend    | ngl | fa | test  | t/s           |
| -------------- | --------- | -------- | ---------- | --- | -- | ----- | ------------- |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | Vulkan,RPC | 99  | 1  | pp512 | 209.98 ± 2.07 |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | Vulkan,RPC | 99  | 1  | tg128 | 33.16 ± 0.03  |

build: 6c7e9a54 (6118)

(if you look at the sweep page I linked you’ll see that different backends and flags can have a dramatic difference in both pp/tg speeds)

8 Likes

I tried reproducing this on my 13” HX 370. Does this look right?

$ ./llama-bench -m /var/home/bronson/.lmstudio/models/lmstudio-community/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model                | size      | params   | backend | ngl | fa | test  | t/s          |
| -------------------- | --------- | -------- | ------- | --- | -- | ----- | ------------ |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan  | 99  | 1  | pp512 | 76.38 ± 0.74 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan  | 99  | 1  | tg128 | 16.97 ± 0.14 |

build: cd6983d5 (6119)

I kinda hope I duffed something, because I’d sure like the Desktop (256-bit-wide LPDDR5x-8000, 256 GB/s of bandwidth) to be a lot more than twice as fast as my laptop (128-bit-wide DDR5-5600, 90 GB/s of bandwidth).

The prompt processing looks more like I would expect.

EDIT: wait, it looks like the quants are different! I gotta find me another model. Update coming.

OK, this should be the same model…

./llama-bench -m ~/.lmstudio/models/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model          | size      | params   | backend | ngl | fa | test  | t/s          |
| -------------- | --------- | -------- | ------- | --- | -- | ----- | ------------ |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | Vulkan  | 99  | 1  | pp512 | 79.52 ± 0.41 |
| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | Vulkan  | 99  | 1  | tg128 | 13.61 ± 0.07 |

build: cd6983d5 (6119)

So token generation on the Desktop is 2.43 times faster and prompt processing is 2.64 times faster. Since its RAM bandwidth is 2.8 times higher, this makes sense.
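Spelled out, the scaling arithmetic (figures taken from the llama-bench outputs earlier in the thread):

```python
# Strix Halo Desktop vs. Ryzen AI HX 370 laptop, same gpt-oss-120b F16 GGUF
tg_desktop, tg_laptop = 33.06, 13.61   # tg128 tokens/s
pp_desktop, pp_laptop = 209.98, 79.52  # pp512 tokens/s (both RADV)
bw_desktop, bw_laptop = 256, 90        # nominal memory bandwidth, GB/s

print(f"tg ratio: {tg_desktop / tg_laptop:.2f}x")  # ~2.43x
print(f"pp ratio: {pp_desktop / pp_laptop:.2f}x")  # ~2.64x
print(f"bw ratio: {bw_desktop / bw_laptop:.2f}x")  # ~2.84x
```

Token generation tracking just under the bandwidth ratio is consistent with generation being memory-bandwidth-bound.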

I like it, but I can’t wait to see the next mainboard… Cmon AMD, 8 or 12 channels, soldered down is fine. Blow the lanes wide open like the M3 Ultra. RAM bandwidth is the most important goal, you can do it!

1 Like

Yes, although you can see that when using AMDVLK Vulkan, the pp512 is actually more like 6X faster.

You can see where it doesn’t crash out, the ROCm hipBLASLt can be even faster on prompt processing: strix-halo-testing/llm-bench/gpt-oss-120b-F16 at main · lhl/strix-halo-testing · GitHub

(but worse on token generation).

The next APU (Medusa Halo?) is sometimes rumored to have a 384-bit bus (+50% MBW), but we’ll see next year, I guess. I think the next big unlock in MBW for this form factor will really have to wait for DDR6 (est. 2027, 8800-17600 MT/s). I’d also be looking for a GPU architecture upgrade. Per-CU, RDNA4 is 2X FP16/BF16 and 4X FP8/INT8/INT4. For AI/ML specifically, the use of RDNA3 is my biggest disappointment w/ Strix Halo.

tbh, for more MBW and compute, you can go last-gen EPYC and get more MBW for almost the same price (if you buy used or QS) and throw in a GPU for better prompt processing; however, it’s of course a completely different form-factor/class of system. My EPYC workstation w/ a couple of GPUs basically idles close to the power the Framework Desktop uses running at full tilt.

2 Likes

Thanks for the insights!

Btw, the unsloth variant in your second example claiming to have an “F16” quantization feels a little confusing to me. Doesn’t the vast majority of the weights still have to be 4 bits each, because otherwise the whole model could no longer fit in the 60-GiB ballpark?

It’s a bit confusing - in the repo they say “This is the MXFP4_MOE quant, now called F16, with our fixes” but it’s what they named the model, so ¯\_(ツ)_/¯

(FWIW, the gpt-oss-120b model is also misnamed since it’s 117B parameters.)

1 Like