OpenAI just dropped a pair of open models. However, I’m failing to understand what the system requirements for e.g. gpt-oss-120b are. The documentation mentions an H100 in the description, but I don’t really know what that means, how to compare it to the AI Max+ 395 (128 GB), or how much memory 120 billion parameters are going to require.
@catastrophic has written about a model with a similar parameter count, Llama 4 Scout 17B (109B total vs. OpenAI’s 117B):
But maybe I have misunderstood or overlooked something, and I figure that gpt-oss-120b might require space for another 8 billion params compared to the Llama 4 one.
So what I’d like to understand is:
- Can I expect gpt-oss-120b to run somewhat decently on the 128 GB Desktop?
- How much memory does the model likely require during inference?
It’s a complex question. I’m downloading it right now. Speed-wise, it all depends; you could have use cases where you’re fine with 1 t/s. What I can tell you right now is that the model is released with what they call MXFP4 quantization, so it is 63.39 GB on disk; it would fit on the Framework Desktop 128 GB version for sure. It uses a mixture of experts with 5.1B active parameters, so it should be quite fast for its size.
gpt-oss-120b does have more parameters, but OpenAI officially provides it in a format where most of the parameters take up 4.25 bits (the .25 is because there is a scale factor for every block of 32 parameters).
By contrast, Llama 4 has each parameter take up 16 bits; however, in the message you quoted, Framework is using “Q6” (which means it has been quantized/compressed to slightly over 6 bits per parameter).
So the smaller size of each parameter means that the memory requirements for gpt-oss-120b are overall smaller than for Llama 4 Scout Q6, despite the higher total parameter count.
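As a rough sanity check of that comparison, here is a sketch of the weight-memory math. The 4.25 bits/weight for MXFP4 is from OpenAI’s format as described above; ~6.56 bits/weight is an approximation for a llama.cpp-style “Q6” quant, so treat both numbers as estimates:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (decimal, as vendors report it)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

gpt_oss = weight_gb(117, 4.25)   # MXFP4: 4 bits + a shared scale per 32-block
scout_q6 = weight_gb(109, 6.56)  # ~6.56 bits/weight assumed for a Q6-style quant

print(f"gpt-oss-120b @ MXFP4: ~{gpt_oss:.0f} GB")   # ~62 GB
print(f"Llama 4 Scout @ Q6:   ~{scout_q6:.0f} GB")  # ~89 GB
```

The ~62 GB estimate lines up well with the 63.39 GB on-disk size quoted above (the remainder is non-quantized tensors and metadata).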
Yes. It should be quite fast.
It is a MoE (Mixture of Experts), which means that even though it has a lot of parameters, it only activates and processes the most relevant ones for each token (but it can change which ones are active very frequently, which is why they all need to be in RAM for good performance). As a result it is actually only processing 5.1 billion parameters per token, which should make it quite speedy.
The model itself is slightly over 60 GB, although there is also memory needed for context length and other overhead. OpenAI advertises it as well suited for use cases where you have 80 GB VRAM available or more.
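To get a feel for the “context length and other overhead” part, here is a generic KV-cache estimate. The layer/head numbers in the example call are illustrative placeholders for a GQA model, not gpt-oss-120b’s actual architecture:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: K and V each store layers*kv_heads*head_dim
    values per token, at bytes_per_elem each (2 for fp16)."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# Illustrative numbers only (a hypothetical 36-layer model with 8 KV heads):
print(kv_cache_gb(tokens=32_768, layers=36, kv_heads=8, head_dim=64))  # ~2.4 GB
```

So for a model in this class, tens of thousands of tokens of context can plausibly add a few GB on top of the weights, which is why OpenAI’s 80 GB figure leaves headroom above the ~63 GB of weights.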
@Claudia the model works fine on Ryzen AI 300 with 128 GB RAM, and according to amdgpu_top it allocates 55.5 GB of GTT. It’d probably run on a 64 GB Linux machine, leaving about 8 GB for the OS and everything else, at least with short context windows.
Edit: actually 64 GB may not be enough, as LM Studio is using another 12 GB of RAM.
It is affected, but (TL;DR: it works OK-ish today with the LM Studio/llama.cpp Vulkan backend):
- LM Studio has an option to use llama.cpp’s Vulkan backend. It implements portable offload to GPUs and is supposedly not as performant as ROCm.
- ollama can be forced to use ROCm with export HSA_OVERRIDE_GFX_VERSION="11.0.1", which tells ROCm to use a codepath for a different GPU that is supported. YMMV, but it tends to work to some extent. I do get GPU resets / system freezes with this, though, when trying to load larger models (over 10 GB), but it’s kind of random. Some people report that this GPU faking works fine for them; it might be me not using the AMD-provided amdgpu kernel module and going with the mainline kernel instead. Also, sometimes ollama decides that there’s not enough VRAM and loads on the CPU anyway; that seems to depend on whether it’s using llama.cpp or their own implementation in Go.
If all that fails, it’s possible to do CPU-only compute. Prompt processing is roughly 5-10x slower since it’s compute-bound. Token generation, though, is about the same speed, since it’s limited by RAM bandwidth, not compute. This is a well-known secret: you’d expect that a powerful integrated GPU or eventual NPU support would make LM Studio run much faster, but no, it’s still bottlenecked on RAM bandwidth. Compute does matter for prompt processing, and AFAIK also for image generation with Stable Diffusion and the like, and also if you manage to get batching going and have a use case for batch processing.
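The “limited by RAM bandwidth” point can be turned into a back-of-the-envelope ceiling: each generated token has to stream at least the active weights from memory, so tokens/s is bounded by bandwidth divided by active-weight bytes. This is a sketch that ignores KV-cache traffic and cache effects, so real numbers come in well below it:

```python
def tg_ceiling_tps(bandwidth_gbps: float, active_params_b: float,
                   bits_per_weight: float) -> float:
    """Upper bound on token generation: bandwidth / bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

# gpt-oss-120b: ~5.1B active params at 4.25 bits, on a 256 GB/s machine.
print(round(tg_ceiling_tps(256, 5.1, 4.25)))  # ~94 t/s, theoretical ceiling
```

The 30-35 t/s figures reported further down are comfortably under this bound, which is consistent with the memory-bound picture.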
AMD’s definition of “supported” is a little different from what you might think.
Supported hardware is fully tested on every official release and with all the libraries.
That does not mean other hardware cannot work.
llama.cpp can work with HIP alone, without any extra library; it can use rocBLAS/hipBLAS for some floating-point compute, but that is not used (any more?) for quantized models.
It is open source, and most of the support already exists in the code, “only” not tested/enabled by AMD. For example, over the last few months Fedora has done some (hard) work to build ROCm for the distribution, and more hardware is enabled there than AMD enables:
With Fedora 42, gfx1103, gfx1150, gfx1151 and gfx1152 are enabled.
Now, to be fair, on the FW16 (without dGPU) I can build and use the HIP backend for llama.cpp, but with large models it crashes somewhat randomly (the same happens with the Vulkan backend). I do not know whether it is something with the FW16 (BIOS?) or with the gfx1103. There are also many crashes with hardware video decoding, and artifacts…
Using HSA_OVERRIDE_GFX_VERSION="11.0.1" has some problems (in my view). It may work, but there are differences between these GPUs; for example, the gfx1103 differs only in L3 cache size, yet because of that the block sizes needed for good performance are really different.
Something around 30-35 tps for the Max+ 395 makes sense. Waiting for the payment and shipping notice (Batch 3)…
Here is my result from the Ryzen AI 370 HX:
→ openai/gpt-oss-120b (“MXFP4”, 63.39 GB): 13.0 to 14.5 tps
Software: LM Studio 0.3.22b1 on Lubuntu 24.04.x (Linux) with the Vulkan backend
Hardware: AceMagic F3A (AMD Ryzen AI 370 HX) with 96 GB of RAM (via BIOS: 48 GB assigned to the iGPU; an additional ~24 GB is available as GTT, shared between the CPU and the iGPU… and the 120B model requires most of that GTT memory on top of the assigned VRAM)
I kinda hope I duffed something, because I’d sure like the Desktop (256-bit LPDDR5X-8000, 256 GB/s bandwidth) to be a lot more than twice as fast as my laptop (128-bit DDR5-5600, 90 GB/s bandwidth).
The prompt processing looks more like I would expect.
EDIT: wait, it looks like the quants are different! I gotta find me another model. Update coming.
So token generation on the Desktop is 2.43 times faster and prompt processing is 2.64 times faster. Since its RAM bandwidth is 2.8 times higher, this makes sense.
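Checking that scaling arithmetic, using the bandwidth figures from the spec lines quoted above:

```python
desktop_bw, laptop_bw = 256, 90   # GB/s, from the two machines compared above
tg_speedup, pp_speedup = 2.43, 2.64

bw_ratio = desktop_bw / laptop_bw
print(f"bandwidth ratio: {bw_ratio:.2f}x")  # ~2.84x
# Token generation scales close to the bandwidth ratio; prompt processing,
# being more compute-bound, tracks it a bit less directly.
print(tg_speedup / bw_ratio, pp_speedup / bw_ratio)
```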
I like it, but I can’t wait to see the next mainboard… C’mon AMD: 8 or 12 channels, soldered down is fine. Blow the lanes wide open like the M3 Ultra. RAM bandwidth is the most important goal, you can do it!
The next APU (Medusa Halo?) is sometimes rumored to have a 384-bit bus (+50% MBW), but we’ll see next year, I guess. I think the next big unlock in MBW for this form factor will really have to wait for DDR6 (est. 2027, 8800-17600 MT/s). I’d also be looking for a GPU architecture upgrade. Per-CU, RDNA4 is 2x FP16/BF16 and 4x FP8/INT8/INT4. For AI/ML specifically, the use of RDNA3 is my biggest disappointment w/ Strix Halo.
TBH, for more MBW and compute, you can go last-gen EPYC for almost the same price (if you buy used or QS) and throw in a GPU for better prompt processing; however, it’s of course a completely different form factor/class of system. My EPYC workstation w/ a couple of GPUs basically idles close to the power the Framework Desktop uses running at full tilt.
Btw, the unsloth variant in your second example claiming to have an “F16” quantization feels a little confusing to me. Doesn’t the vast majority of the weights still have to be 4 bits each? Otherwise the whole model could no longer fit in the 60-GiB ballpark.
It’s a bit confusing - in the repo they say “This is the MXFP4_MOE quant, now called F16, with our fixes” but it’s what they named the model, so ¯\_(ツ)_/¯
(FWIW, the gpt-oss-120b model is also misnamed since it’s 117B parameters.)