Simplest possible setup for running gpt-oss-120b on Ubuntu (preferably llama.cpp)?

Hello,

I am looking for the simplest possible setup to run gpt-oss-120b on Ubuntu (25.10).

I started out with the smaller gpt-oss-20b and got it running more or less with just

  1. Install latest llama-b6838-bin-ubuntu-vulkan-x64.zip from Releases · ggml-org/llama.cpp · GitHub
  2. Add user to render,video groups
  3. Run the command line from the guide: “./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048” (the full sequence is sketched below)
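
Spelled out as shell commands, that looks roughly like this (the download URL is constructed from the release name above, and the location of the binaries inside the zip can vary between releases, so treat it as a sketch):

# download and unpack the Vulkan build (asset name from the Releases page)
wget https://github.com/ggml-org/llama.cpp/releases/download/b6838/llama-b6838-bin-ubuntu-vulkan-x64.zip
unzip llama-b6838-bin-ubuntu-vulkan-x64.zip -d llama-vulkan

# allow GPU access, then log out/in so the group change takes effect
sudo usermod -aG render,video "$USER"

# run the server (llama-server may sit at the top level or under build/bin)
cd llama-vulkan
./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048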

It also works nicely with GitHub - openai/codex: Lightweight coding agent that runs in your terminal.

When trying to run the 120b model I get “radv/amdgpu: Not enough memory for command submission.”

Based on How To Run OpenAI’s GPT-OSS 20B and 120B Models on AMD Ryzen™ AI Processors and Radeon™ Graphics Cards, I was hoping that I could run the 120b model without fiddling with GPU memory configuration. Does anyone know if that is possible? Maybe with just some slightly different parameters to llama-server?

Best,

Anders

Maybe the simplest approach would be to create a “python” package and use the new Python ROCm packages:

(i.e. the new TheRock builds…)

If that works, it would be as simple as a pip install llama.cpp ...
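
As an illustration of what that could look like (purely a sketch: ROCm/TheRock wheels for llama.cpp don’t exist yet as far as I know, and the existing llama-cpp-python project is used here only as a stand-in; the model path and layer count are placeholders):

# today the closest pip route is the llama-cpp-python project rather than "llama.cpp" itself
pip install "llama-cpp-python[server]"
# OpenAI-compatible server from that package
python3 -m llama_cpp.server --model ~/models/gpt-oss-120b-mxfp4.gguf --n_gpu_layers 99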

Why not just search? There is already a thread on this: Will the AI Max+ 395 (128GB) be able to run gpt-oss-120b?

Hello. I did read that one, and thank you for the nice answers there. I also posted some small questions to Claudia there, but so as not to hijack that thread I made this separate one focused on the simplest possible setup.

The simplest possible setup is probably to download the pre-built ROCm version from the Lemonade SDK: GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration
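
A rough sketch of that route (the asset name below is a placeholder; pick the actual Ubuntu build for your GPU from the repo’s Releases page):

# placeholder file name - check the llamacpp-rocm Releases page for the real one
wget https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/download/llama-cpp-rocm-ubuntu-x64.zip
unzip llama-cpp-rocm-ubuntu-x64.zip -d llamacpp-rocm
cd llamacpp-rocm
./llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ngl 99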


Unfortunately this also fails for me.

With

./llama-server -m /home/andersrudkjaer/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99

I get

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        ROCm0 model buffer size = 59851.68 MiB
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context:  ROCm_Host  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4608.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 4831838208
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache

Looks like you are running out of memory. Do you have the 128GB version? If so, how much is allocated to VRAM? Did you pre-allocate it in the BIOS, or did you go the GTT route? gpt-oss-120b needs about 70GB of VRAM with full context.
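
To see how the split currently looks without rebooting into the BIOS, the amdgpu driver exposes the totals in sysfs (the card index may be card1 on some systems):

# dedicated VRAM and GTT totals in bytes, as reported by the amdgpu driver
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total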

Yes. I have the 128GB version

andersrudkjaer@framework:~/projects/llm/llama-b1090-ubuntu-rocm$ cat /proc/meminfo
MemTotal:       128803180 kB
MemFree:        64974132 kB

I have not touched the BIOS; I hoped it would not be necessary. So I have whatever the default is (batch 13, if that makes a difference).

try this:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON  ./llama-server -m /home/andersrudkjaer/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99 --no-mmap -ub 2048 -b 8192

With that there is no need for a BIOS/GTT change.


Default is 64GB of VRAM, which won’t be enough for gpt-oss-120b with context. You can try Djip’s suggestion, or set up the ttm pages_limit (GTT is deprecated anyway) as described in this guide: https://strixhalo-homelab.d7.wtf/AI/AI-Capabilities-Overview#gpu-compute - that will likely also help with other tools.
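
For reference, the ttm route from that guide comes down to a module option; a sketch for a 128GB machine (the exact pages_limit value, counted in 4 KiB pages, should be taken from the guide for your RAM size):

# /etc/modprobe.d/ttm.conf
# 27648000 pages x 4 KiB ~ 105 GiB usable by the GPU - adjust per the guide
options ttm pages_limit=27648000 page_pool_size=27648000

Then run sudo update-initramfs -u and reboot so the option is picked up when ttm loads.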


This seems to work with no changes to the BIOS or modprobe config. Thanks!

For llama.cpp + gpt-oss-120b I found it very easy to run with the Kyuz0 toolboxes.

The vulkan-amdvlk container has been solid for me, working with all the GGUF models I have tried so far.


So, my conclusion so far is that it is actually very simple: just download llama.cpp from the Lemonade builds here - GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration

And then run with the variable @Djip suggested: GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
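
For reference, the full command line that ended up working here (the model path is wherever llama.cpp cached the download):

GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON ./llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99 --no-mmap -ub 2048 -b 8192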
