Simplest possible setup for running gpt-oss-120b on Ubuntu (preferably llama.cpp)?

Hello,

I am looking for the simplest possible setup to run gpt-oss-120b on Ubuntu (25.10).

I started out with the smaller gpt-oss-20b and got it running more or less with just

  1. Install latest llama-b6838-bin-ubuntu-vulkan-x64.zip from Releases · ggml-org/llama.cpp · GitHub
  2. Add user to render,video groups
  3. Run the command line from the guide: “./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048” (the full sequence is sketched below)
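
Spelled out as shell commands, that looks roughly like this (the download URL is constructed from the release name above, and the location of the binaries inside the zip can vary between releases, so treat it as a sketch):

# download and unpack the Vulkan build (asset name from the Releases page)
wget https://github.com/ggml-org/llama.cpp/releases/download/b6838/llama-b6838-bin-ubuntu-vulkan-x64.zip
unzip llama-b6838-bin-ubuntu-vulkan-x64.zip -d llama-vulkan

# allow GPU access, then log out/in so the group change takes effect
sudo usermod -aG render,video "$USER"

# run the server (llama-server may sit at the top level or under build/bin)
cd llama-vulkan
./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048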

It also works nicely with GitHub - openai/codex: Lightweight coding agent that runs in your terminal.

When trying to run the 120b model I get “radv/amdgpu: Not enough memory for command submission.”

Based on How To Run OpenAI’s GPT-OSS 20B and 120B Models on AMD Ryzen™ AI Processors and Radeon™ Graphics Cards, I was hoping that I could run the 120b model without fiddling with GPU memory configuration. Does anyone know if that is possible? Maybe with just some slightly different parameters to llama-server?

Best,

Anders

Maybe the simplest approach would be to create a “python” package and use the new Python ROCm packages:

(i.e. the new TheRock builds…)

If that works, it would be as simple as a pip install llama.cpp ...
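
As an illustration of what that could look like (purely a sketch: ROCm/TheRock wheels for llama.cpp don’t exist yet as far as I know, and the existing llama-cpp-python project is used here only as a stand-in; the model path and layer count are placeholders):

# today the closest pip route is the llama-cpp-python project rather than "llama.cpp" itself
pip install "llama-cpp-python[server]"
# OpenAI-compatible server from that package
python3 -m llama_cpp.server --model ~/models/gpt-oss-120b-mxfp4.gguf --n_gpu_layers 99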

Why not just search? There is already a thread on this: Will the AI Max+ 395 (128GB) be able to run gpt-oss-120b?

Hello. I did read that one, and thank you for the nice answers there. I also posted some small questions to Claudia there, but so as not to hijack that thread I made this separate one focused on the simplest possible setup.

The simplest possible setup is probably to download the pre-built ROCm version from the Lemonade SDK: GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration
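
A rough sketch of that route (the asset name below is a placeholder; pick the actual Ubuntu build for your GPU from the repo’s Releases page):

# placeholder file name - check the llamacpp-rocm Releases page for the real one
wget https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/download/llama-cpp-rocm-ubuntu-x64.zip
unzip llama-cpp-rocm-ubuntu-x64.zip -d llamacpp-rocm
cd llamacpp-rocm
./llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ngl 99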


Unfortunately this also fails for me.

With

./llama-server -m /home/andersrudkjaer/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99

I get

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        ROCm0 model buffer size = 59851.68 MiB
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context:  ROCm_Host  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4608.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 4831838208
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache

Looks like you are running out of memory. Do you have the 128GB version? If so, how much is allocated to VRAM? Did you pre-allocate it in the BIOS, or did you go the GTT route? gpt-oss-120b needs about 70GB of VRAM with full context.
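
To see how the split currently looks without rebooting into the BIOS, the amdgpu driver exposes the totals in sysfs (the card index may be card1 on some systems):

# dedicated VRAM and GTT totals in bytes, as reported by the amdgpu driver
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total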

Yes. I have the 128GB version

andersrudkjaer@framework:~/projects/llm/llama-b1090-ubuntu-rocm$ cat /proc/meminfo
MemTotal:       128803180 kB
MemFree:        64974132 kB

I have not touched the BIOS; I hoped it would not be necessary. So I have whatever the default is (batch 13, if that makes a difference).

try this:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON  ./llama-server -m /home/andersrudkjaer/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99 --no-mmap -ub 2048 -b 8192

With that there is no need for a BIOS/GTT change.


Default is 64GB of VRAM, which won’t be enough for gpt-oss-120b with context. You can try Djip’s suggestion, or set up the ttm pages_limit (GTT is deprecated anyway) as described in this guide: https://strixhalo-homelab.d7.wtf/AI/AI-Capabilities-Overview#gpu-compute - that will likely also help with other tools.
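
For reference, the ttm route from that guide comes down to a module option; a sketch for a 128GB machine (the exact pages_limit value, counted in 4 KiB pages, should be taken from the guide for your RAM size):

# /etc/modprobe.d/ttm.conf
# 27648000 pages x 4 KiB ~ 105 GiB usable by the GPU - adjust per the guide
options ttm pages_limit=27648000 page_pool_size=27648000

Then run sudo update-initramfs -u and reboot so the option is picked up when ttm loads.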


This seems to work with no changes to the BIOS or modprobe config. Thanks!

For llama.cpp + gpt-oss-120b I found it very easy to run with the Kyuz0 toolboxes.

The vulkan-amdvlk container has been solid for me, working with all the GGUF models I have tried so far.


So, my conclusion so far is that it is actually very simple: just download llama.cpp from the Lemonade builds here - GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration

And then run with the variable @Djip suggested: GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
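
For reference, the full command line that ended up working here (the model path is wherever llama.cpp cached the download):

GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON ./llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99 --no-mmap -ub 2048 -b 8192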
