Hello,
I am looking for the simplest possible setup to run gpt-oss-120b on Ubuntu (25.10).
I started out with the smaller gpt-oss-20b and got it running, more or less, with just the following (spelled out below):
1. Install the latest llama-b6838-bin-ubuntu-vulkan-x64.zip from Releases · ggml-org/llama.cpp · GitHub
2. Add my user to the render and video groups
3. Run the command line from the guide: ./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048
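In shell terms, roughly (the archive name is the one I downloaded and the extract location is just an example; adjust the path to wherever llama-server ends up):
# 1. download and extract the Vulkan build from the llama.cpp releases page
unzip llama-b6838-bin-ubuntu-vulkan-x64.zip -d ~/llama.cpp
# 2. add the current user to the render and video groups, then log out and back in
sudo usermod -aG render,video $USER
# 3. run the server from the directory that contains the extracted llama-server binary
./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048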
It also works nicely with GitHub - openai/codex: Lightweight coding agent that runs in your terminal.
When trying to run the 120b model I get “radv/amdgpu: Not enough memory for command submission.”
Based on “How To Run OpenAI’s GPT-OSS 20B and 120B Models on AMD Ryzen™ AI Processors and Radeon™ Graphics Cards” I was hoping that I could run the 120b model without fiddling with GPU memory configuration. Does anyone know if that is possible? Maybe with just some slightly different parameters to llama-server?
Best,
Anders
Djip
October 25, 2025, 2:02pm
2
Maybe the simplest will be to create a Python package and use the new Python ROCm package:
Learn how to install AMD ROCm 7.9.0 for supported Instinct GPUs and Ryzen AI APUs on Ubuntu, RHEL, and Windows. This step-by-step guide covers prerequisites, driver setup, installation methods (pip and tarball), and troubleshooting.
(i.e. the new TheRock builds…)
If we can, it will be as simple as a pip install llama.cpp ...
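As a rough sketch of what that end-user flow could look like (purely hypothetical: neither the llama.cpp package nor the exact ROCm package names exist in this form yet, so take the pip lines as placeholders):
# hypothetical end-user flow once such a package exists
python3 -m venv ~/llm-venv
source ~/llm-venv/bin/activate
# the ROCm Python packages would be installed per AMD's pip instructions in the
# linked guide; the llama.cpp package itself is hypothetical at this point
pip install llama.cpp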
lhl
October 25, 2025, 3:09pm
3
Hello. I did read that one, and thank you for the nice answers there. I also posted some small questions to Claudia there, but so as not to hijack that thread I made this separate one focused on the simplest possible setup.
Eugr
October 25, 2025, 6:24pm
5
The simplest possible setup is probably to download pre-built ROCm version from Lemonade SDK: GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration
Unfortunately this also fails for me.
With
./llama-server -m /home/andersrudkjaer/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99
I get
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: ROCm0 model buffer size = 59851.68 MiB
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4608.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 4831838208
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
Eugr
October 26, 2025, 4:56pm
7
Looks like you are running out of memory. Do you have the 128GB version? If so, how much is allocated to VRAM? Did you pre-allocate it in the BIOS, or did you go the GTT route? gpt-oss-120b needs about 70GB of VRAM with full context.
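A quick way to check the current split (sketch only; the card index may differ on your machine):
# dedicated VRAM carve-out and GTT limit reported by the amdgpu driver, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total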
Yes, I have the 128GB version.
andersrudkjaer@framework:~/projects/llm/llama-b1090-ubuntu-rocm$ cat /proc/meminfo
MemTotal: 128803180 kB
MemFree: 64974132 kB
I have not touched the BIOS; I was hoping it would not be necessary. So I have whatever the default is (batch 13, if that makes a difference).
Djip
October 26, 2025, 7:38pm
9
try this:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON ./llama-server -m /home/andersrudkjaer/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 0 --jinja -ngl 99 --no-mmap -ub 2048 -b 8192
With that, there is no need for a BIOS/GTT change.
Eugr
October 26, 2025, 10:26pm
10
Default is 64GB VRAM, which won’t be enough for gpt-oss-120b with context. You can try Djip’s suggestion or set the ttm pages_limit (GTT is deprecated anyway) as described in this guide: https://strixhalo-homelab.d7.wtf/AI/AI-Capabilities-Overview#gpu-compute. It will likely help with other tools as well.
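For reference, that route boils down to a modprobe option for the ttm module; a rough sketch (take the exact values from the guide, the number below is only illustrative for a 128GB machine):
# raise the TTM allocation limit (values are 4KiB pages; 27648000 pages is roughly
# 105GiB, check the guide for the value matching your RAM size)
echo "options ttm pages_limit=27648000 page_pool_size=27648000" | sudo tee /etc/modprobe.d/ttm.conf
# rebuild the initramfs so the option is applied at boot, then reboot
sudo update-initramfs -u
sudo reboot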
This seems to work with no changes to the BIOS or modprobe config. Thanks!
For llama.cpp + gpt-oss-120b I found it very easy to run with the Kyuz0 toolboxes:
GitHub - kyuz0/amd-strix-halo-toolboxes
The vulkan-amdvlk container has been solid for me, working with all the GGUF models I have tried so far.
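In case it helps, this is roughly how I use it (a sketch; the image registry path, the tag, and whether llama-server is on the PATH inside the container are written from memory, so verify them against the repo README):
# create a toolbox from the published image and enter it
toolbox create llama-vulkan --image ghcr.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk
toolbox enter llama-vulkan
# inside the container, run the bundled llama.cpp build; the home directory is
# shared with the host by default, so the same GGUF files can be reused
llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ngl 99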
So, my conclusion so far is that it is actually very simple: just download llama.cpp from the Lemonade builds here: GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration
and then run it with the variable @Djip suggested: GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
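Put together, the working setup looks roughly like this (the archive name follows the llama-bXXXX-ubuntu-rocm pattern of the build I used; adjust the filename and the path to llama-server for the release you download):
# extract the Lemonade ROCm build of llama.cpp and go to the directory with llama-server
unzip llama-b1090-ubuntu-rocm.zip -d ~/llama-rocm
cd ~/llama-rocm
# GGML_CUDA_ENABLE_UNIFIED_MEMORY lets the ROCm backend allocate from unified/system
# memory instead of being limited to the BIOS VRAM carve-out
GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON ./llama-server \
  -hf ggml-org/gpt-oss-120b-GGUF \
  --ctx-size 0 --jinja -ngl 99 --no-mmap -ub 2048 -b 8192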