Which language models are you using?

ndom91 · January 25, 2026, 3:41pm

Sorry no idea about Windows unfortunately. This was designed for Linux.

You probably have to modify the devices you’re passing in in the docker-compose.yml. Those /dev/kfd / /dev/dri devices are linux specific. If it works at all for you it’s running on the CPU.

Rand_o · January 25, 2026, 5:36pm

I really like opencode with the oh-my-opencode plugin. I’ll admit I have not really tried codex or claude code but I was already happy with what I had and found no reason to look else where. Give that setup a try and let me know how you think it compares to the others

Thomas_Munn · January 25, 2026, 6:26pm

I have managed 70 tokens per second (with some of the optimizations in another thread) with qwen coder a3b (6 bit quantized). It does a decent job at coding, and I use gemini as the ‘planner’.

0xdeadbeef · January 25, 2026, 6:33pm

I have not used any of the q6 ggufs yet - will give them a shot, have only been using q4 since I like having as many options loaded at once as I can, but q4 caps me at about 60 tops regardless of model - thanks for this info! Another option I’m looking at is the new router functionality of llama.cpp, but I’m hesitant as it doesn’t look like you can pin certain models or give a priority preference on what models get destaged when memory pressure occurs. Might play with that this week.

Are you using online gemini for planning, or gemma3 locally?

Thomas_Munn · January 26, 2026, 2:50am

I use a method I derived from reading, ai and intuition: SPLIT BRAIN mode. I first of all use a local program called repomix, which first generates an xml schema of the local project (such as data, etc, programs). Then I upload to gemini in ‘smart’ or thinking mode, and ask a lot of questions, with the final prompt being something like “I am doing project X and need to do blah and blah” Here are some pieces of information (schemas are VERY important). I want the local cline to ONLY code and not to do ANY architecture. Please generate an XML plan that I can feed to the model and instruct me in which order I should prompt it”. The xml paradigm seems to work REALLY well, and without the thinking part cline seems to code at a blistering 70 tokens (or 60 on a bad one) with up to 60,000 tokens so far. The main thing I have seen issues with is the schema generation (for instance I was asking how to convert from grav cms to ZOLA and it totally borked on the formats because it was using older ones. Its especially imporant to include date cutoffs if you are using something ‘current’ or it will use old knowledge and you will be sad.

Zac_Nowicki · January 28, 2026, 1:45am

Hey y’all, I’ve been trying out gpt-oss-120B via AMD’s lemonade and opencode. I’m a bit new to this stuff, so wondering if someone could point me in the right direction (or correct my expectations) for some issues I’ve been having with it:

On refactoring tasks, it is very common that after the first or second edit, the model will make a “corrupt” edit. For example, if it’s rewriting a 50 line function, it will replace the whole function, BUT leave ~20 lines of the old implementation hanging off the end. A lot of time is lost as it notices this and fixes it. If it’s writing something from scratch, it’s fine. If it’s making tiny edits, it’s fine. There’s a sweet spot size of edit where there seems to be a high chance of it screwing up.
Occasionally, the model will make edits with literal “\n” written in the code, instead of actual newline sequences.
One role I’m trying to fill with an agent is one that scrutinizes files for potential bugs and documents them on the filesystem for injesting by other systems. OSS is quite good at the reviewing part, but it’s bad at following instructions; it only documents maybe 1/10 issues that it identifies in its thinking. I’ve tried prompting it to “Find ONE issue. Stop. Document it. Repeat” (tldr), and that helped a little, but still not really what I’m looking for. I might just need a new strategy for this (brain dump to file → another model picks apart), but thought I’d mention in case it could be improved…

Other models in my inventory don’t seem to have this issue, just OSS-120B.

I’ve tried a ton of variations on system prompt (stock opencode Build, custom ones I made up, prompts from others I found online, …), session prompt, model options, and it’s still not as reliable as I’d like.

Lemonade pulls ggml’s distribution of OSS-120B. I’ve also tried LM Studio for the heck of it, and that actually introduced even more problems, like broken too calls and odd issues where the raw harmony output comes through…

Asked on opencode Discord but doesn’t seem like others are having the same issues.

Any tips appreciated, or at least to know if I’m not the only one…

Guest534 · February 1, 2026, 12:22pm

Hey Zac,

I have had issues with oss120 and structured output or understanding it. It notoriously breaks my code by not closing functions properly so you are not alone. Any of these models even the paid ones start to degrade once the context length gets above 60k tokens. Here is my rube goldberg setup that I’m using for refactoring my personal php app and it’s working well:

Windows 11 debloated and light

Docker running php, mysql and opencode

Lm studio

Project loaded with phpstorm running which automatically versions so easy to watch changes in real time

Bash into container and run opencode

Planning and small tasks essential

Models: gptoss 120b, granite 4b

Code:

Models: GLM Flash 4.7 8quant, qwen 30b coder, Nemotron 30b

Locally I can get a solid scaffolding and design built. Html, javascript, css

When things break I use the free large Models on opencode to fix if needed. I always start new sessions when context size gets over 65-70k tokens. Baby steps. But read your code because if you can fix it do it.

Zac_Nowicki · February 3, 2026, 2:03am

Yea, in my instance it doesn’t really seem to matter how big the context is, it’s not uncommon for 120B to mess up an edit when I’m only 10-20k tokens into the job. But good to know I’m not entirely alone on some of these things.

Guest534 · February 3, 2026, 10:15am

Its definitely a lazy model. What quant are you using? You could try as an experiment to use Antigravity which has it on its free tier to do a job you know it fails at to see if you get same results. But it definitely cuts corners which I think is just baked into it

Nico_Thomaier · February 3, 2026, 5:10pm

Currently I am trying to optimize gpt-oss-120b:MXFP4 and glm-4.7-flash:Q6_XL performance. Especially a high prompt processing speed on long context tasks is currently my main goal. Squeezing the last bit out of the generation Speed might be fun, but with such small models i believe it is crucial to give them as much high quality context as possible. Waiting 30 Minutes for them to crunch through 120k Tokens is kind of killing a lot of usecases for me.

$ uname -a
Linux frmwrk01 6.18.7-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jan 23 16:42:34 UTC 2026 x86_64 GNU/Linux

$ lsb_release -a
LSB Version:    n/a
Distributor ID: Fedora
Description:    Fedora Linux 43 (Workstation Edition)
Release:        43

So… Currently the winner for Performance on Long Context Tasks for GPT-OSS-120b is the llama-rocm7-nightlies toolbox with the following parameters:

$ lama-bench \
   -m “$HOME/models/gpt-oss-120b/mxfp4/gpt-oss-120b-mxfp4-00001-of-00003.gguf” \
   -ub 2048 \
   -b 8192 
   -p 2000,20000,60000,120000 \
    -n 128 \ 
    -fa 1 \
    -t 16 \
    -r 1
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32

model	size	params	backend	ngl	n_batch	n_ubatch	fa	test	t/s
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	8192	2048	1	pp2000	547.26 ± 0.00
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	8192	2048	1	pp20000	435.89 ± 0.00
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	8192	2048	1	pp60000	288.28 ± 0.00
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	8192	2048	1	pp120000	190.54 ± 0.00
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	8192	2048	1	tg128	52.04 ± 0.00

As soon as i have a winner for GLM-4.7-flash i will also post the results

Djip · February 3, 2026, 7:34pm

For now use fedora43 rocm 6.4.4 there is a regression bug on 7.1+ / 7.9+

github.com/ggml-org/llama.cpp

Misc. bug: Performance regression using ROCm on Strix Halo

opened 08:51PM - 10 Dec 25 UTC

eugr

bug-unconfirmed

### Name and Version build: 4df6e859e (7349) ### Operating systems Linux ###… Which llama.cpp modules do you know to be affected? llama-server, llama-bench ### Command line ```shell build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0 ``` ### Problem description & steps to reproduce This commit (as found by using git bisect) has introduced significant performance regression in prompt processing on Strix Halo and ROCm 7.x: [668ed765742065f82c2899e101ee4384d6669f11] HIP: enable WMMA-MMQ INT kernels for RDNA 3 (#17576) I tried to compile with both rocWMMA and without with the same result. Here are llama-bench results. Note that they have a wider gap in bench numbers, but I didn't run full benches when doing git bisect, the first row was usually enough. Command: ```bash build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0 ``` Performance before: | model | size | params | backend | test | t/s | | ---------------------- | --------: | -------: | ------- | --------------: | ------------: | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 900.29 ± 2.04 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 52.28 ± 0.01 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 758.10 ± 3.08 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.68 ± 0.01 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 651.92 ± 1.93 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.57 ± 0.02 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 503.05 ± 1.17 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 42.55 ± 0.01 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 329.94 ± 2.27 | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 36.02 ± 0.04 | build: 13628d8bd (7243) After: | model | size | params | backend | test | t/s | | | ------------------------------- | ---------: | ---------: | ----------- | --------------: | -------------------: | --- | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 543.49 ± 1.68 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 52.30 ± 0.01 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 487.99 ± 0.37 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.75 ± 0.01 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 437.11 ± 0.50 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.60 ± 0.02 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 357.88 ± 0.84 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 42.98 ± 0.03 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 263.72 ± 0.79 | | | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 36.45 ± 0.01 | | build: 4df6e859e (7349) ### First Bad Commit [668ed765742065f82c2899e101ee4384d6669f11] HIP: enable WMMA-MMQ INT kernels for RDNA 3 (#17576) ### Relevant log output ```shell ```

github.com/ROCm/rocm-systems

[Issue]: ROCm 7+ Performance regression on llama.cpp

opened 06:06PM - 26 Jan 26 UTC

kyuz0

status: triage

### Problem Description There is a considerable performance regression when usi…ng Llama.cpp going from ROCM 6.4.4 to 7.2 or the ROCm nightly builds from TheRock. Source: https://github.com/kyuz0/amd-strix-halo-toolboxes/issues/45#issuecomment-3796854322 Llama.cpp with ROCm 6.4.4 is faster than when using 7.2, which is the worst performance regression (3x slower !!), and 7-nightlies (from TheRock), almost 2x slower than 6.4.4: Examples: ```rocm-7-nightlies model size params backend ngl n_ubatch fa test t/s gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B ROCm 999 2048 1 pp512 815.27 ± 7.37 gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B ROCm 999 2048 1 tg128 72.97 ± 0.29 build: 8f91ca54e (7822) rocm-7.2 model size params backend ngl n_ubatch fa test t/s gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B ROCm 999 2048 1 pp512 545.11 ± 6.65 gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B ROCm 999 2048 1 tg128 73.21 ± 0.06 build: 8f91ca54e (7822) rocm-6.4.4 model size params backend ngl n_ubatch fa test t/s gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B ROCm 999 2048 1 pp512 1648.22 ± 20.43 gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B ROCm 999 2048 1 tg128 72.96 ± 0.05 build: 8f91ca54e (7822) ``` Full table here for many model architectures and quantizations: https://kyuz0.github.io/amd-strix-halo-toolboxes/ ### Operating System Fedora 43 (6.18.3-200) ### CPU AMD Ryzen AI MAX 395+ ### GPU Strix Halo gfx1151 ### ROCm Version ROCm 7+ ### ROCm Component _No response_ ### Steps to Reproduce _No response_ ### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support _No response_ ### Additional Information _No response_

At least they find where it come from, but need more time for therock patch…

Nico_Thomaier · February 4, 2026, 8:18am

I saw this problem on the github repo of kyus0. Yesterday i already downloaded the ROCm-6.4.4 toolbox for testing, but did not come around to it. We will see if they fix it before i test 6.4.4.

Thomas_Munn · February 5, 2026, 3:05am

Anyone having much luck with 7.2 and kernel 6.18.6?

Djip · February 5, 2026, 8:53pm

There is a workaround (build opt). I think kyus0 test it and update is “toolbox” with it.

Rand_o · February 14, 2026, 11:13pm

I have an opencode setup where I have GLM-4.7-Flash as more of an architecting agent and qwen3-coder-next as the implementing agent. But glm just does not perform well at all it is super slow compared to qwen which is way bigger and way more performant. Anyone have any suggestions for a good reasoning model to use instead? looking for something 20-30 Gb. Anyone have experience with a better model

Thomas_Munn · February 16, 2026, 10:16pm

I finally got my ROCM 7.2 and 6.18.9 working! Rocm is solid finally, and seems at least as fast if not performant than the vulkan for the first time. I ran it for about 1.5 hours and got up to 150k tokens, with times for prompt at 300tps and reponses at about 20-35tps depending on where in the queue I am. I am running with qwen3 coder next (80b) and a encoder model (qwen 3 encoder) to make ROO do local indexing of my source trees. The main thing was properly compiling (I don’t lke docker) so that we get our proper memory. BEWARE the ids of AI.

!/usr/bin/fish

# 1. Get the Code
# Checks if the directory exists. If not, clones it.
if test -d llama.cpp
    echo "Directory exists. Updating to latest master..."
    cd llama.cpp
    git fetch origin
    # Force reset to match remote exactly (discards local changes)
    git reset --hard origin/master
else
    echo "Cloning repository..."
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
end

# 2. Clean and Prepare
# Removes build artifacts and untracked files
git clean -xdf
git submodule update --init --recursive

# 3. Configure CMake (Strix Halo Optimized)
# -DGGML_HIP_UMA=ON: Zero-copy memory for 128GB APU
# -DAMDGPU_TARGETS=gfx1151: RDNA 3.5 native target
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_HIP_UMA=ON \
    -DROCM_PATH=/opt/rocm \
    -DHIP_PATH=/opt/rocm \
    -DHIP_PLATFORM=amd \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_EXE_LINKER_FLAGS="-static-libgcc -static-libstdc++"

# 4. Build
# Uses all cores (-j) to compile
cmake --build build --config Release -- -j(nproc)

# 5. Verify
echo "Build complete. Verifying linkage:"
ldd build/bin/llama-cli | grep hip

This seems to build a nice stable (at least for me) usable rocm llama.cpp (at least for the model I tested, qwen 3 next 80b). This also assumes ROCM 7.2 is installed and working on your main system. And required libs are installed etc.

0xdeadbeef · February 17, 2026, 5:38am

LOL, there must be something in the air tonight (for success). I could not get llama.cpp (vulkan or rocm) to stop segfaulting after about 20-30k tokens when using opencode for the past two days whem using qwen-coder-next q8_0.

I finally found the autoparsing branch of llama from pwilkins, rebuilt @kyuz0’s rocm7-2 toolbox with it, and just got through an intensive planning session with opencode+superpowers brainstorming skill, going through about 350-400k tokens without a crash. Yay.

Keeping up with the ecosystem and landscape around AMD in this space is exponentially harder than nvidia. If I didn’t know linux and hardware well, I’d be lost weeks/months ago - we have a long way to go to make this usable for normies.

Djip · February 18, 2026, 12:00am

you can remove -DGGML_HIP_UMA=ON it is no more part of the cuda/hip backend .
and add export GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON before llama.cpp exec. (it is the new way to use “UMA”)
that way you do not need to change/config the GTT

Michael_Edward_Davis · February 22, 2026, 2:37am

I’m curious if you can find a good reasoning model in that range. I’m using the following which are decent, but by no means are excellent.

DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf - 43 GB
GLM-4.7-Flash-UD-Q8_K_XL.gguf - 35 GB

s_m8o · February 26, 2026, 7:00pm

Ironic indeed.

So are you implying that Nemotron (is that Nemotron 3?) doesn’t “run best on NVIDIA GPUs” for inference?

Further, if you have enough experience or have read things with the following to offer an opinion, does Nemotron run as equally performative on the Strix Halo APU to let’s say if someone runs it on two 4060 or 4070 16 Gig cards?

Wondering further, have you done any fine-tuning and if so does that change the answer?

Much thanks for the answers if you have any experience with this.

Topic		Replies	Views
[TRACKING] Will the AI Max+ 395 (128GB) be able to run gpt-oss-120b? Framework Desktop framework-desktop-ai-max-300 , ai	35	15053	January 25, 2026
Llama.cpp/vLLM Toolboxes for LLM inference on Strix Halo Framework Desktop	56	9755	February 2, 2026
AMD AI Max+ 395 128GB with cline Framework Desktop ai	14	1670	September 5, 2025
AMD Strix Halo (Ryzen AI Max+ 395) GPU LLM Performance Tests Framework Desktop ai	17	19469	September 29, 2025
Framework 13 + Ryzen AI + Linux Distro + LLM Linux ubuntu , fedora	20	4603	February 11, 2026

Which language models are you using?

Related topics