Sorry, no idea about Windows, unfortunately. This was designed for Linux.
You probably have to modify the devices you're passing in via docker-compose.yml. Those /dev/kfd and /dev/dri devices are Linux-specific. If it works at all for you, it's running on the CPU.
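For reference, a minimal sketch of what that device passthrough typically looks like in a docker-compose.yml - the service name and image here are placeholders, not from the original setup:

```yaml
services:
  llm:
    image: your-rocm-image        # placeholder, not the actual image
    devices:
      - /dev/kfd:/dev/kfd         # ROCm compute interface (Linux only)
      - /dev/dri:/dev/dri         # GPU render nodes (Linux only)
    group_add:
      - video                     # group commonly required for GPU device access
```

On Windows these device nodes simply don't exist, which is why the container would fall back to CPU inference.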
I really like opencode with the oh-my-opencode plugin. I'll admit I have not really tried Codex or Claude Code, but I was already happy with what I had and found no reason to look elsewhere. Give that setup a try and let me know how you think it compares to the others.
I have managed 70 tokens per second (with some of the optimizations in another thread) with qwen coder a3b (6 bit quantized). It does a decent job at coding, and I use gemini as the ‘planner’.
I have not used any of the q6 GGUFs yet - will give them a shot. I have only been using q4, since I like having as many models loaded at once as I can, but q4 caps me at about 60 tps regardless of model - thanks for this info! Another option I'm looking at is the new router functionality of llama.cpp, but I'm hesitant, as it doesn't look like you can pin certain models or set a priority for which models get destaged when memory pressure occurs. Might play with that this week.
Are you using online gemini for planning, or gemma3 locally?
I use a method I derived from reading, AI, and intuition: SPLIT BRAIN mode. First I use a local program called repomix, which generates an XML representation of the local project (data, programs, etc.). Then I upload that to Gemini in 'smart' or thinking mode and ask a lot of questions, with the final prompt being something like: "I am doing project X and need to do blah and blah. Here are some pieces of information (schemas are VERY important). I want the local cline to ONLY code and not to do ANY architecture. Please generate an XML plan that I can feed to the model, and instruct me in which order I should prompt it." The XML paradigm seems to work REALLY well, and without the thinking part cline seems to code at a blistering 70 tokens/s (or 60 on a bad day), with up to 60,000 tokens so far. The main thing I have seen issues with is the schema generation (for instance, I was asking how to convert from Grav CMS to Zola and it totally borked on the formats because it was using older ones). It's especially important to include date cutoffs if you are using something 'current', or it will use old knowledge and you will be sad.
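As a sketch of that first step: repomix can be run straight from the project root. The `--style xml` and `-o` flags are my assumption of the relevant options - check `repomix --help` for your installed version:

```shell
# Pack the current repository into a single XML file to upload to Gemini
npx repomix --style xml -o project-context.xml

# (assumption) worth checking: an ignore option to exclude noise before packing,
# e.g. node_modules or build output, so the context stays small
```

The resulting XML file is what gets uploaded alongside the planning prompt.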
Hey y’all, I’ve been trying out gpt-oss-120B via AMD’s lemonade and opencode. I’m a bit new to this stuff, so wondering if someone could point me in the right direction (or correct my expectations) for some issues I’ve been having with it:
On refactoring tasks, it is very common that after the first or second edit, the model will make a “corrupt” edit. For example, if it’s rewriting a 50 line function, it will replace the whole function, BUT leave ~20 lines of the old implementation hanging off the end. A lot of time is lost as it notices this and fixes it. If it’s writing something from scratch, it’s fine. If it’s making tiny edits, it’s fine. There’s a sweet spot size of edit where there seems to be a high chance of it screwing up.
Occasionally, the model will make edits with literal “\n” written in the code, instead of actual newline sequences.
One role I'm trying to fill with an agent is one that scrutinizes files for potential bugs and documents them on the filesystem for ingesting by other systems. OSS is quite good at the reviewing part, but it's bad at following instructions; it only documents maybe 1 in 10 issues that it identifies in its thinking. I've tried prompting it to "Find ONE issue. Stop. Document it. Repeat" (paraphrased), and that helped a little, but it's still not really what I'm looking for. I might just need a new strategy for this (brain dump to file → another model picks it apart), but thought I'd mention it in case it could be improved…
Other models in my inventory don’t seem to have this issue, just OSS-120B.
I’ve tried a ton of variations on system prompt (stock opencode Build, custom ones I made up, prompts from others I found online, …), session prompt, model options, and it’s still not as reliable as I’d like.
Lemonade pulls ggml's distribution of OSS-120B. I've also tried LM Studio for the heck of it, and that actually introduced even more problems, like broken tool calls and odd issues where the raw Harmony output comes through…
Asked on the opencode Discord, but it doesn't seem like others are having the same issues.
Any tips appreciated, or at least to know if I’m not the only one…
I have had issues with oss120 and structured output, or having it understood. It notoriously breaks my code by not closing functions properly, so you are not alone. Any of these models, even the paid ones, start to degrade once the context length gets above 60k tokens. Here is my Rube Goldberg setup that I'm using for refactoring my personal PHP app, and it's working well:
Windows 11 debloated and light
Docker running PHP, MySQL, and opencode
LM Studio
Project loaded in PhpStorm, which automatically versions, so it's easy to watch changes in real time
Locally I can get a solid scaffolding and design built: HTML, JavaScript, CSS.
When things break I use the free large models on opencode to fix them if needed. I always start new sessions when the context size gets over 65-70k tokens. Baby steps. But read your code - if you can fix it yourself, do it.
Yea, in my case it doesn't really seem to matter how big the context is; it's not uncommon for 120B to mess up an edit when I'm only 10-20k tokens into the job. But good to know I'm not entirely alone on some of these things.
It's definitely a lazy model. What quant are you using? As an experiment, you could try Antigravity, which has it on its free tier, on a job you know it fails at, to see if you get the same results. But it definitely cuts corners, which I think is just baked into it.
Currently I am trying to optimize gpt-oss-120b:MXFP4 and glm-4.7-flash:Q6_XL performance. High prompt-processing speed on long-context tasks is my main goal right now. Squeezing the last bit out of the generation speed might be fun, but with such small models I believe it is crucial to give them as much high-quality context as possible. Waiting 30 minutes for them to crunch through 120k tokens kills a lot of use cases for me.
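For prompt-processing (prefill) speed specifically, the llama-server batch parameters are the usual knobs. A sketch of an invocation - the model path is a placeholder and the batch values are starting points to tune, not benchmarked numbers:

```shell
# Larger batch / micro-batch sizes mainly speed up prefill on long prompts.
llama-server \
  -m gpt-oss-120b-MXFP4.gguf \  # placeholder path
  -ngl 99 \                     # offload all layers to the GPU
  -c 131072 \                   # context sized for ~120k-token jobs
  -b 4096 -ub 2048              # batch / micro-batch; raise until VRAM or speed plateaus
```

Flash attention (`--flash-attn`; syntax varies between llama.cpp builds) is also worth toggling when measuring prefill throughput.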
$ uname -a
Linux frmwrk01 6.18.7-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jan 23 16:42:34 UTC 2026 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: n/a
Distributor ID: Fedora
Description: Fedora Linux 43 (Workstation Edition)
Release: 43
So… Currently the winner for performance on long-context tasks with GPT-OSS-120b is the llama-rocm7-nightlies toolbox with the following parameters:
I saw this problem on kyuz0's GitHub repo. Yesterday I already downloaded the ROCm 6.4.4 toolbox for testing, but did not get around to it. We will see if they fix it before I test 6.4.4.
I have an opencode setup with GLM-4.7-Flash as more of an architecting agent and qwen3-coder-next as the implementing agent. But GLM just does not perform well at all; it is super slow compared to qwen, which is way bigger yet way more performant. Anyone have suggestions for a good reasoning model to use instead? I'm looking for something in the 20-30 GB range.
I finally got ROCm 7.2 and kernel 6.18.9 working! ROCm is finally solid, and for the first time it seems at least as fast as Vulkan, if not faster. I ran it for about 1.5 hours and got up to 150k tokens, with prompt processing at 300 tps and responses at about 20-35 tps depending on where in the queue I am. I am running qwen3 coder next (80b) plus an encoder model (qwen 3 encoder) so ROO can do local indexing of my source trees. The main thing was compiling it properly (I don't like Docker) so that we get our proper memory. BEWARE the ides of AI.
#!/usr/bin/fish
# 1. Get the Code
# Checks if the directory exists. If not, clones it.
if test -d llama.cpp
    echo "Directory exists. Updating to latest master..."
    cd llama.cpp
    git fetch origin
    # Force reset to match remote exactly (discards local changes)
    git reset --hard origin/master
else
    echo "Cloning repository..."
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
end
# 2. Clean and Prepare
# Removes build artifacts and untracked files
git clean -xdf
git submodule update --init --recursive
# 3. Configure CMake (Strix Halo Optimized)
# -DGGML_HIP_UMA=ON: Zero-copy memory for 128GB APU
# -DAMDGPU_TARGETS=gfx1151: RDNA 3.5 native target
cmake -S . -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1151 \
-DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP_UMA=ON \
-DROCM_PATH=/opt/rocm \
-DHIP_PATH=/opt/rocm \
-DHIP_PLATFORM=amd \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_EXE_LINKER_FLAGS="-static-libgcc -static-libstdc++"
# 4. Build
# Uses all cores (-j) to compile
cmake --build build --config Release -- -j(nproc)
# 5. Verify
echo "Build complete. Verifying linkage:"
ldd build/bin/llama-cli | grep hip
This seems to build a nice, stable (at least for me) usable ROCm llama.cpp - at least for the model I tested, qwen 3 next 80b. This also assumes ROCm 7.2 is installed and working on your main system, and that the required libs are installed, etc.
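Once the build finishes, a quick sanity check along these lines helps confirm the HIP backend is actually being used rather than CPU fallback - the model path is a placeholder, and flag names may differ slightly between llama.cpp versions:

```shell
# List detected backends/devices; the Strix Halo iGPU should appear as gfx1151
build/bin/llama-cli --list-devices

# Then serve a model fully offloaded to the GPU (path is a placeholder)
build/bin/llama-server -m ~/models/qwen3-next-80b.gguf -ngl 99 -c 32768
```

If the device listing shows no HIP/ROCm device, recheck the `ldd ... | grep hip` output from the build step before debugging anything else.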
LOL, there must be something in the air tonight (for success). For the past two days I could not get llama.cpp (Vulkan or ROCm) to stop segfaulting after about 20-30k tokens when using opencode with qwen-coder-next q8_0.
I finally found the autoparsing branch of llama.cpp from pwilkins, rebuilt @kyuz0's rocm7-2 toolbox with it, and just got through an intensive planning session with opencode plus the superpowers brainstorming skill, going through about 350-400k tokens without a crash. Yay.
Keeping up with the ecosystem and landscape around AMD in this space is exponentially harder than with Nvidia. If I didn't know Linux and hardware well, I'd have been lost weeks or months ago - we have a long way to go to make this usable for normies.