It’s using the Vulkan backend for llama.cpp and will expose an OpenAI-compatible API endpoint, so you can easily link it into opencode or Open WebUI. The opencode experience really does feel very snappy; I’m extremely pleasantly surprised. Will give this a go in place of Claude Code this coming week at work!
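As a quick sanity check of the OpenAI-compatible endpoint before wiring it into opencode, a request like this works against llama.cpp’s server. The host and port here are placeholders; substitute whatever your server is actually listening on.

```shell
# Minimal chat-completions request against the local llama.cpp server.
# localhost:8080 is an assumption; adjust to your own --host/--port.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```

If that returns a normal chat completion, any OpenAI-compatible client should work by pointing its base URL at `http://localhost:8080/v1`.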
Both unsloth BF16 and Q8 (to a somewhat lesser extent) get into the “// // // …” loop from Opencode. There are other reports of this on the internet. Could both of you share how you run the models (llama.cpp, Vulkan, ROCm, lemonade?), your llama.cpp arguments for the models, and how you use Opencode? Which Linux distro (if Linux at all)?
I’m on Omarchy (edge channel), which is Arch Linux under the hood. Running llama.cpp-hip from the AUR and lemonade’s llama.cpp ROCm builds (against TheRock builds). I thought it might be because Omarchy Stable was behind on firmware, but switching to the Edge channel put me on the latest packages with the same issue. Interestingly, ggml-org/GLM-4.7-Flash-GGUF:Q8_0 works well, but I’d really like the BF16 model. I tried TeichAI’s F16 distill of it, which is based on the unsloth build - same thing. Someone there reported that the Q8_0 kv-cache is causing it, so I briefly tried F16, but the prompt processing was so extremely slow (like 10x slower than Q8) that I didn’t wait for it to finish…
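For anyone wanting to test the kv-cache theory directly: llama.cpp exposes the cache precision as flags, so you can flip between q8_0 and f16 without changing anything else. This is just a sketch; the model path is a placeholder, and the surrounding arguments are whatever you normally run with.

```shell
# Hedged sketch: run the same model with an f16 KV cache instead of q8_0.
# --cache-type-k / --cache-type-v set the K and V cache precision;
# a quantized V cache generally requires flash attention to be enabled.
llama-server \
  --model /models/your-model.gguf \
  --flash-attn on \
  --cache-type-k f16 \
  --cache-type-v f16
```

Swapping `f16` for `q8_0` in both flags should reproduce the faster-but-looping configuration, which makes for a clean A/B test.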
It took me nearly 2 weeks to dial in GLM-4.7-Flash. Part of that was updates to llama.cpp getting pushed, and part of it was my arguments. I have had the best results using opencode with the following, and it is pretty stable for me. Speed is acceptable, but I am using Q8_K_XL and not BF16.
```
--server
--host "0.0.0.0"
--port "8080"
--model /models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
--jinja
--device "Vulkan0"
--ctx-size "131072"
--temp "0.7"
--top-p "1.0"
--min-p "0.01"
--gpu-layers "auto"
--kv-unified
--no-mmap
--flash-attn "on"
--repeat-penalty "1.0"
--sleep-idle-seconds "300"
```
Other things I have done: make sure to change your BIOS memory settings. Set the dedicated GPU memory in the BIOS to 512 MB, and change some kernel parameters so that most of your 128 GB of memory can be utilized (people say going higher than the values below causes crashes):
```
sudo grubby --update-kernel=ALL --args='ttm.pages_limit=27648000'
sudo grubby --update-kernel=ALL --args='ttm.page_pool_size=27648000'
sudo reboot
```
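For reference on where that number comes from: `ttm.pages_limit` is counted in 4 KiB pages, so 27648000 pages works out to roughly 105 GiB of the 128 GB, leaving the remainder for the OS. A quick check with plain shell arithmetic:

```shell
# ttm.pages_limit / ttm.page_pool_size are measured in 4 KiB pages.
pages=27648000
gib=$(( pages * 4096 / 1024 / 1024 / 1024 ))  # pages -> bytes -> GiB
echo "ttm.pages_limit covers ${gib} GiB"
```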
The last thing I still intend to do is turn the IOMMU off via a kernel parameter; people are reporting a 6% speed-up from doing that as well.
Why full-vulkan and not full-rocm? I’m trying to get Qwen3-Coder-Next running so I can compare the throughput of GLM-4.7-Flash vs Qwen3-Coder-Next, as I have some repetitive tasks I want to run against them and need decent throughput.
I ended up going with Vulkan as well, as I had run into issues with ROCm when trying to use vLLM. I found this pretty much worked as-is with Qwen3-Coder-Next and the 4-bit quantized version.