Which language models are you using?

@Michael_Edward_Davis one of the new qwen3.5 models seems to be working well for me; it just came out a couple of days ago. My current setup is opencode with oh-my-opencode and 3 different models in docker containers. Each of the agents in oh-my-opencode gets one of these 3 tiers – small for easy stuff like explore agents, medium for planning, large for heavy coding:

small - qwen3-4b-instruct
medium - qwen3.5-35B-A3B at Q5
large - qwen3-coder-next at Q4

In about 3 weeks' worth of time I was able to recreate an app entirely by vibe coding (and I started with zero knowledge of running local LLMs). A lot of that time was spent tinkering with settings, trying different models, learning how to use the agents, and even working around a lot of diff errors when these new models came out. If I'd had it all dialed in like I do now, it probably would have taken a week or less. I am quite impressed to have made a working product using almost exclusively local models on 1 Framework desktop. For the GUI, I created a prototype in about 15 prompts with the local model, then fed it to Claude Opus 4.6 and told it to make me a more professional-looking one… it gave me the end result with just 1 prompt.


one of the new qwen3.5 models seems to be working well for me, just came out a couple days ago.

I tried Qwen3.5-122B-A10B at Q6 on Linux using llama.cpp and it works fine for me with 128 GB of RAM after I applied the following tweaks:

  • The model needs 106 GB at Q6, so I had to increase the ttm.pages_limit kernel parameter from 25165824 (96 GiB) to 29360128 (112 GiB).

  • I use the vk_radv shim from the amd-vulkan-prefixes AUR package. This shim has been broken since February 2026, when vulkan-radeon driver v26 dropped and renamed /usr/share/vulkan/icd.d/radeon_icd.x86_64.json to radeon_icd.json. The shim doesn’t pick up the new file name, so I had to patch it.
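ttm.pages_limit counts 4 KiB pages, which is where the 29360128 figure comes from. A quick sketch of the arithmetic, plus one way to make the setting persistent (the modprobe.d path is an assumption; adjust for your distro):

```shell
# ttm.pages_limit is in 4 KiB pages; compute the value for a 112 GiB cap
GIB=112
PAGES=$(( GIB * 1024 * 1024 * 1024 / 4096 ))
echo "$PAGES"   # 29360128

# To persist across reboots (assumed path; distro-specific):
#   echo "options ttm pages_limit=29360128" > /etc/modprobe.d/ttm.conf
# or append ttm.pages_limit=29360128 to the kernel command line.
```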

The following server command line works for me:

Coding tasks:

vk_radv llama-server \
  -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q6_K_XL \
  -c 16384 -t 16 -cram 0 -np 1 \
  --min-p 0.00 --temp 0.6 --top-k 20

General tasks:

vk_radv llama-server \
  -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q6_K_XL \
  -c 16384 -t 16 -cram 0 -np 1 \
  --min-p 0.00 --temp 1.0 --top-k 20
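Since the two invocations differ only in --temp, a tiny wrapper can pick the sampling settings per task. The `serve` function name and the coding/general labels are my own; the flags are taken from the commands above, and the echo keeps this a dry run:

```shell
# Hypothetical wrapper around the two commands above; it prints the
# command instead of executing it (drop the echo to actually launch).
serve() {
  case "$1" in
    coding) TEMP=0.6 ;;   # coding tasks: lower temperature
    *)      TEMP=1.0 ;;   # general tasks
  esac
  echo vk_radv llama-server \
    -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q6_K_XL \
    -c 16384 -t 16 -cram 0 -np 1 \
    --min-p 0.00 --temp "$TEMP" --top-k 20
}

serve coding
```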

How fast is token generation for 122B-A10B? I tried UD-Q4_K_XL on llama.cpp but the model must have gotten stuck in a loop, because it didn’t finish loading after an hour. I’m wondering if it’s worth trying again at a different quantization.

On Linux using the RADV driver, UD-Q6_K_XL infers at ~ 17 tokens/s for me in llama.cpp-vulkan.


Thanks, I’m going to have to look into using Vulkan. I get ~15.0 t/s using llama.cpp-hip (ROCm) on Linux.

I am using llama.cpp Vulkan on Windows and getting 20-25 t/s … don’t forget to disable thinking ;o)

c:/llama.cpp.vk/llama-server.exe --host 0.0.0.0 --port 8123 ^
  --model C:\Users\Admin\.lmstudio\models\mradermacher\Qwen3.5-122B-A10B-heretic-i1-GGUF\Qwen3.5-122B-A10B-heretic.i1-Q4_K_M.gguf ^
  --chat-template-kwargs "{\"enable_thinking\": false}" ^
  -c 81920 --keep 1024 --no-mmap --flash-attn on ^
  --cache-type-k q8_0 --cache-type-v q5_0 ^
  --context-shift --metrics --ubatch-size 3072 --batch-size 3072 ^
  --mmproj C:\Users\Admin\.lmstudio\models\mradermacher\Qwen3.5-122B-A10B-heretic-i1-GGUF\Qwen3.5-122B-A10B-heretic.mmproj-f16.gguf
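One thing to watch when copy-pasting that command from the forum: the curly “smart quotes” around the kwargs will break parsing, since --chat-template-kwargs expects straight-quoted JSON. A quick sanity check of the string before launching (uses python3 purely as a JSON parser; any validator works):

```shell
# Verify the kwargs string is valid JSON before handing it to llama-server
KWARGS='{"enable_thinking": false}'
python3 -c 'import json,sys; print(json.loads(sys.argv[1])["enable_thinking"])' "$KWARGS"
# prints False
```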