AMD AI Max+ 395 128GB with Cline

I’ve also posted this article on another website.

I’m asking for suggestions on running a local LLM for Cline agent coding, since there’s not much info online and GPT and Claude don’t seem like reliable options to ask. I’ve read almost everything I can find and still can’t reach a definite answer.

I’m in one of the late Framework Desktop batches, and I want to try out local LLMs when it arrives. I primarily use Cline + Gemini 2.5 Flash for Unity/Go backend work, and occasionally for languages like Rust, Python, and TypeScript if I feel like coding a small tool for faster iteration.

Would it feel worse on a local server? And what model should I go for?

A local LLM server will feel much slower. Much slower.

As for which model you should run, if you’re disappointed with Claude, that might be difficult… It’s unlikely anything running locally can do much better.

How about trying some models on OpenRouter first?

They are all unreliable. AI tools are exactly that: tools. You still need to know what you are doing. I regularly correct any LLM I’m using, but the upshot is that its responses improve over time, and even though I have to correct it, the LLM saves me time by creating files I can edit quickly that would take me a long time to create wholesale from scratch. No matter what you do, though, the human will still need to do the final heavy lift. I just get to do a lot more heavy lifting and a lot less pure grunt work.

I never said I don’t like Claude; I just happen to need to run AI inference offline.

In fact, if I ever get a job in the future, I’ll consider buying a yearly Pro plan and using their services to code often.

That doesn’t quite answer my question, so I’ll be more clear.

I know they’re unreliable, but Flash does pick up my work and write the code in my structure, the way I want it to.

Can a local AI achieve that? That’s the question I want answered.

Oh, I get what you mean now. What my article said about Claude refers to the chats I did while researching the Framework Desktop; I currently use the free version of their service.

MoE models like qwen-coder 30B-A3B and GPT-OSS-120B (and even -20B) should work well on the Framework Desktop. I mostly use qwen-coder, because it’s very fast on my existing system (with an RTX 4090), but I really like gpt-oss-120b (although its chat format confuses Cline sometimes).
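
If you end up serving one of these through llama.cpp, a minimal sketch looks something like this; the GGUF filename is a placeholder for whichever quant you download, and the numbers are just starting points:

```bash
# Serve a local GGUF through llama.cpp's OpenAI-compatible server so Cline
# can point at it. -c sets the context window, -ngl offloads all layers to
# the GPU, --port picks where the API listens.
llama-server -m ./Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -c 32768 -ngl 99 --port 8080
```

Cline’s OpenAI-compatible provider can then be pointed at http://localhost:8080/v1.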

Would you be able to perform complex tasks with it?
Let’s say I have a very detailed PRD with the system design layout completed; would it follow it?

There is a recent blog post from Cline about this that was pretty good: Cline + LM Studio: the local coding stack with Qwen3 Coder 30B - Cline Blog

They suggested using Qwen3 Coder 30B, and they have an option for a “compact prompt” to reduce context use. I’ve only started playing with it a little, so I don’t have a good impression yet of how useful it can be; it’s definitely slower compared to cloud models, but not unusable.
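
As a quick sanity check before wiring up Cline: LM Studio’s local server speaks the OpenAI API on port 1234 by default, so something like this should get a completion back (the model id is a placeholder; use whatever id LM Studio lists for the model you have loaded):

```bash
# Smoke-test LM Studio's OpenAI-compatible endpoint.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder-30b",
    "messages": [{"role": "user", "content": "Write a Go hello world."}]
  }'
```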

Note that to get the model to load with full context size, I had to increase GTT limits (on Linux), as shown here: iGPU VRAM - How much can be assigned? - #7 by lhl
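
For reference, the change amounts to kernel parameters; the values below are illustrative for a 128GB machine, not recommendations, so check the linked thread for numbers that match your kernel and RAM:

```bash
# In /etc/default/grub: amdgpu.gttsize is in MiB (deprecated on newer
# kernels in favor of the ttm limits); ttm.pages_limit counts 4 KiB pages.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gttsize=122880 ttm.pages_limit=31457280"
# Then apply and reboot:
sudo update-grub
```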

Looks good, I’ll try it on my M4 32GB and share my thoughts later, ty.

It’s hit and miss, frankly. In my experience, even Claude 4 struggles with large tasks, so I prefer breaking them down into smaller, localized chunks instead. I think gpt-oss follows instructions better, but both can stray pretty quickly.

I know, I know.
I just have trauma from Copilot with GPT-4o myself.
That thing is total miss and miss.

The M4 Air can’t handle that lol, looks like a fan is still needed.