How's your real-world AI (code generation) on the Framework Desktop?

I’m looking to step up my AI usage for software engineering, partly from some moderate pressure from work, and partly because my domain is changing and I probably need to embrace it to avoid becoming a dinosaur. I believe I have choices of Claude or Copilot at work, and these will be paid for me. Or, if I want to choose my own AI, that will probably be funded. I have ethical concerns about some of the major LLM providers so am wondering if I could run something usable myself.

I currently use free chat-only models from OpenAI and Meta via Duck AI, and they are very good. I’ve used those tools to ask how feasible it is to “run Claude locally” and have gotten conflicting answers. The first is that a desktop PC with a GPU and (critically) unified RAM architecture is a game-changer, and this will perform well. As readers will know here, that describes the Framework Desktop. The second view is that a good PC with a chunky £1,500 dedicated GPU will do a 10x speed-up, and there is no avoiding this cost if one wants to run models locally.

Interestingly AI assessed the Radeon™ 8060S Graphics (in the Framework Desktop) as fairly pedestrian, and not particularly helpful when doing complex reasoning or code generation tasks.

I have tried Llama and Codellama (7B) in CPU-only mode, and to be honest I was quite impressed with some basic code generation experiments, even it was rather slow on my ancient 32GB Dell laptop. However given that I have no unified arch nor impressive GPU I am not really able to do the more solid local testing.

I can see some forum discussion around what tokens/sec people are getting, but I wonder if I find that rather theoretical: what I am most interested in is whether local reasoning and code generation is feasible on Framework yet.

Related:

1 Like

There is quite some related info and good pointers in this thread:

I just got a FW Desktop and I am starting to get into this too (have no interest in gaming, just AI). Have been using coding agents/assistants for long now, and would like to have some of this running locally.

1 Like

Good finding, thanks! One that escaped my searching :zany_face:

I’ve read through that thread, and I am not sure I am seeing a categorical assertion that yes, local code gen is feasible on good hardware with current tooling. Maybe I am trying to distill an unavoidably complex topic into a binary answer!

I wondered if the lower-spec Framework Desktop, paired with this monster, would put some rocket-fuel in Ollama! I would normally go for a higher spec FW, but external graphics cards seem to be very expensive.

Update

From what I can tell, Framework Desktop is now only available with AMD, and this design does not support Thunderbolt. Thus it looks like FD will not realistically support external GPU cards.

Also related: Running ollama in docker on our Framework Desktop using the GPU

This post from not even a week ago seems highly relevant: MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo

In general you can take a look at https://strix-halo-toolboxes.com - a ton of useful information there.

Personally mostly use online models and have not tinkered with local LLMs for some time now. But the benchmarked toolboxes in the video seem promising.

1 Like

Super, I will watch that tomorrow; thanks! By coincidence I saw a trailer for that video in this one, which is very good.

For the sake of thread conversation: one of my thought processes is that I am not sure what ethical minefield one is jumping into by funding the tech-bro AI companies. I think we’re going to get a reckoning with forthcoming price-increases anyway; engineers who’ve been making hay with running sub-agents overnight may have to rein in their wild abandon when the subsidies run out.

That said, I will probably grab one of the least-worst options (Claude) and learn how to use it for a few months. I can see me pushing the button on local LLMs, and it would be a fun project to experiment with, but I wonder if there is an early-adopter time/price tax at present.

Oh boy. AI ethics are a minefield for sure. Negative impacts all around (energy, environment, workforce, etc.) and so much stupid hype over-promising and under-delivering. Not even going into what data models were trained on and impact on artists, and all creators really. On the other hand it’s something that is incredibly useful in certain applications already - with very good results). On some level reminds me of blockchain hype and real opportunities (e.g. cutting out traditional middleman in finance), with so much shady stuff going on and so many people getting fucked over.

About the early-adopter time/price tax. In software engineering LLMs are incredibly powerful. Claude Code, Cursor, Google Antigravity, Mistral Vibe, what have you don’t really have high hardware demands and are pretty easy to set up. Depending on how sensitive your data is you might want start learning on a dedicated machine. If at some point you think you’ll depend more on agentic coding going forward IMO it might be worth it to get a smooth local setup going. If for whatever reason the entry barrier for the big players gets bigger over night (heavy price increases, losing access, etc.).

The main reason why I currently don’t pursue a local setup is somewhat silly. It will definitely demand more performance of my Desktop than with my current regular use. I’m dreading the near constant fan noise that’s very likely to happen with it :slight_smile:

2 Likes

Good thoughts, all agreed.

I plan to set this up on an entirely separate computer - would be nice to use the Framework Desktop, or maybe a more traditional gaming rig with the funky glass side and glowing components neon-lit within. But it could be in a different room to my office - I’d just be accessing it over the LAN. Importantly I’d share my fibre connection IP on a dynamic DNS domain, and set up some kind of auth system, so I can access my LLM from outside the house too.

With that kind of setup your initial question is easy to answer: FW Desktop is more than capable enough to use any external code generation LLMs and usual software development workflows. Definitely also fine to tinker with local LLMs.

If you have a dedicated machine you also might look into Hermes Agent which is gaining a lot of popularity lately. Feels pretty similar to a locally run Claude Code and a bit less rough than OpenClaw.

Ah yes, perhaps I was not clear. My end purpose is in wanting to rely on local LLMs entirely, so as to negate the question of what killer-robots mania Karp and Musk etc are planning for humanity (and how I can avoid funding the same).

From my early research, running 70B and 100+B models is viable locally, and I’ve no doubt that small coding tasks like, say, reformatting a JSON array is possible. But for my next stage of AI competence, I want to be able to develop a whole application feature using local AI; I know remote systems can do it, but I want eventually to end my Claude subscription. This requires the LLM to hold a lot more context in memory, and to understand the structure of a large codebase, and I am not yet sure that local models can do that yet.

If you look into Hermes with a Qwen 3.6 model let us know about the performance :slight_smile: From my (little) research that seems like one of the best open source & locally run approaches at the moment.

1 Like

I’ve been using a Framework Desktop as an Ollama+OpenWebUI (+SearxNG+Tika) server, and for code generation I have Hermes Agent, Opencode, and Ollama running on a Framework 16 (AMD GPU).

The desktop Ollama runs the big models (generally quants of Gemma4, Qwen3.6, Laguna-XS in the 30-70GB size range), and laptop Ollama is used for the smaller exploratory models, embedding models, etc. This is easier to manage in Opencode than in Hermes, which seems to like using one model for everything - but Hermes can farm code-generation tasks out to Opencode. Smaller quants of the models on the desktop can be run on the laptop when travelling, should the need arise.

I don’t do benchmarking and haven’t recorded the tokens-per-second, as I just don’t care. The performance is acceptable. Big tasks can take many hours, sometimes overnight. Small tasks are done in a few minutes. Design discussion is basically real-time. The quality is generally pretty good: far, far superior to what I was getting without agents.

A side note on the use of OpenWebUI, as I inverted the usual approach. I defined a few Knowledge collections containing design documents, regulatory protocols, research papers, support tickets, etc. I then created Models based on ones that Hermes plays well with (Gemma4 and Qwen3.6) and assigned one or more Knowledge collections to them: this gave me Domain Experts. I made these models public and available via the OpenWebUI API, and added that as a provider in Hermes agent. Then created Hermes Profiles for each model, effectively making each Domain Expert model a sub-agent.

Huge productivity surge, accompanied by a huge productivity drain as I find more and more ways to customize HermesAgent and OpenCoder.

EDIT:

Totally possible using the setup described above. I’ve been seeing a 2-3 day turnaround time from spec to implementation for in-house software projects. Look into memory systems like Hindsight for use with Hermes, and look into how Hermes breaks a problem into subtasks and orchestrates sub-agents in order to handle the huge-codebase problem.

Frequently-used models on the desktop:

gemma4:31b 19 GB

qwen3.6:35b 23 GB

qwen3.6:27b 17 GB

qwen3.5:122b 81 GB 

llama4:16x17b  67 GB 

GLM-4.5-Air-LLM:latest  72 GB 

gemma4:26b 17 GB 

laguna-xs.2:Q8_0    37 GB 

gemma4:e4b  9.6 GB 

laguna-xs.2:Q4_K_M  23 GB

laguna-xs.2:BF16 66 GB 

The 122b Qwen model really pushes the system and takes a bit longer, but is generally worth it for in-depth code/architecture review. The Gemma 31b is a solid working model and good for code review, but the Gemma 26b and the Qwen 27b are my daily drivers (note: the Qwen 35b is actually a step down from the denser 27b). The Laguna BF16 is amazingly good, and the Q8 quant was my daily driver before using agents, but Opencode had problems with it and I have not tried it on Hermes yet. Hermes been busy.

One more point: the context window can really be the killer with these things. I found that Ollama serving three parallel requests of 256K context windows for Gemma4:31b was more than the desktop could handle. Current config is three parallel requests at 160K and it performs well, though for a big task I’d probably restart ollama with a single request at 256K.

1 Like

Some more on the non-technical side of this:

A major motivation for me to invest in some local AI capacity is that you can be sure the cloud services will see some hefty price increases in future.

Currently, the service providers sell their remote LLM/AI assistant products at a huge loss, what you pay now is basically dirt cheap, they are all making a big loss. The plan behind this is of course to trick everybody to get into this, and reduce their staff thinking they can save a lot of money this way. Later, when companies become dependent on the stuff, they will squeeze the life out of their customers for sure. AI is the optimal product for this, as it is even much harder to operate locally efficiently than traditional IT infrastructure (and you have recently seen big price hikes in this area as well, now that many companies have gotten rid of their hosted and in-house data centers, and rely on cloud providers 100%).

There have been some price hikes recently, that indicate the direction. But there will be much more to come, expect these services to increase in price by factors of 5-10 over the next 3-5 years. Then you will be happy to have managed to run some of the stuff independently of the US cloud tycoons. BTW: for remote AI products, some Chinese services offer huge price advantages over their US counterparts - wherever you need to rely on large LLM services, that is an option (although this is often not acceptable for companies, independent experts have little to no regulative/strategical barriers to China business).

So for me, this is an investment. We’ll see how well that works, I am optimistic. Anyhow this is IMO a no-brainer to at least invest some effort here.

1 Like

Thanks for your thoughts @edf - very useful.

This is a very interesting point. Currently I am only doing tasks with cloud AI where the response will come back in 30 seconds. I then incorporate that code and repeat; so this has to be fairly real-time in order to maintain a decent focus/workflow. I’ve heard of folks using Cursor/Claude projects where they let it cook overnight, but I am not sure yet how to guard against a wasted long run; I assume one would have to correct the prompt and then wait for another several hours.

How are you finding energy consumption? I am in the UK where energy is expensive (and I am environmentally minded enough that I’d want this to be efficient anyway). My AI-assisted design of a traditional Nvidia GPU PC showed that it would need a ~1200W power supply, though it’s interesting to see that Framework Desktop only needs 400W.

(I suspect my current use of AI is bursty, so my power consumption is likely currently negligible. But running GPUs for several hours would change that assumption.)

The area I live in has some of the highest electricity rates in the US, so I hear ya.

I don’t have it plugged into a Kill-a-Watt measurer , but there has been no notable increase in power usage. The release of Hermes coincided with warm weather, which makes it diffcult to say for sure, but the KWh on my power bill this month was lower than it’s been in two years so it can’t that bad :slight_smile:

BTW, I was experimenting with context windows last night, and found that the amount of memory used varies not only by quantization of the model, but also by the model type. Probably due to support for fast_attention in the model itself - Laguna in particular being notable for not compressing when the cache quantization method in Ollama becomes more aggressive.

I tweaked things a bit (amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856) this morning and now can load two parallel instances of either Laguna-xs.2:FP16 orQwen3.5:122b entirely in GPU memory using 256K context windows (well, Laguna caps out at 130K; I might alter its Modelfile and see how it behaves at 256).

The longer context models are also the ones that run longer, so getting them 100% in GPU memory should improve turnaround time a bit.

The question of context windows is an interesting one. I was recently playing around with the free tier of Thaura, a cloud LLM that promises an ethical constitution. I set it a puzzle relating to a game similar to Towers of Hanoi, involving a little text parsing, and it struggled so often I had to constructively give up. The default OpenAI model at Duck AI understood the problem with one or two nudges (and then wrote a working brute-force puzzle solver in JavaScript).

My assumption currently is that Thaura has a relatively small context window. It would accept a correction in its understanding, but after several more prompts, it would forget that correction, and then go off-piste again. A very frustrating game of whack-a-mole.


You might be positioned to answer this question for me, since you’re delving into the technicals: my research so far, much of it AI-powered itself, has not been straightforward, and some of it has been contradictory. That’s fine; it’s a complex and fast-moving topic. It looks like a chunky 24GB/32GB GPU card is thought to carry out code-gen tasks much faster than the Framework Desktop architecture. However, would you assume that such a GPU would also produce better results (e.g. improved reasoning, less hallucinations, more right-first-time results etc)? My limited understanding is that larger (more parameters) need to fit into GPU RAM, and thus the Framework Desktop places a limit on what can be achieved.

It may be that Framework Desktop is “good enough” for your purposes, and I consider it to be a more ethical/repairable purchase than a standard PC (even Nvidia can be seen as problematic). Unfortunately I have just discovered that a Thunderbolt GPU cannot be plugged into an AMD machine - it is only supported in Intel architectures (there are various workarounds detailed in this forum, but they look rather experimental). So I think one would have to find an internal GPU for the tiny space available, or make do with the arch as it comes.

Let’s not get too caught up in hardware selection. If you can justify the expense of 128GB worth of Nvidia GPUs and the rig or cluster needed to run them, knock yourself out - it will certainly outperform a Strix Halo or DGX Spark platform. The main appeal of these platforms is they allow you to run medium-sized (30-100b) models locally for a fraction of the cost of an equivalently-sized set of GPU cards, and without the orders-of-magnitude slowdown that comes with running entirely on CPU.

Likewise, let’s not concern ourselves with whether to run models in the cloud or locally. There are many arguments for running locally which have nothing to do with ethics or performance: my work, for example, cannot by contract be used on cloud servers, so use of cloud-based AI is out for anything but broad brainstorming. Of course, an interest in the underlying technology, a habit of tinkering, amd a drive towards simplicity are additional factors.

So we’re here and we’re talking about Framework desktops and the models which can be coerced to run on them (which seems to cap out at the MiniMax M2 series - I have not yet found a way to run them without resorting to egregiously low quantizations, or a confined context size).

Now, I know some pedant is going to come along and correct this no matter what I write, so let’s just dive right in. First up, number of parameters. Personal observation indicates that any model with at least 10b parameters contains the “copy of the internet” that makes it perform like a search engine. Reasoning requires additional parameters: the “better”/stronger the reasoning, the more parameters required, with reasoning starting to be usable at around 30b (try the various Olmo and QWQ sizes to observe the improvement). Tool use also requires additional parameters, with more parameters translating to more accurate tool selection and invocation, and again the 30b-sized models seem to be the practical minimum. There are model variants like Mixture of Experts (MoE) and “dense” models that can do more with less, but for simplicity let’s just say 30GB is a good minimum model size, and that anything larger than that should provide more reliable tool use and more cogent reasoning (not necessarily more knowledge, which is what is often assumed).

Next, context window. In your puzzle experiment, you encountered two things: a context window limit, and the context compaction/compression meant to compensate for that limit. The latter is important to know about, as every chat with an AI will eventually bring it into play. Briefly, there is a smaller model that takes the context window (the chat so far, all uploaded documents, etc) and condenses it to a summary text which will fit in the model’s context window. Depending on how this subsystem works, the conversation could degrade over time, as the model creates a summary of a summary of a summary, or the model discards one-time responses that are key to the problem at hand but which statistically appear unimportant (for example, because the user finds them so obvious they do not discuss them further).

The context window itself brings us back to the amount of GPU memory required. A 256K context window can require 25GB of GPU memory per instance - again, this is from observation, and it depends on both the quantization of the model and that of the OLLAMA cache (or equivalent in other software). I found this surprising, as it implies that 256-thousand tokens takes 25-billion bytes of memory or 100K per token, which just seems insane. So there is clearly more to context windows than I am aware of, but it’s a lot, and it has to be in the GPU memory just like the parameters. This is often expressed as, “you can have a large model or a large context window, but you cannot have both”, though with a unified-memory architecture like Strix or Spark you often can.

So, to take the long winding road back to your question: the size of the model before quantization determines the quality of the results it produces, and included in that size are the capabilities required to produce those results (reasoning, tool use, etc). Quantization will reduce that quality, but the general view seems to be that degradation only becomes noticeable at Q3 and lower - so it is better to run a Q4-quantized large model than a full-sized small model. The context window determines how much raw data you can give the model to work on. The choice of discrete GPU cards versus unified-memory architectures does not impact either: discrete cards will perform the computation faster, unified memory will allow larger models and context windows to be used.

My personal view on the matter is that AI is THE slowest way to solve a problem computationally. If you want a result computed quickly, use standard software. AI models save you time by reducing the actual work you need to do, not by being amazingly fast :slight_smile:

EDIT: While working on getting MiniMax Q3 working, I asked it about context windows, because attempting to search for implementation details on Google just got me four pages of the same AI-generated cursory overview.

Some highlights:

  • When you set num_ctx=131072, Ollama allocates memory for KV-caches
    across all possible layers, not just what’s needed for your current
    session

  • The KV-cache grows proportionally to: (layers × context_length × hidden_dim)

…hence the wildly different memory requirements for the same context size across different models.

3 Likes

I purchased the desktop as a server platform for LLM models (and other useful things). I just spent a few days over the weekend getting mine “generally” setup as I wanted.

To get what I consider a solid, performant setup was not trivial. I have written a full markdown document that walks through the process for my own reference. If anyone is interested, I can make it available.

I have the framework with the 128GB memory variant, of course.

Notes:

I set the bios to 64GB VRAM/64GB System RAM. To get the best performance out of llama.cpp it appears that being able to memory map the model files and when using a MOE model, not having enough system RAM can fail loading the model into VRAM. I tried a few at 96GB VRAM that could ‘technically’ fit, but really don’t.

Linux Distro: Arch Linux + CachyOS packages and kernel

I installed the latest Arch linux minimal system for a Server setup. Then I added CachyOS repositories and signing key (that was a headache btw) and updated my system packages and kernel to use the optimized CachyOS binaries. Arch linux (and most mainstream linux distros) do not compile the packages and kernel to take advantage or modern CPU x84 improvements (AVX-512, SIMD, etc). When you install the cachy binaries, the installer detects your CPU architecture and gives you the performance versions of the binaries for your system, not the generic versions. They also have a tuned kernel version for server. And their repo is generally on the cutting edge of releases – need for the latest package improvements in the LLM world.

This whole process was not trivial and using an AI helped, but this world is moving so fast you get a fair amount of outdated information from an AI query and have to dig to get current help. CachyOS is a bit of mixed bag of loosely coordinate changes, took me quite a while to find the correct keysign ring keys and repo URLs for pacman.

I tuned the systctl.conf for best LLM performance against some published guides. This also included some tuning kernel parameters specific to the AMD processor and memory configuration. Default arch ulimit values are way to low for llama.cpp and those had to be adjusted upward.

I git cloned llama.cpp and built the distro. You have to be careful to force the build to NOT build the ROCm support and use Vulkan instead. ROCm is not supported on the AMD 395+ yet. I tried to install ollama also and gave up. I’m sure I could find a docker image to make it easier, but the ollama install script and default distro binaries want to fallback to ROCm and then when it can’t find it, drops to the CPU. It’s not easy to get Vulkan up and running and I finally just moved on, llama.cpp is a bit more low-level to get models (manual download us hf - huggingface-cli) but works fine.

Once llama.cpp was up and running on a test model and I could confirm it loaded layers into the gpu (amdgpu_top is your friend), then I messed with a bunch of models. The only one that I failed to load was Qwen3-Coder-Next. The Q4_K_M version tells me that their is some sort of layer architecture that is not support on the Vulkan so only 4GB of 50GB loads into the GPU. Surprisingly even mostly running off the CPU I could still get 12-15t/s running a few simple prompts.

I haven’t dug deep into model choices yet, still experimenting. I’ve setup llama.cpp as a service and I’m planning to install Open Web UI on the machine to interact with the model. I’m also planning to try and load 2 models that will fit at the same time, so that I could have a model for general research and a model for agent operations.

GPT-OSS 120b surprised me; quite performant but it only loads about 7GB on the VRAM when optimally setup. It’s a MOE model and actually loading it completely on the VRAM can cause memory fragmentation and thrashing. There’s a llama.cpp setting to load the appropriate layers on the VRAM and leave the rest on system RAM. It threw off when it performed quite well t/s and I only saw it using like 7GB of VRAM.

I’m planning to use VSCODE plugins to talk to the API endpoint for agent coding from my Windows desktop. Yes, I know I’m still running Windows for my desktop environment (tried Omarchy, but I just don’t have the patience to relearn my whole ingrained methods – I’m a neovim guy but the tiling window manager has a steep onboarding curve). I also didn’t want to burden the framework with X/Wayland and eat part of the VRAM for that.

So far, I’m very impressed with my results. My desktop has a 4080 w/16G VRAM so I’ve been able to play with the smaller models, but this opens up some great new possibilities. At some point I need to “seat of pants” benchmark my 4080 and the framework using the same model.

Cheers

-Jeff

2 Likes

Ah, thanks for this! This has helped me appreciate why I got such conflicting AI advice; I think I was not particularly specific as to what I wanted, and to be honest, I probably didn’t have enough grasp on it to ask the right question accurately.

My understanding now:

  • Unified RAM architectures with some GPU are still better than CPU-only
  • Nvidia GPUs at 24-32GB are indeed x10 speedy, but this does not avoid the fact they can’t fit larger models into VRAM
  • With costs being equal, unified architectures will do a better job of reasoning than a Nvidia GPUs at a similar cost, because unified architectures will have at least double the amount of VRAM

With this trade-off in mind, I amused myself by looking for a Nvidia GPU with ~80GB VRAM. On eBay UK, the prices are all over the place (depending on whether they’re used/refurbed) but I’d guess the median cost is £9K (without a decent machine to run them on). Ouch :unamused_face: