Very interesting. I saw some online commentary that ~80GB VRAM / ~48GB system was worth trying, on the basis that holding a larger model is important, but it still needs to be “fed” by the main system (and so one cannot reduce the system down to a negligible level in order to max out the VRAM). Did you try any other ratios?
If your goal is code generation and software engineering productivity, local models on the Framework Desktop are definitely usable, especially with Strix Halo’s large unified memory. The main advantage is being able to run larger models (30B–70B class quantized models) that wouldn’t fit on many discrete GPUs with limited VRAM.
That said, for pure coding performance, Claude and GPT-class cloud models are still significantly ahead of what most local models can deliver today. A £1,500 GPU isn’t strictly required to run local models, but it will generally provide faster inference than the integrated Radeon 8060S.
My recommendation would be: use the employer-funded Claude or Copilot for day-to-day work, and experiment with local models if privacy, control, or learning is important to you. Local AI is already practical for coding assistance, but it’s not yet a full replacement for the top hosted models.
Yes, good thoughts: I will do that in the short term. My brain has decided to have pressing ethical reasons to move away from cloud AI, so I could not tarry forever; I believe I would accept a worse reasoning or speed for a perceptible ethical improvement. (How much worse, of course, is the big question - it needs still to be useful).
Part of my ethical view is the withdrawal of funding: it’s not merely that I don’t want to fund this or that thing, but I also don’t want my employer to do so on my behalf. My ethical stance demands that I remove the funding from my seat, even if someone will take the burden off me.
In the short term, I acknowledge that Anthropic are not too bad, so I can hang on for ~six months if necessary. I think I will order the top-spec FW Desktop in a month or so, when I shall be around to receive the parcel, but of course I will keep researching. There are spotty reports of hardware QC issues, so I will dig into those.
As a broad aside, I can see two different ways in which AI code gen can work:
1. The addition of a feature is tackled, commit by commit, as one would code manually. Today I did a piece of work in three hours that I think would have taken me over a day previously. A demo bit of data in a backend web controller, a placeholder SVG here, some routing there, some CSS reformatting, some database lookups, all on an iterative basis. That’s ~40 commits, but without much of the old cognitive overhead (e.g. the finer points of the grid layout system in CSS).
2. A very complex and comprehensive feature requirement, plus AGENTS.md files, to shape how the feature should be made in a one-shot attempt. This is the sort of AI that takes a few hours to run. I am not here yet, and maybe I won’t ever be.
I like the first option because each commit is quick - 100 to 6K tokens apiece. There is minimal waiting, and never so much that I lose focus. The second option is interesting, but there are no intermediate results, and if it gets something wrong, one might have to spend another few hours on another attempt.
My hope is that option one requires less reasoning, and thus it is within the bounds of the locally runnable models; not as good as the frontier models, I warrant, but perfectly good enough for my workflow. Plus, as a bonus, I feel that the direction of development for an artifact is still under my control.
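A back-of-the-envelope sketch of what "minimal waiting" might mean per commit on a local model. The generation rates below are hypothetical placeholders, not measurements from any machine, and the sketch treats the whole 100 to 6K token range as generated output, which is a pessimistic assumption. It also ignores prompt processing time.

```python
def wait_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response at a steady rate (ignores prompt processing)."""
    return output_tokens / tokens_per_second


# Commit sizes from the post above: 100 to 6K tokens apiece.
SMALL_COMMIT_TOKENS = 100
LARGE_COMMIT_TOKENS = 6000

# Hypothetical generation rates in tokens/s; replace with your own benchmarks.
for rate in (10, 30, 100):
    small = wait_seconds(SMALL_COMMIT_TOKENS, rate)
    large = wait_seconds(LARGE_COMMIT_TOKENS, rate)
    print(f"{rate:>4} t/s: {small:6.1f} s (small commit) to {large:6.1f} s (large commit)")
```

The point is only that the largest commits dominate: whatever rate a local model achieves, the 6K-token case is the one that decides whether focus survives.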
I never tried an 80/48GB split. The BIOS doesn’t directly support that split, so I think you would need the BIOS in auto mode and then allocate in Linux somehow (I haven’t researched it). I know people often use the Auto mode allocation in the BIOS, but somewhere I read that this can slow the initial loading of models.
80/48 could be useful for loading a larger model, say ~40GB, while leaving plenty of headroom in VRAM for a large KV cache and a long context length.
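As a rough sanity check on that headroom, here is a minimal sketch of the usual KV cache size estimate (one K and one V tensor per layer). The model shape numbers are hypothetical placeholders, not taken from any specific model, and the estimate ignores compute buffers and other runtime overhead. It uses GiB rather than GB for simplicity.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: a K and a V entry per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def fits(vram_gib: float, model_gib: float, kv_bytes: int) -> bool:
    """True if model weights plus KV cache fit in the VRAM allocation."""
    return model_gib * GIB + kv_bytes <= vram_gib * GIB


# Hypothetical model shape with an FP16 cache (2 bytes per element).
kv = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=131072)

print(f"KV cache: {kv / GIB:.1f} GiB")
print("Fits next to a 40 GiB model in 80 GiB:", fits(80, 40, kv))
print("Fits next to a 40 GiB model in 64 GiB:", fits(64, 40, kv))
```

With these placeholder numbers the cache alone comes to 32 GiB at full context, which is the sort of case where the extra VRAM in an 80/48 split would matter.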
I’m cleaning up my system documentation. It will cover the whole process: the basic install, tuning, and setting up the useful WebUI, TUI and Docker services I’m using. It includes Caddy for reverse proxy, llama.cpp, llama-swap, HF Downloader, and anything else I find useful in my exploration.
Your answer raises a basic question I’ve never seen answered definitively: how do you get ROCm installed on the Strix Halo 395+? Your link contains a compatibility matrix for ROCm, and unless it’s out of date, this iGPU (the Radeon 8060S) is not supported.
I have yet to find anyone who can tell me how to get ROCm installed on this platform without resorting to experimental 3rd-party drivers and configuration.
And “not recommend on linux” – where does this recommendation come from? The same non-supported ROCm link you provided?
I’d love to get ROCm installed and working to try it against Vulkan, but I have yet to find any instructions that actually work for setting this up on the Framework Desktop.
If you have a link, I’d appreciate it if you posted it.
Djip, thanks for pointing this out. I will look into it. I’m not an AI/LLM researcher, so if ROCm gives me slightly better t/s rates than Vulkan, I have to balance the effort to get it running vs performance improvements. Vulkan so far has been quite good for my needs and stable with pretty much all the models I’ve played with so far. But I appreciate the updated information. I also note that it’s only been validated on Ubuntu using PyTorch with FP16 models. So I’m curious if it can be integrated and used by llama.cpp and support other quantizations. And specifically, that documentation says:
Lower than expected performance may be observed while running some LLM workloads (such as Llama 31B/3B) on AMD Ryzen™ AI MAX+395 processors.
I have come across the hipEngine project before, but I have not tried to set it up. To be clear, I’m hesitant to deep dive into a 3rd-party implementation. I’m willing to be on the edge of development for LLMs, but only when it’s stable. If anyone can share experience with hipEngine, I’m interested to see what your results have been, but I’ve got limited time to fiddle with alpha-level versions of software.
But thanks and keep the information flowing, I’m always looking.
TheRock looks interesting, but I don’t have a real need for PyTorch use at this time. I do need some TensorFlow backend in the near future, but I haven’t even started doing the research for that project.