I am enjoying exploring the capabilities of the Framework Desktop for running local LLMs.
My use cases so far have involved developing simple applications that require embeddings and chat completions; examples include experimenting with Semantic Kernel and building a bespoke prompt-evaluation pipeline.
I run my models using LM Studio on Windows 11 with either the ROCm llama.cpp or Vulkan llama.cpp runtime, choosing between them based on the stability of the latest version of each (for example, I attempted to load openai/gpt-oss-120b on ROCm this morning and it failed, although it had previously worked).
The largest model I have run which delivered acceptable performance is openai/gpt-oss-120b.
A model I was keen to try, but which hasn’t performed so well, is mistralai/devstral-2-2512. However, mistralai/devstral-small-2-2512 delivers acceptable performance.
I am curious to know what models people are using and what they are used for; I’m looking for inspiration for my next project to make the most of this APU.
Which models are you using and how do they perform?
I don’t have the FW Desktop, but I am trying gpt-oss:20b on my “gaming” desktop, using it as my soon-to-be Alexa replacement in Home Assistant. So far it’s pretty awesome.
Alexa to this day cannot keep context in a conversation, and this local LLM can. It’s quite amazing.
I have mainly used Unsloth’s version of gpt-oss-120b on my FW Desktop. Even the unoptimised gpt-oss-120b fits into memory; the Unsloth build takes about 70 GB, and I think it’s still faster on prompt processing etc.
I need to start looking at slightly larger models too, but I haven’t had much time to check those out.
I’m mainly using LM Studio, and ironically, mostly the Nvidia Nemotron model. I’ve only used Vulkan, as I’ve never seen ROCm actually work; like a lot of AMD software stacks, it is under heavy development and probably won’t have “set it and forget it” stability for years.
I was using Arch when AMD released the Radeon HD 6000 cards 10 or 15 years ago. I remember because AMDCCCLE had zero Xorg support for those cards nearly a year after release; I could only run a terminal, no GUI.
I have been able to run ROCm on Windows only, while Linux (Fedora, CachyOS) only allowed me to run Vulkan. On gpt-oss-120b, performance was about 50 to 53 t/s on both with fresh prompts. There is no clear winner for me yet.
GLM-4.7-Flash just came out, and so far my experience with it has been really good. My setup runs llama.cpp and Open WebUI in Podman containers. I’m running the Unsloth UD-Q8_K_XL variant and using it with opencode. The model’s speed and expertise seem really good for a local solution of its size.
I had a lot of trouble tuning my settings to get acceptable speed; it took me two days to dial it in. So if anyone wants to try, here is my podman-compose file. I am running Fedora 43 Workstation with Podman on the AI Max+ 395 128 GB config.
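For readers who can’t open the attached file, here is a minimal sketch of what a llama.cpp + Open WebUI compose file for this kind of setup can look like. To be clear, this is my own illustration, not the poster’s tuned settings: the image tags, device mapping, model path, and flag values are all assumptions to adjust for your machine.

```yaml
# Minimal sketch: llama.cpp server (Vulkan) + Open WebUI via podman/docker compose.
# Image tags, model path, and flag values are assumptions, not tuned settings.
services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri:/dev/dri              # expose the GPU to the container for Vulkan
    volumes:
      - ./models:/models
    command: >
      -m /models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
      --host 0.0.0.0 --port 8080
      -ngl 99 -c 32768
    ports:
      - "8080:8080"

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Point Open WebUI at llama-server's OpenAI-compatible API
      - OPENAI_API_BASE_URL=http://llamacpp:8080/v1
    ports:
      - "3000:8080"
    depends_on:
      - llamacpp
```

The -ngl (GPU layers) and -c (context size) values in particular are the knobs worth experimenting with on the AI Max+ 395.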
Thanks for sharing those specs @Randy_Queen. You just saved me from going mad all day. I couldn’t get GLM to be performant: using LM Studio, I noticed an improvement with the ROCm Unsloth build, but I was getting gMASKgMASK output. I use opencode, which sits in my Docker container; I shell into it to run it. On Windows here. OSS-120b runs fast, but it cannot code very well and seems lazy to me; it cuts corners. But your settings magically made it work better than the free tier of opencode’s cloud offering! Cheers
You may want to make a backup of those settings; I’ve gotten it to go a little faster with some help from people giving pointers on Reddit, and opencode is performing better for me now. I updated my post with the latest!
@Randy_Queen thanks so much for the share! I took your compose file and added a bit to it. Most importantly, I changed it to use the Vulkan backend for llama.cpp so it uses the GPU for inference rather than the CPU. Perf is really good! I think you could legitimately replace Claude Code with this in terms of generation speed, etc. (I haven’t used it enough to tell whether the underlying model is as effective as Anthropic’s, but that’s a different story).
The README.md contains all the details, although it’s basically just a docker/podman compose up -d.
EDIT: Okay, used it for a few good hours; not quite fast enough to replace Claude Code for everyday use haha, but still surprisingly effective!
@ndom91 I used your docker-compose.yml file as a basis for running llama.cpp on Windows.
I hit a few issues with the Docker + llama.cpp + Windows + Vulkan combination, so I switched to running llama.cpp directly on the host, installed via Winget.
This was my first time running models with llama.cpp directly; I had previously used LM Studio or Ollama to simplify the setup. Your script helped get me up and running quickly, thanks!
Just testing GLM-4.7-Flash-UD-Q8_K_XL.gguf, using your startup parameters.
Hitting the completions endpoint via Postman, it was incredibly slow with a basic “Create a Hello World” prompt.
I reduced the context window size to a more reasonable 16384 and it performed much better.
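For anyone testing from code rather than Postman, here is a minimal Python sketch of hitting llama-server’s OpenAI-compatible chat endpoint. The /v1/chat/completions route is standard llama.cpp; the host/port and the prompt are my assumptions. Capping max_tokens per request is a cheap complement to shrinking the context window, since it bounds how long a single generation can run.

```python
import json
import urllib.request

def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat completion payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # Cap generation length so a runaway reply can't eat the whole context
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080"):
    """POST to llama-server's OpenAI-compatible /v1/chat/completions route."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running llama-server, e.g.:
# chat("Create a Hello World in Python")
```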
I want to get this hooked up to whatever CLIs and plugins support llama.cpp, to see how it performs on some real use cases.
I am using Fedora 43 as the base OS, running kyuz0’s Vulkan Docker image as an execution environment for llama-server (which has a built-in web UI), with Q4 models from Unsloth. My three favourites so far are Nemotron 30b, GLM 4.7 Flash, and Qwen3 Coder 30b; I’m getting around 50-60 t/s on all three and have them maxed out for context window (Nemotron supports up to a 1M context window). I use systemd services to start them all at boot and have about 10 GB of memory left with all three loaded; I have not run into OOM issues (yet).
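For anyone wanting the same start-at-boot setup, a systemd unit for one model can look roughly like the sketch below. The unit name, paths, port, and flags are all hypothetical, and for simplicity it assumes a native llama-server binary rather than the containerised image the poster uses (a container version would wrap a podman run command in ExecStart instead).

```ini
# /etc/systemd/system/llama-nemotron.service  (hypothetical name and paths)
[Unit]
Description=llama-server (Nemotron 30b, Q4)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server \
    -m /srv/models/nemotron-30b-q4.gguf \
    --host 0.0.0.0 --port 8081 -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

One unit per model (each on its own port) then gives you independent start/stop and journalctl logs for each server.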
I tried Gemma 3 and could not get more than 10-12 t/s. I loaded gpt-oss-120b and, like you, was not super happy with the quality.
My go-to right now with Claude Code is Nemotron. For a local setup it is very usable, and all of my skills/tasks/workflows just work. You do know you can just point Claude to your llama instance directly, right? No need for opencode unless you want pure OSS.