I am enjoying exploring the capabilities of the Framework Desktop for running local LLMs.
My use cases so far have involved developing simple applications that require embeddings and chat completions; examples include experimenting with Semantic Kernel and building a bespoke prompt-evaluation pipeline.
I run my models using LM Studio on Windows 11 with either the ROCm llama.cpp or Vulkan llama.cpp runtime, choosing between them based on the stability of the latest version of each (for example, I attempted to load openai/gpt-oss-120b on ROCm this morning and it failed, although it had previously worked).
The largest model I have run which delivered acceptable performance is openai/gpt-oss-120b.
A model I was keen to try, but which hasn’t performed so well, is mistralai/devstral-2-2512. However, mistralai/devstral-small-2-2512 delivers acceptable performance.
I am curious to know what models people are using and what they are used for; I’m looking for inspiration for my next project to make the most of this APU.
Which models are you using and how do they perform?
I don’t have the FW Desktop, but I am trying gpt-oss:20b on my “gaming” desktop, using it as my soon-to-be Alexa replacement in Home Assistant. So far it’s pretty awesome.
Alexa to this day cannot keep context in a conversation, and this local LLM can. It’s quite amazing.
I have mainly used Unsloth’s version of gpt-oss-120b on my FW Desktop. Even the unoptimised gpt-oss-120b fits into memory; the Unsloth build takes about 70 GB, and I think it’s still faster on prompt processing etc.
I need to start looking at slightly larger models too, but I haven’t had much time to check those out.
I’m mainly using LM Studio, and ironically, mostly the Nvidia Nemotron model. I’ve only used Vulkan, as I’ve never seen ROCm actually work; like a lot of AMD software stacks, it is under heavy development and probably won’t have “set it and forget it” stability for years.
I was using Arch when AMD released the Radeon HD 6000 cards 10 or 15 years ago. I remember because AMDCCCLE had zero Xorg support for those cards nearly a year after release; I could only run a terminal, no GUI.
I have been able to run ROCm on Windows only, while Linux (Fedora, CachyOS) only allowed me to run Vulkan. On gpt-oss-120b, performance was about 50 to 53 t/s on both with fresh prompts. There is no clear winner for me yet.
GLM-4.7-Flash just came out, and so far my experience with it has been really good. My setup runs llama.cpp and Open WebUI in Podman containers. I’m running the Unsloth UD-Q8_K_XL variant and using it with opencode. The model’s speed and expertise seem really good for a local solution of its size.
I had a lot of trouble tuning my settings to get acceptable speed; it took me two days to dial it in. So if anyone wants to try, here is my podman-compose file. I am running Fedora 43 Workstation with Podman on the AI Max+ 395 128 GB config.
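For readers who can’t open the attached file, here is a minimal sketch of what a llama.cpp + Open WebUI compose file for this kind of setup can look like. To be clear, this is my own illustration, not the poster’s tuned settings: the image tags, device mapping, model path, and flag values are all assumptions to adjust for your machine.

```yaml
# Minimal sketch: llama.cpp server (Vulkan) + Open WebUI via podman/docker compose.
# Image tags, model path, and flag values are assumptions, not tuned settings.
services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri:/dev/dri              # expose the GPU to the container for Vulkan
    volumes:
      - ./models:/models
    command: >
      -m /models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
      --host 0.0.0.0 --port 8080
      -ngl 99 -c 32768
    ports:
      - "8080:8080"

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Point Open WebUI at llama-server's OpenAI-compatible API
      - OPENAI_API_BASE_URL=http://llamacpp:8080/v1
    ports:
      - "3000:8080"
    depends_on:
      - llamacpp
```

The -ngl (GPU layers) and -c (context size) values in particular are the knobs worth experimenting with on the AI Max+ 395.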
Thanks for sharing those specs @Randy_Queen. You just saved me from going mad all day. I couldn’t get GLM to be performant: using LM Studio, I noticed an improvement with the ROCm Unsloth build, but I was getting gMASKgMASK output. I use opencode, which sits in my Docker container; I shell into it to run it. On Windows here. OSS-120b runs fast, but it cannot code very well and seems lazy to me; it cuts corners. But your settings magically made it work better than the free tier of opencode’s cloud offering! Cheers
You may want to make a backup of those settings; I’ve gotten it to go a little faster with some help from people giving pointers on Reddit, and opencode is performing better for me now. I updated my post with the latest!
@Randy_Queen thanks so much for the share! I took your compose file and added a bit to it. Most importantly, I changed it to use the Vulkan backend for llama.cpp so it uses the GPU for inference rather than the CPU. Perf is really good! I think you could legitimately replace Claude Code with this in terms of generation speed, etc. (I haven’t used it enough to tell whether the underlying model is as effective as Anthropic’s, but that’s a different story).
The README.md contains all the details, although it’s basically just a docker/podman compose up -d.
EDIT: Okay, used it for a few good hours; not quite fast enough to replace Claude Code for everyday use haha, but still surprisingly effective!
@ndom91 I used your docker-compose.yml file as a basis for running llama.cpp on Windows.
I hit a few issues with the Docker + llama.cpp + Windows + Vulkan combination, so I switched to running llama.cpp directly on the host, installed via Winget.
This was my first time running models with llama.cpp directly; I had previously used LM Studio or Ollama to simplify the setup. Your script helped get me up and running quickly, thanks!
Just testing GLM-4.7-Flash-UD-Q8_K_XL.gguf, using your startup parameters.
Hitting the completions endpoint via Postman, it was incredibly slow with a basic “Create a Hello World” prompt.
I reduced the context window size to a more reasonable 16384 and it performed much better.
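For anyone testing from code rather than Postman, here is a minimal Python sketch of hitting llama-server’s OpenAI-compatible chat endpoint. The /v1/chat/completions route is standard llama.cpp; the host/port and the prompt are my assumptions. Capping max_tokens per request is a cheap complement to shrinking the context window, since it bounds how long a single generation can run.

```python
import json
import urllib.request

def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat completion payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # Cap generation length so a runaway reply can't eat the whole context
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080"):
    """POST to llama-server's OpenAI-compatible /v1/chat/completions route."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running llama-server, e.g.:
# chat("Create a Hello World in Python")
```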
I want to get this hooked up to whatever CLIs and plugins support llama.cpp, to see how it performs on some real use cases.
I am using Fedora 43 as the base OS, running kyuz0’s Vulkan Docker image as an execution environment for llama-server (which has a built-in web UI), with Q4 models from Unsloth. My three favourites so far are Nemotron 30b, GLM 4.7 Flash, and Qwen3 Coder 30b; I’m getting around 50-60 t/s on all three and have them maxed out for context window (Nemotron supports up to a 1M context window). I use systemd services to start them all at boot and have about 10 GB of memory left with all three loaded; I have not run into OOM issues (yet).
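For anyone wanting the same start-at-boot setup, a systemd unit for one model can look roughly like the sketch below. The unit name, paths, port, and flags are all hypothetical, and for simplicity it assumes a native llama-server binary rather than the containerised image the poster uses (a container version would wrap a podman run command in ExecStart instead).

```ini
# /etc/systemd/system/llama-nemotron.service  (hypothetical name and paths)
[Unit]
Description=llama-server (Nemotron 30b, Q4)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server \
    -m /srv/models/nemotron-30b-q4.gguf \
    --host 0.0.0.0 --port 8081 -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

One unit per model (each on its own port) then gives you independent start/stop and journalctl logs for each server.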
I tried Gemma 3 and could not get more than 10-12 t/s. I loaded gpt-oss-120b and, like you, was not super happy with the quality.
My go-to right now with Claude Code is Nemotron. For a local setup it is very usable, and all of my skills/tasks/workflows just work. You do know you can just point Claude to your llama instance directly, right? No need for opencode unless you want pure OSS.