Linux documentation to run Ollama, llama.cpp, or vLLM?

Is there any documentation on the steps to install Ubuntu or Fedora Linux on one of these desktops so I can run models using Ollama, vLLM, llama.cpp, or anything else with the GPU?

What’s the state of NPU support for Linux?

It’s still early days, so there are a kazillion ways to run models on your computer. On Linux, Alpaca, Ramalama, and Newelle are worth a look. Cross-platform apps like LM Studio and OpenWebUI work well too.
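Since the original question mentions Ollama specifically: on Linux its documented quick start is a one-line install script plus a run command, roughly like this (the model name is just an example):

```
# Install Ollama (official install script from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and chat with a small model
ollama run llama3.2
```

If the GPU isn’t picked up, Ollama quietly falls back to CPU, so it’s worth checking the output of `ollama ps` after loading a model.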

NPU and ROCm are basically unsupported on current Framework devices (the soon-to-be-released Desktop is getting ROCm, but at the moment Vulkan is faster). There’s hope on the NPU front: Status of AMD NPU Support - #23 by Dragonfire


See Test ROCm with llama.cpp on Fedora Rawhide · Issue #7 · geerlingguy/beowulf-ai-cluster · GitHub

AMD not being “supported” doesn’t mean it can’t be enabled. Fedora 43 has it enabled and working (whether it gets the best performance is another story).

If you want to test it, you can install Fedora 42 and use toolbox to test the Fedora 43 (Rawhide) development packages; both share the same kernel, so the base driver will be fine. Rawhide has a rocm-6.4 build with gfx115n support: SIGs/HC - Fedora Project Wiki
(ROCm is enabled on Fedora 42 as well, but it is rocm-6.3, too old for the AI Max.)
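A rough sketch of that workflow (container and package names here are my guesses, not a recipe from the SIG page):

```
# On a Fedora 42 host: create and enter a Fedora 43 (Rawhide) toolbox
toolbox create --distro fedora --release 43 rawhide-rocm
toolbox enter rawhide-rocm

# Inside the toolbox: pull the Rawhide ROCm packages and confirm the
# gfx1151 GPU shows up
sudo dnf install rocminfo rocm-hip-devel
rocminfo | grep -i gfx
```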

llama.cpp is easy; just follow the build instructions for Vulkan (recommended) or HIP (less stable): llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub
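For reference, the two builds boil down to something like this (flags track the linked build.md at the time of writing and may change; gfx1151 is the Strix Halo / Framework Desktop target):

```
# Vulkan backend (requires the Vulkan SDK / headers)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# HIP backend (requires ROCm; gfx1151 = Strix Halo)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```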

I have some pretty comprehensive (but less well-organized) working resources that I’ve been compiling over the past few months:

I’ve begun editing the AI section of a Strix Halo wiki as a shared doc that will hopefully be comprehensive at some point: https://strixhalo-homelab.d7.wtf/Guides/AI-Capabilities

If you need a summary or help, I’d personally recommend passing all these docs/links through a smart LLM of your choice to help figure out what to do.

If you’re looking for something relatively one-click, I’d recommend trying out https://lemonade-server.ai/ - it’s maintained by a small team at AMD and they’re moving very fast. They started on the NPU side (still Windows-only), but I’ve given them some pointers on the ROCm/llama.cpp side and they’ve built some great stuff now, like builds of llama.cpp for Windows and Linux supporting gfx110x, gfx1151 (the Framework Desktop), and gfx120x (RDNA4): GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration. (Which is great, but the Vulkan backend is generally faster/more stable - ROCm for gfx1151 is still very rough.)
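Whichever build you end up with, actually running a model is the same generic llama.cpp invocation (the model path below is a placeholder):

```
# Serve a GGUF model with all layers offloaded to the GPU
./llama-server -m /path/to/model.gguf -ngl 99 --host 0.0.0.0 --port 8080

# Sanity-check the OpenAI-compatible endpoint
curl http://localhost:8080/v1/models
```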

I’ll also point out GitHub - kyuz0/amd-strix-halo-toolboxes if you’re looking for prebuilt containers and an easy wrapper script with all the backends.

As for vLLM: I’ve been poking at that in my spare time. Current status: GLWT (good luck with that).


I think this repo is super helpful.

Just an update: I’ve gotten vLLM at least nominally running. It’s still not for the faint of heart, but it is at least possible now:

I made a dedicated thread for discussing PyTorch and vLLM on the Framework Desktop (Strix Halo): PyTorch w/ Flash Attention + vLLM for Strix Halo
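Once an install does work, the smoke test is just the standard vLLM entry point (the model name and flags below are generic examples, not Strix Halo-specific settings):

```
# Serve an OpenAI-compatible endpoint on the default port 8000
vllm serve Qwen/Qwen2.5-7B-Instruct --dtype bfloat16 --gpu-memory-utilization 0.90

# Confirm the server is up
curl http://localhost:8000/v1/models
```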


Extremely disappointing to see AMD abandon Linux developers and have no support for their flagship APU.

Won’t trust AMD ever.

Support seems more “in progress” than unsupported; e.g., they’re doing nightly builds of llama.cpp with the latest ROCm: Releases · lemonade-sdk/llamacpp-rocm · GitHub

I’ve been maintaining some guides if you want to build your own:

(I will update the latter with PyTorch/vLLM pointers when I’m less busy.)

The last overview lists the other available options. If you need/want 96GB+ of VRAM, then your next cheapest option at the moment is basically an Apple Mac Studio M4 Max ($3500) and, if it comes out soon, a DGX Spark for $3000-4000. You can get a 96GB RTX PRO 6000 for ~$8500. There’s also a new Huawei 310 Duo card that is <$2000 in China, but it might be hard to get globally. I’ve played around with some Ascend hardware; there is a llama.cpp CANN backend and a vllm-ascend fork. (I haven’t touched the 310s at all. 910s were OK for production inference, but I didn’t try llama.cpp on them. Any support you’re likely to get would be in Chinese.)
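For the curious, the llama.cpp CANN backend builds like any other backend (this assumes the Ascend CANN toolkit is already installed, which is its own adventure):

```
# Build llama.cpp with the CANN backend for Ascend NPUs
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```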


There’s also a new Huawei 310 Duo card that is <$2000 in China

96GB ECC LPDDR4 at 408 GB/s? Wow, want!

That sort of memory bandwidth is what I wish we’d seen in the Framework Desktop. Why should Mac Studios have all the fun?

Or correctly implement parallelism (i.e., tensor split across boards) and use 2 or 4 mainboards :wink:

Is tensor parallelism still not implemented in llama.cpp? There was a lot of chatter about it last year, but I figured things might be different now…
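I’m not sure of the current status either, but the relevant knobs llama.cpp does have are row-wise splitting across GPUs and the RPC backend for spanning machines; whether that counts as real tensor parallelism is debatable. Roughly (hostnames and paths are placeholders):

```
# Split tensors row-wise across two GPUs, 50/50
./llama-server -m /path/to/model.gguf -ngl 99 --split-mode row --tensor-split 1,1

# Or span multiple boards with the RPC backend
./rpc-server -p 50052                                 # on each worker board
./llama-server -m /path/to/model.gguf -ngl 99 \
  --rpc worker1:50052,worker2:50052                   # on the head node
```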