Linux documentation to run Ollama, llama.cpp, or vLLM?

Is there any documentation on the steps to install Ubuntu or Fedora Linux on one of these desktops so I can run models using Ollama, vLLM, llama.cpp, or anything else with the GPU?

What’s the state of NPU support for Linux?

It’s still early days, so there are a kazillion ways to run models on your computer. On Linux, Alpaca, Ramalama, and Newelle are worth a look. Cross-platform apps like LM Studio and OpenWebUI work well too.
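Since the original question mentions Ollama specifically: on Linux its documented quick start is a one-line install script plus a run command, roughly like this (the model name is just an example):

```
# Install Ollama (official install script from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and chat with a small model
ollama run llama3.2
```

If the GPU isn’t picked up, Ollama quietly falls back to CPU, so it’s worth checking the output of `ollama ps` after loading a model.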

NPU and ROCm are basically unsupported on current Framework devices (the soon-to-be-released Desktop is getting ROCm, but at the moment Vulkan is faster). There’s hope on the NPU front: Status of AMD NPU Support - #23 by Dragonfire


See Test ROCm with llama.cpp on Fedora Rawhide · Issue #7 · geerlingguy/beowulf-ai-cluster · GitHub

AMD not being “supported” doesn’t mean it can’t be enabled. Fedora 43 has it enabled and working (whether it gets the best performance is another story).

If you want to test it, you can install Fedora 42 and use toolbox to test the Fedora 43 (Rawhide) development packages; both share the same kernel, so the base driver will be fine. Rawhide has a rocm-6.4 build with gfx115n support: SIGs/HC - Fedora Project Wiki
(ROCm is enabled on Fedora 42 as well, but it is rocm-6.3, too old for the AI Max.)
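A rough sketch of that workflow (container and package names here are my guesses, not a recipe from the SIG page):

```
# On a Fedora 42 host: create and enter a Fedora 43 (Rawhide) toolbox
toolbox create --distro fedora --release 43 rawhide-rocm
toolbox enter rawhide-rocm

# Inside the toolbox: pull the Rawhide ROCm packages and confirm the
# gfx1151 GPU shows up
sudo dnf install rocminfo rocm-hip-devel
rocminfo | grep -i gfx
```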

llama.cpp is easy; just follow the build instructions for Vulkan (recommended) or HIP (less stable): llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub
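For reference, the two builds boil down to something like this (flags track the linked build.md at the time of writing and may change; gfx1151 is the Strix Halo / Framework Desktop target):

```
# Vulkan backend (requires the Vulkan SDK / headers)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# HIP backend (requires ROCm; gfx1151 = Strix Halo)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```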

I have some pretty comprehensive (but less well-organized) working resources that I’ve been compiling over the past few months:

I’ve begun editing the AI section of a Strix Halo wiki as a shared doc that will hopefully be comprehensive at some point: https://strixhalo-homelab.d7.wtf/Guides/AI-Capabilities

If you need a summary or help, I’d personally recommend passing all these docs/links through a smart LLM of your choice to help figure out what to do.

If you’re looking for something relatively one-click, I’d recommend trying out https://lemonade-server.ai/ - it’s maintained by a small team at AMD and they’re moving very fast. They started on the NPU side (still Windows-only), but I’ve given them some pointers on the ROCm/llama.cpp side and they’ve built some great stuff now, like builds of llama.cpp for Windows and Linux supporting gfx110x, gfx1151 (the Framework Desktop), and gfx120x (RDNA4): GitHub - lemonade-sdk/llamacpp-rocm: Fresh builds of llama.cpp with AMD ROCm™ 7 acceleration. (Which is great, but the Vulkan backend is generally faster/more stable - ROCm for gfx1151 is still very rough.)
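Whichever build you end up with, actually running a model is the same generic llama.cpp invocation (the model path below is a placeholder):

```
# Serve a GGUF model with all layers offloaded to the GPU
./llama-server -m /path/to/model.gguf -ngl 99 --host 0.0.0.0 --port 8080

# Sanity-check the OpenAI-compatible endpoint
curl http://localhost:8080/v1/models
```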

I’ll also point out GitHub - kyuz0/amd-strix-halo-toolboxes if you’re looking for prebuilt containers and an easy wrapper script with all the backends.

As for vLLM: I’ve been poking at that in my spare time. Current status: GLWT (good luck with that).


I think this repo is super helpful.

Just an update: I’ve gotten vLLM at least nominally running. It’s still not for the faint of heart, but it is at least possible now:

I made a dedicated thread for discussing PyTorch and vLLM on the Framework Desktop (Strix Halo): PyTorch w/ Flash Attention + vLLM for Strix Halo
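Once an install does work, the smoke test is just the standard vLLM entry point (the model name and flags below are generic examples, not Strix Halo-specific settings):

```
# Serve an OpenAI-compatible endpoint on the default port 8000
vllm serve Qwen/Qwen2.5-7B-Instruct --dtype bfloat16 --gpu-memory-utilization 0.90

# Confirm the server is up
curl http://localhost:8000/v1/models
```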


Extremely disappointing to see AMD abandon Linux developers and have no support for their flagship APU.

Won’t trust AMD ever.

Support seems more “in progress” than unsupported; e.g., they’re doing nightly builds of llama.cpp with the latest ROCm: Releases · lemonade-sdk/llamacpp-rocm · GitHub

I’ve been maintaining some guides if you want to build your own:

(I will update the latter with PyTorch/vLLM pointers when I’m less busy.)

The last overview lists the other available options. If you need/want 96GB+ of VRAM, then your next cheapest option at the moment is basically an Apple Mac Studio M4 Max ($3500) and, if it comes out soon, a DGX Spark for $3000-4000. You can get a 96GB RTX PRO 6000 for ~$8500. There’s also a new Huawei 310 Duo card that is <$2000 in China, but it might be hard to get globally. I’ve played around with some Ascend hardware; there is a llama.cpp CANN backend and a vllm-ascend fork. (I haven’t touched the 310s at all. 910s were OK for production inference, but I didn’t try llama.cpp on them. Any support you’re likely to get would be in Chinese.)
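For the curious, the llama.cpp CANN backend builds like any other backend (this assumes the Ascend CANN toolkit is already installed, which is its own adventure):

```
# Build llama.cpp with the CANN backend for Ascend NPUs
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```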


There’s also a new Huawei 310 Duo card that is <$2000 in China

96GB ECC LPDDR4 at 408 GB/s? Wow, want!

That sort of memory bandwidth is what I wish we’d seen in the Framework Desktop. Why should Mac Studios have all the fun?

Or correctly implement parallelism (i.e., tensor split across boards) and use 2 or 4 mainboards :wink:

Is tensor parallelism still not implemented in llama.cpp? There was a lot of chatter about it last year, but I figured things might be different now…
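I’m not sure of the current status either, but the relevant knobs llama.cpp does have are row-wise splitting across GPUs and the RPC backend for spanning machines; whether that counts as real tensor parallelism is debatable. Roughly (hostnames and paths are placeholders):

```
# Split tensors row-wise across two GPUs, 50/50
./llama-server -m /path/to/model.gguf -ngl 99 --split-mode row --tensor-split 1,1

# Or span multiple boards with the RPC backend
./rpc-server -p 50052                                 # on each worker board
./llama-server -m /path/to/model.gguf -ngl 99 \
  --rpc worker1:50052,worker2:50052                   # on the head node
```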