AMD-specific Ollama Alternative?

Just seeing this:

Anyone have first-hand experience? @geerlingguy it would be cool if you could check it out, as you’re the cluster pro among us :raising_hands:

A note in case it is relevant to your planned use case: FastFlowLM requires paying license fees for commercial use.

1 Like

GitHub - lemonade-sdk/lemonade: Lemonade helps users run local LLMs with the highest performance by configuring state-of-the-art inference engines for their NPUs and GPUs. Join our discord: https://discord.gg/Z3u8tpqQ

You have this, which is even backed by AMD, and then there’s https://www.amd.com/en/developer/resources/technical-articles/gaia-an-open-source-project-from-amd-for-running-local-llms-on-ryzen-ai.html for something apparently easier to use, which I guess is just a wrapper around the former.

Disclaimer: I haven’t tried to use either due to lack of HW.
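That said, Lemonade Server exposes an OpenAI-compatible chat endpoint, so for anyone who does have the hardware, a quick “is it alive” check from Python should look roughly like the sketch below. The port, URL prefix, and model name are guesses on my part; check the Lemonade docs for what your install actually serves.

```python
# Untested sketch: minimal smoke test against Lemonade Server's
# OpenAI-compatible chat endpoint. Port, URL prefix, and model name
# are assumptions, not verified values.
import requests

BASE_URL = "http://localhost:8000/api/v1"  # assumed default, adjust as needed

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Llama-3.2-1B-Instruct-Hybrid",  # placeholder model name
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```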

1 Like

It looks like it requires Windows; so far I haven’t installed that on the cluster. To be honest, I’ve never clustered more than one Windows machine together before (outside of one time helping a company set up a MS SQL cluster with separate PHP servers running on Windows for a pretty crazy web project lol).

Still waiting for AMD to support Linux for the NPU, as that’s where all the more serious LLM work seems to take place…

1 Like

Wat, who do they think they are, Scam Altman? :smiley:

Right, not worth it if it’s not Linux. I wonder if the NPU adds a meaningful improvement in tokens/s, or if everything is just LPDDR5-bandwidth-bound anyway.

1 Like

I thought NPUs on devices were usually used for offloading the random AI stuff like removing/blurring the background in Teams calls, or noise cancellation?

i.e. Fire up the NPU rather than the GPU for light tasks.

Certainly Ollama on my Mac uses the GPU rather than Apple’s Neural Engine (though similarly, that could be a limitation of what’s exposed in macOS).

Theoretically NPUs should handle tensor processing (the matmuls behind PyTorch, TensorFlow, and the open-weights models they run) more efficiently. I’m not sure either whether that’s the case here, i.e. whether ROCm (AMD’s CUDA equivalent) can actually drive the NPU or not.
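One way to actually check, rather than guess, is a quick probe on a ROCm build of PyTorch. On those builds the torch.cuda API is a shim over HIP, so it reports the AMD GPU; as far as I know the NPU never shows up there at all, since it’s driven by AMD’s separate Ryzen AI / ONNX Runtime stack rather than ROCm. Sketch below, assuming the ROCm wheels from the pytorch.org index:

```python
# Quick probe of what a ROCm build of PyTorch actually drives.
# Assumption: the NPU is not expected to appear here at all; only the GPU does.
import torch

print("HIP runtime:", torch.version.hip)           # None on CUDA-only/CPU builds
print("GPU visible:", torch.cuda.is_available())   # torch.cuda is the HIP shim on ROCm
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

    # A small fp16 matmul (the kind of op an NPU would also accelerate),
    # just to confirm the GPU path actually executes.
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    c = a @ b
    torch.cuda.synchronize()
    print("matmul ran on:", c.device)
```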

On the Windows side AMD has been NPU-accelerating LLMs with GAIA and Lemonade.

Lemonade does work on Linux (using llama.cpp for Vulkan and ROCm support) btw, but GPU only, and I think it’s worth trying over Ollama (besides their weird refusal to support Vulkan, which is generally the best/easiest/most dependable backend for Strix Halo’s GPU atm, Ollama as a project also just continues to be a pain to deal with).

For other GPU options: if you don’t need it to be open source, I’ve found LM Studio to work fine on Windows and Linux, and if you’d rather have open source, Jan.ai is pretty decent as well and lets you select your llama.cpp backend from a variety of builds.

If you want to use the ROCm llama.cpp backend, ROCm 7.0 + rocWMMA FA (+ sometimes hipBLASLt) is usually (but not always) the fastest pp you can get on Strix Halo atm. I’ve thrown up an initial guide on that today, btw: https://strixhalo-homelab.d7.wtf/AI/llamacpp-with-ROCm
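If you just want to A/B the flash-attention path yourself, llama-bench makes that easy. The sketch below assumes a ROCm build of llama.cpp (cmake -DGGML_HIP=ON, plus -DGGML_HIP_ROCWMMA_FATTN=ON for the rocWMMA FA path); the binary path, model path, and the hipBLASLt env toggle are placeholders to adjust for your setup:

```python
# Sketch: compare prompt processing with flash attention off vs. on,
# using a ROCm build of llama-bench. Paths and the env toggle are placeholders.
import os
import subprocess

LLAMA_BENCH = "./build/bin/llama-bench"        # adjust to your build tree
MODEL = "models/your-model-Q4_K_M.gguf"        # placeholder model path

env = dict(os.environ)
env["ROCBLAS_USE_HIPBLASLT"] = "1"             # the "sometimes hipBLASLt" part; drop it if it hurts

for fa in ("0", "1"):                          # flash attention off, then on
    print(f"=== flash attention: {fa} ===")
    subprocess.run(
        [LLAMA_BENCH, "-m", MODEL, "-p", "2048", "-n", "64", "-fa", fa],
        env=env,
        check=True,
    )
```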

For Strix Halo, the GPU is faster than the NPU, and trying to tensor-parallel between them would be … a challenge, to say the least. If you’re going to use both, I think it only makes sense if you’re running different models on each (e.g., ASR on the NPU and an LLM on the GPU?).

3 Likes