AMD-specific Ollama Alternative?

Just seeing this:

Anyone have first-hand experience? @geerlingguy it would be cool if you could check it out, as you’re the cluster pro among us :raising_hands:

A note in case it is relevant to your planned use case: FastFlowLM requires paying license fees for commercial use.

1 Like

GitHub - lemonade-sdk/lemonade: Lemonade helps users run local LLMs with the highest performance by configuring state-of-the-art inference engines for their NPUs and GPUs. Join our discord: https://discord.gg/Z3u8tpqQ

You have this, which is even backed by AMD, and then there’s https://www.amd.com/en/developer/resources/technical-articles/gaia-an-open-source-project-from-amd-for-running-local-llms-on-ryzen-ai.html for something apparently easier to use, which I guess is just a wrapper around the former.

Disclaimer: I haven’t tried to use either due to lack of HW.
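That said, Lemonade Server exposes an OpenAI-compatible chat endpoint, so for anyone who does have the hardware, a quick “is it alive” check from Python should look roughly like the sketch below. The port, URL prefix, and model name are guesses on my part; check the Lemonade docs for what your install actually serves.

```python
# Untested sketch: minimal smoke test against Lemonade Server's
# OpenAI-compatible chat endpoint. Port, URL prefix, and model name
# are assumptions, not verified values.
import requests

BASE_URL = "http://localhost:8000/api/v1"  # assumed default, adjust as needed

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Llama-3.2-1B-Instruct-Hybrid",  # placeholder model name
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```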

1 Like

It looks like it requires Windows; so far I haven’t installed that on the cluster. To be honest, I’ve never clustered more than one Windows machine together before (outside of one time helping a company set up a MS SQL cluster with separate PHP servers running on Windows for a pretty crazy web project lol).

Still waiting for AMD to support Linux for the NPU, as that’s where all the more serious LLM work seems to take place…

1 Like

Wat, who do they think they are, Scam Altman? :smiley:

Right, not worth it if it’s not Linux. I wonder if the NPU adds a meaningful improvement in tokens/s, or if everything is just LPDDR5-bandwidth-bound anyway.

1 Like

I thought NPUs on devices were usually used for offloading the random AI stuff like removing/blurring the background in Teams calls, or noise cancellation?

i.e. Fire up the NPU rather than the GPU for light tasks.

Certainly Ollama on my Mac uses the GPU rather than Apple’s Neural Engine (though similarly, that could be a limitation of what’s exposed in macOS).

Theoretically NPUs should handle tensor processing (the matmuls behind PyTorch, TensorFlow, and the open-weights models they run) more efficiently. I’m not sure either whether that’s the case here, i.e. whether ROCm (AMD’s CUDA equivalent) can actually drive the NPU or not.
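One way to actually check, rather than guess, is a quick probe on a ROCm build of PyTorch. On those builds the torch.cuda API is a shim over HIP, so it reports the AMD GPU; as far as I know the NPU never shows up there at all, since it’s driven by AMD’s separate Ryzen AI / ONNX Runtime stack rather than ROCm. Sketch below, assuming the ROCm wheels from the pytorch.org index:

```python
# Quick probe of what a ROCm build of PyTorch actually drives.
# Assumption: the NPU is not expected to appear here at all; only the GPU does.
import torch

print("HIP runtime:", torch.version.hip)           # None on CUDA-only/CPU builds
print("GPU visible:", torch.cuda.is_available())   # torch.cuda is the HIP shim on ROCm
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

    # A small fp16 matmul (the kind of op an NPU would also accelerate),
    # just to confirm the GPU path actually executes.
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    c = a @ b
    torch.cuda.synchronize()
    print("matmul ran on:", c.device)
```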

On the Windows side AMD has been NPU-accelerating LLMs with GAIA and Lemonade.

Lemonade does work on Linux (using llama.cpp for Vulkan and ROCm support) btw, but GPU only, and I think it’s worth trying over Ollama (besides their weird refusal to support Vulkan, which is generally the best/easiest/most dependable backend for Strix Halo’s GPU atm, Ollama as a project also just continues to be a pain to deal with).

For other GPU options: if you don’t need it to be open source, I’ve found LM Studio to work fine on Windows and Linux, and if you’d rather have open source, Jan.ai is pretty decent as well and lets you select your llama.cpp backend from a variety of builds.

If you want to use the ROCm llama.cpp backend, ROCm 7.0 + rocWMMA FA (+ sometimes hipBLASLt) is usually (but not always) the fastest pp you can get on Strix Halo atm. I’ve thrown up an initial guide on that today, btw: https://strixhalo-homelab.d7.wtf/AI/llamacpp-with-ROCm
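If you just want to A/B the flash-attention path yourself, llama-bench makes that easy. The sketch below assumes a ROCm build of llama.cpp (cmake -DGGML_HIP=ON, plus -DGGML_HIP_ROCWMMA_FATTN=ON for the rocWMMA FA path); the binary path, model path, and the hipBLASLt env toggle are placeholders to adjust for your setup:

```python
# Sketch: compare prompt processing with flash attention off vs. on,
# using a ROCm build of llama-bench. Paths and the env toggle are placeholders.
import os
import subprocess

LLAMA_BENCH = "./build/bin/llama-bench"        # adjust to your build tree
MODEL = "models/your-model-Q4_K_M.gguf"        # placeholder model path

env = dict(os.environ)
env["ROCBLAS_USE_HIPBLASLT"] = "1"             # the "sometimes hipBLASLt" part; drop it if it hurts

for fa in ("0", "1"):                          # flash attention off, then on
    print(f"=== flash attention: {fa} ===")
    subprocess.run(
        [LLAMA_BENCH, "-m", MODEL, "-p", "2048", "-n", "64", "-fa", fa],
        env=env,
        check=True,
    )
```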

For Strix Halo, the GPU is faster than the NPU, and trying to tensor-parallel between them would be … a challenge, to say the least. If you’re going to use both, I think it only makes sense if you’re running different models on each (e.g., ASR on the NPU and an LLM on the GPU?).

3 Likes