Maxed out Minimax M2.5 on Framework Desktop

Hi, got it loaded :smiley:
Minimax M2.5 on Windows with Llama.cpp Vulkan edition

c:/llama.cpp.vk/llama-server.exe --host 0.0.0.0 --port 8123 --model .lmstudio\models\Unsloth\MiniMax-M2.5\MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -c 16384 --keep 1024 --no-mmap --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 -fit on --context-shift
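For readability, here is the same invocation rewritten with Unix-style line continuations (run it as a single line on Windows cmd); flag descriptions below are my reading of llama.cpp's server options and may vary between builds:

```shell
c:/llama.cpp.vk/llama-server.exe \
  --host 0.0.0.0 --port 8123 \
  --model ".lmstudio\models\Unsloth\MiniMax-M2.5\MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf" \
  -c 16384 \
  --keep 1024 \
  --no-mmap \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -fit on \
  --context-shift
```

As I understand the flags: `-c 16384` sets the 16k context window, `--keep 1024` preserves the first 1024 tokens when the context shifts, `--no-mmap` loads the model fully into memory instead of memory-mapping it, `--cache-type-k`/`--cache-type-v q4_0` quantize the KV cache to save VRAM, and `--context-shift` drops old tokens instead of erroring when the window fills. `-fit` is a newer option, so check `--help` on your build.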

It's the Unsloth UD-Q3_K_XL version with 16k context, and I got 22 t/s in Open WebUI (Python version on Windows). VRAM usage is 99 GB and RAM usage is 25 GB.
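If you want to sanity-check the t/s number outside Open WebUI: llama-server's `/completion` responses include a `timings` object, and generation speed is just generated tokens over generation time. A minimal sketch (the field names match llama.cpp's response format as I know it; verify against your build, and the sample numbers are made up for illustration):

```python
def tokens_per_second(timings: dict) -> float:
    """Compute generation speed from a llama-server-style 'timings' object."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)

# Hypothetical timings: 352 tokens generated in 16 seconds.
sample = {"predicted_n": 352, "predicted_ms": 16000.0}
print(round(tokens_per_second(sample), 1))  # 22.0
```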

How does it perform and what use cases do you have for it? Thanks!

It performs significantly better than I expected… but since it's only Q3, you have to expect some errors or even hallucinations. Generation also slows down considerably once the context reaches around 42k… I think 16k is realistic, though… it'll be enough for "technical discussions," but for coding I'll definitely use Qwen3 Coder Next at q6_k_m… I actually just wanted to test the maximum possible on the Strix Halo.
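One reason long contexts hurt here: the KV cache grows linearly with context length, even when quantized to q4_0. A back-of-envelope estimator (the model dimensions below are placeholders for illustration, NOT MiniMax M2.5's actual config; q4_0's size comes from its 18-byte block holding 32 values):

```python
def kv_cache_bytes(n_layer: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> float:
    """Rough KV-cache size: one K and one V tensor per layer."""
    return 2 * n_layer * n_ctx * n_kv_heads * head_dim * bytes_per_elem

F16 = 2.0        # 2 bytes per element
Q4_0 = 18 / 32   # q4_0 packs 32 values into an 18-byte block

# Hypothetical model dimensions, chosen only to show the scaling:
dims = dict(n_layer=60, n_kv_heads=8, head_dim=128)

for ctx in (16_384, 42_000):
    f16 = kv_cache_bytes(n_ctx=ctx, bytes_per_elem=F16, **dims)
    q4 = kv_cache_bytes(n_ctx=ctx, bytes_per_elem=Q4_0, **dims)
    print(f"{ctx:>6} ctx: f16 {f16 / 2**30:.2f} GiB, q4_0 {q4 / 2**30:.2f} GiB")
```

The point is the ratio, not the absolute numbers: q4_0 cuts the cache to roughly 28% of f16, and going from 16k to 42k context still multiplies it by ~2.6x.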

Did you have to manually set the VRAM allocation for this to work? So far I've had no luck getting it to load because llama.cpp keeps maxing out at 64 GB of VRAM.

Yes. Normally I have it at 64 GB; for this I had to set it to 96 GB. I was also using the ROCm version of the llama.cpp binaries before and have now downloaded the Vulkan release… it took some time to get it running :wink: