Llama.cpp/vLLM Toolboxes for LLM inference on Strix Halo

I just wanted to share the toolboxes I’ve been working on together with folks at https://strixhalo-homelab.d7.wtf/.

The idea is to provide ready-to-use environments for LLM inference tailored to Strix Halo, together with configuration tips.

These all require a recent Linux distribution; I recommend Fedora 42.

Llama.cpp Toolboxes

These are currently the best option for running many different LLMs on the Framework Desktop. You will see many toolboxes with different backends; right now I’d choose amdvlk or rocm-6.4.3-rocwmma. The repo has a link to a YouTube video that explains all the setup to get this going quickly.
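For reference, a minimal sketch of creating and entering one of the toolboxes (the image URL below is a placeholder, not the actual one; the real image names are listed in the repo):

toolbox create --image ghcr.io/example/llama-vulkan-radv:latest llama-vulkan-radv   # placeholder image URL
toolbox enter llama-vulkan-radv
llama-cli --list-devices   # check that the GPU is visible inside the toolbox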

vLLM Toolboxes / Container (WIP / EXPERIMENTAL)

This is still WIP, and vLLM on Strix Halo is a pain at the moment. I’d appreciate any contributions to support more models; right now this requires a lot of patching, and even then you can only run a subset of models.
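For reference, once a container is patched and built, serving typically boils down to something like this (a sketch only; the model name is just an example and may not be in the currently supported subset):

vllm serve Qwen/Qwen3-8B --max-model-len 8192   # example model; adjust to one that currently works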

Hope you find this helpful!

14 Likes

I’ve been using your llama.cpp repo/Dockerfiles (specifically rocm-7rc-rocwmma with extra bits to build the RPC server) and it works really well. Thank you :slight_smile:

1 Like

Thank you, glad you’re finding it useful! ROCm support will also be improving - so hopefully we can squeeze even more performance out of this!

1 Like

Thank you for this perfect starting point! A silly question: you chose Fedora Workstation. Isn’t that also a full-fledged desktop environment (only with GNOME)? So far I have tested it with the KDE Plasma desktop. Did I overlook something? Does it consume too many resources to fully utilize my 128 GB?

Thank you in advance :wink:

If you haven’t already, upgrade to ROCm 6.4.4; it has optimizations specifically for Strix Halo.

Thank you @Mario_Limonciello - I will update the toolbox to use rocm 6.4.4 - do you have a list of the optimizations that have been pushed?

@Lars_Urban - I run mine as an SSH server and it doesn’t even load a desktop. Yes, that is to keep as much RAM free as I can.
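If you want to do the same, a minimal sketch of making Fedora boot without the graphical session (standard systemd targets; you can switch back at any time):

sudo systemctl set-default multi-user.target   # boot to a text/SSH-only session
sudo systemctl set-default graphical.target    # revert to the desktop later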

Also @Mario_Limonciello , do you know if those optimizations also made it to rocm 7?

They’re not in 7.0; they’ll be coming in a later 7.0.x release.

This is the press release that goes with 6.4.4:

https://www.amd.com/en/blogs/2025/the-road-to-rocm-on-radeon-for-windows-and-linux.html

1 Like

Thank you very much @kyuz0 !

I seem to be missing something to see the full memory resources.
Setup was done with:

sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432'

I also reserved just 512 MB of dedicated VRAM in the BIOS.

but executing:

llama-cli --list-devices

shows only 86GB:

maintenance@fedora:~$ toolbox enter llama-vulkan-radv
⬢ [maintenance@toolbx ~]$ llama-cli --list-devices
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
Available devices:
Vulkan0: Radeon 8060S Graphics (RADV GFX1151) (87722 MiB, 86599 MiB free)

amdgpu_top:

Some tiny bit is missing …
I see roughly the same resources in LM Studio (84 GB, for example),

which means it has nothing to do with the toolbox setup or LM Studio.
Is this Fedora 42 specific? O.o

KR Lars

Have you actually tried to run large models? Sometimes the drivers report incorrect values with unified memory; it’s worth just trying a large model first and seeing how it goes. If amdgpu_top reports the correct values, give it a go.

I use Fedora 42, no issues at all for me.
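If you want to double-check from the kernel side, a quick sketch (the card index is an assumption; it may be card0 or card1 depending on your system):

cat /proc/cmdline   # confirm the grubby arguments actually took effect after reboot
cat /sys/class/drm/card1/device/mem_info_gtt_total   # GTT pool size in bytes, as reported by amdgpu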

1 Like

Is it in TheRock? That may explain the benchmark difference we see between 7.0 and the TheRock build (the difference is huge: 3-4x with llama.cpp prompt processing).

Yeah, the changes are in mainline, which is what TheRock tracks. They just aren’t cherry-picked for the 7.0.x train yet.

1 Like

Thanks!
And as kyuz0 noticed, I can confirm that 6.4.4 is much faster than the current 7.0.1 release (and on par with TheRock).

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| qwen3 8B BF16 | 15.26 GiB | 8.19 B | ROCm 7.0.1 | 999 | 4096 | 1 | 0 | pp512 | 325.95 ± 0.22 |
| qwen3 8B BF16 | 15.26 GiB | 8.19 B | ROCm 6.4.4 | 999 | 4096 | 1 | 0 | pp512 | 1132.26 ± 2.42 |
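For anyone wanting to reproduce this, the rows above are llama-bench output; a sketch of the kind of invocation that produces them (the model path is just an example):

llama-bench -m models/qwen3-8b-bf16.gguf -ngl 999 -ub 4096 -fa 1 -mmp 0 -p 512 -n 0
# -ub sets n_ubatch, -fa enables flash attention, -mmp 0 disables mmap, -p 512 gives the pp512 column, -n 0 skips the token-generation test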

@kyuz0
i have downloaded:
HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF --include "Q3_K_M/*" --local-dir models/qwen3-235B-A22B/

and executed:
llama-cli -m models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf -ngl 999 --no-mmap

and it looks like it is using the RAM =) … thank you for your help again ;D

1 Like

Glad it worked!
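If you want to reach it from other apps, a minimal sketch of serving the same model over HTTP with llama-server (assuming llama-server is present in the toolbox; host and port are arbitrary):

llama-server -m models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf -ngl 999 --no-mmap --host 0.0.0.0 --port 8080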

1 Like

These are great, thanks for putting everything together.

I still think I have something wrong, though, as the output is ~12 tokens/sec with a model like gemma3-27B and around 5 tokens/sec with a larger model like Hermes-4-70B-Q4_K_M.

Is that normal? From looking at amdgpu_top I see the memory being correctly utilized and it is running on the GPU, so I think the toolbox is set up right.

Currently running Bazzite, but I was getting the same perf on a Fedora 42 install.

Edit to note that I get about the same performance using the vulkan-radv and rocm-6.4.4 toolboxes.

OK, after actually benchmarking on a like-for-like model, I am getting similar perf to kyuz0’s GitHub benchmark results.

This is vulkan-radv results:

| model | size | params | backend | ngl | fa | test | t/s |
| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | Vulkan | 999 | 1 | pp512 | 1616.56 ± 3.55 |
| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | Vulkan | 999 | 1 | tg128 | 87.73 ± 0.15 |

I get worse rocm-6.4.4-rocwmma results, but I think that is a Bazzite/ROCm issue. When I run the bench I get errors/warnings:

rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.

Still looking around on this one.

Same for me.
The rocm-6.4.4 build for EPEL 9 did not look to have hipBLASLt config files for gfx1151.
It is OK for rocm-7.0.1 and 7.0.2 (new release today and recommended).
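As a possible workaround on 6.4.4, rocBLAS can be told to skip hipBLASLt and go straight to Tensile; my understanding is that the ROCBLAS_USE_HIPBLASLT variable controls this (behaviour may vary between ROCm releases):

ROCBLAS_USE_HIPBLASLT=0 llama-bench -m models/your-model.gguf -ngl 999   # model path is just an example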

I tried using the toolbox version on fc43; here is what I get:

llama-cli -m models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf -ngl 999 --no-mmap
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6730 (e60f01d9) with cc (GCC) 15.2.1 20250924 (Red Hat 15.2.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) (0000:c3:00.0) - 87086 MiB free
gguf_init_from_file: failed to open GGUF file 'models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model
⬢ [tmunn@toolbx ~]$ HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF --include "Q3_K_M/*" --local-dir models/qwen3-235B-A22B/
/home/tmunn/models/qwen3-235B-A22B
⬢ [tmunn@toolbx ~]$

No idea how to fix.

What does this report (from inside the container)?

ls -al models/qwen3-235B-A22B/Q3_K_M/