Llama.cpp/vLLM Toolboxes for LLM inference on Strix Halo

I just wanted to share the toolboxes I’ve been working on together with folks at https://strixhalo-homelab.d7.wtf/.

The idea is to provide ready-to-use environments for LLM inference tailored to Strix Halo, together with configuration tips.

These all require a recent Linux distribution; I recommend Fedora 42.

Llama.cpp Toolboxes

These are currently the best option for running many different LLMs on the Framework Desktop. You will see many toolboxes with different backends; right now I’d choose amdvlk or rocm-6.4.3-rocwmma. The repo has a link to a YouTube video that explains all the setup to get this going quickly.
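For reference, a minimal sketch of creating and entering one of the toolboxes (the image URL below is a placeholder, not the actual one; the real image names are listed in the repo):

toolbox create --image ghcr.io/example/llama-vulkan-radv:latest llama-vulkan-radv   # placeholder image URL
toolbox enter llama-vulkan-radv
llama-cli --list-devices   # check that the GPU is visible inside the toolbox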

vLLM Toolboxes / Container (WIP / EXPERIMENTAL)

This is still WIP, and vLLM on Strix Halo is a pain at the moment. I’d appreciate any contributions to support more models; right now this requires a lot of patching, and even then you can only run a subset of models.
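For reference, once a container is patched and built, serving typically boils down to something like this (a sketch only; the model name is just an example and may not be in the currently supported subset):

vllm serve Qwen/Qwen3-8B --max-model-len 8192   # example model; adjust to one that currently works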

Hope you find this helpful!

14 Likes

I’ve been using your llama.cpp repo/Dockerfiles (specifically rocm-7rc-rocwmma with extra bits to build the RPC server) and it works really well. Thank you :slight_smile:

1 Like

Thank you, glad you’re finding it useful! ROCm support will also be improving - so hopefully we can squeeze even more performance out of this!

1 Like

Thank you for this perfect starting point! A silly question: you chose Fedora Workstation. Isn’t that also a full-fledged desktop environment (only with GNOME)? So far I have tested it with the KDE Plasma desktop. Did I overlook something? Does it consume too many resources to fully utilize my 128 GB?

Thank you in advance :wink:

If you haven’t already, upgrade to ROCm 6.4.4; it has optimizations specifically for Strix Halo.

Thank you @Mario_Limonciello - I will update the toolbox to use rocm 6.4.4 - do you have a list of the optimizations that have been pushed?

@Lars_Urban - I run mine as an SSH server and it doesn’t even load a desktop. Yes, that is to keep as much RAM free as I can.
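If you want to do the same, a minimal sketch of making Fedora boot without the graphical session (standard systemd targets; you can switch back at any time):

sudo systemctl set-default multi-user.target   # boot to a text/SSH-only session
sudo systemctl set-default graphical.target    # revert to the desktop later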

Also @Mario_Limonciello , do you know if those optimizations also made it to rocm 7?

They’re not in 7.0; they’ll be coming in a later 7.0.x release.

This is the press release that goes with 6.4.4:

https://www.amd.com/en/blogs/2025/the-road-to-rocm-on-radeon-for-windows-and-linux.html

1 Like

Thank you very much @kyuz0 !

I seem to be missing something to see the full memory resources.
Setup was done with:

sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432'

I also reserved just 512 MB of dedicated VRAM in the BIOS.

but executing:

llama-cli --list-devices

shows only 86GB:

maintenance@fedora:~$ toolbox enter llama-vulkan-radv
⬢ [maintenance@toolbx ~]$ llama-cli --list-devices
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
Available devices:
Vulkan0: Radeon 8060S Graphics (RADV GFX1151) (87722 MiB, 86599 MiB free)

amdgpu_top:

Some tiny bit is missing …
I see roughly the same resources in LM Studio (84 GB, for example),

which means it has nothing to do with the toolbox setup or LM Studio.
Is this Fedora 42 specific? O.o

KR Lars

Have you actually tried to run large models? Sometimes the drivers report incorrect values with unified memory; it’s worth just trying a large model first and seeing how it goes. If amdgpu_top reports the correct values, give it a go.

I use Fedora 42, no issues at all for me.
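If you want to double-check from the kernel side, a quick sketch (the card index is an assumption; it may be card0 or card1 depending on your system):

cat /proc/cmdline   # confirm the grubby arguments actually took effect after reboot
cat /sys/class/drm/card1/device/mem_info_gtt_total   # GTT pool size in bytes, as reported by amdgpu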

1 Like

Is it in TheRock? That may explain the benchmark difference we see between 7.0 and the TheRock build (the difference is huge: 3-4x with llama.cpp prompt processing).

Yeah, the changes are in mainline, which is what TheRock tracks. They just aren’t cherry-picked for the 7.0.x train yet.

1 Like

Thanks!
And as kyuz0 noticed, I can confirm that 6.4.4 is much faster than the current 7.0.1 release (and on par with TheRock).

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| qwen3 8B BF16 | 15.26 GiB | 8.19 B | ROCm 7.0.1 | 999 | 4096 | 1 | 0 | pp512 | 325.95 ± 0.22 |
| qwen3 8B BF16 | 15.26 GiB | 8.19 B | ROCm 6.4.4 | 999 | 4096 | 1 | 0 | pp512 | 1132.26 ± 2.42 |
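For anyone wanting to reproduce this, the rows above are llama-bench output; a sketch of the kind of invocation that produces them (the model path is just an example):

llama-bench -m models/qwen3-8b-bf16.gguf -ngl 999 -ub 4096 -fa 1 -mmp 0 -p 512 -n 0
# -ub sets n_ubatch, -fa enables flash attention, -mmp 0 disables mmap, -p 512 gives the pp512 column, -n 0 skips the token-generation test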

@kyuz0
i have downloaded:
HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF --include "Q3_K_M/*" --local-dir models/qwen3-235B-A22B/

and executed:
llama-cli -m models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf -ngl 999 --no-mmap

and it looks like it is using the RAM =) … thank you for your help again ;D

1 Like

Glad it worked!
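If you want to reach it from other apps, a minimal sketch of serving the same model over HTTP with llama-server (assuming llama-server is present in the toolbox; host and port are arbitrary):

llama-server -m models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf -ngl 999 --no-mmap --host 0.0.0.0 --port 8080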

1 Like

These are great, thanks for putting everything together.

I still think I have something wrong, though, as the output is ~12 tokens/sec with a model like gemma3-27B and around 5 tokens/sec with a larger model like Hermes-4-70B-Q4_K_M.

Is that normal? From looking at amdgpu_top I see the memory being correctly utilized and it is running on the GPU, so I think the toolbox is set up right.

Currently running Bazzite, but I was getting the same perf on a Fedora 42 install.

Edit to note that I get about the same performance using the vulkan-radv and rocm-6.4.4 toolboxes.

OK, after actually benchmarking on a like-for-like model, I am getting similar perf to kyuz0’s GitHub benchmark results.

This is vulkan-radv results:

| model | size | params | backend | ngl | fa | test | t/s |
| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | Vulkan | 999 | 1 | pp512 | 1616.56 ± 3.55 |
| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | Vulkan | 999 | 1 | tg128 | 87.73 ± 0.15 |

I get worse rocm-6.4.4-rocwmma results, but I think that is a Bazzite/ROCm issue. When I run the bench I get errors/warnings:

rocBLAS error: No hipBLASLt solution found
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.

Still looking around on this one.

Same for me.
The rocm-6.4.4 build for EPEL 9 did not look to have hipBLASLt config files for gfx1151.
It is OK for rocm-7.0.1 and 7.0.2 (new release today and recommended).
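As a possible workaround on 6.4.4, rocBLAS can be told to skip hipBLASLt and go straight to Tensile; my understanding is that the ROCBLAS_USE_HIPBLASLT variable controls this (behaviour may vary between ROCm releases):

ROCBLAS_USE_HIPBLASLT=0 llama-bench -m models/your-model.gguf -ngl 999   # model path is just an example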

I tried using the toolbox version on fc43; here is what I get:

llama-cli -m models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf -ngl 999 --no-mmap
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6730 (e60f01d9) with cc (GCC) 15.2.1 20250924 (Red Hat 15.2.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) (0000:c3:00.0) - 87086 MiB free
gguf_init_from_file: failed to open GGUF file 'models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/qwen3-235B-A22B/Q3_K_M/Qwen3-235B-A22B-Instruct-2507-Q3_K_M-00001-of-00003.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model
⬢ [tmunn@toolbx ~]$ HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF --include "Q3_K_M/*" --local-dir models/qwen3-235B-A22B/
/home/tmunn/models/qwen3-235B-A22B
⬢ [tmunn@toolbx ~]$

No idea how to fix.

What does this report (from inside the container)?

ls -al models/qwen3-235B-A22B/Q3_K_M/