The only thing I can think of is that these kernel parameters are not set:
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
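For reference, a minimal sketch of how to add these on a GRUB-based setup (the file path and regeneration command are assumptions and vary by distro):

# Assumption: GRUB bootloader; with systemd-boot or others the kernel command line lives elsewhere.
# 1. Append the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#    GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
# 2. Regenerate the GRUB config and reboot:
sudo grub-mkconfig -o /boot/grub/grub.cfg   # or `sudo update-grub` on Debian/Ubuntu
sudo reboot
# 3. Verify the parameters are active:
cat /proc/cmdline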
struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params) {
    FILE * file = ggml_fopen(fname, "rb");
    if (!file) {
        GGML_LOG_ERROR("%s: failed to open GGUF file '%s'\n", __func__, fname);
        return nullptr;
    }

    struct gguf_context * result = gguf_init_from_file_impl(file, params);
    fclose(file);
    return result;
}
=> I would say it’s more of a path or file permissions issue.
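A quick way to check that (the model path below is a placeholder) is to verify the file exists and is readable by the user running llama.cpp:

MODEL=/path/to/model.gguf   # placeholder - use your actual model path
ls -l "$MODEL"              # does it exist, and who owns it?
test -r "$MODEL" && echo "readable" || echo "not readable by $(whoami)"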
@kyuz0 Just wanna drop a thank you for the toolboxes. Made setting things up a whole lot easier. Been using the toolboxes with Arch (Omarchy). Image-gen has been fun and LLM performance is pretty good, but I wonder if I can squeeze out more? To be fair, I’ve only been using vulkan-amdvlk.
I’ll make time to post benchmarks soon.
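Probably something along the lines of llama-bench from llama.cpp so the numbers are comparable (model path is a placeholder):

# 512-token prompt processing and 128-token generation, the usual defaults
./llama-bench -m /path/to/model.gguf -p 512 -n 128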
I’ve run into this issue once. First, make sure the specified model has been downloaded completely and is in the correct directory/path.
Thank you! I’d say there’s room for improvement. For llama.cpp, I know one of the developers is working on optimized kernels for AMD GPUs; once that’s done in a few months, you’ll see better performance.
For PyTorch-based workflows there’s also room for improvement: as soon as AMD fixes their TheRock pipelines and ships wheels with AOTriton, performance will improve.
I found out why my models weren’t loading. I had to convert them to GGUF format (not a horrible process) prior to running them. Basically I ran llama.cpp and some Python scripts it includes to do the conversion. Once converted, all the models worked fine. I loaded a 70B parameter model into memory and it worked quite well.
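For anyone hitting the same thing, the conversion goes roughly like this (script and tool names are from the llama.cpp repo; paths and the quant type are placeholders):

# from a llama.cpp checkout, with its Python requirements installed
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/hf-model-dir --outfile model-f16.gguf --outtype f16
# optionally quantize afterwards, e.g. to Q4_K_M (llama-quantize is built along with llama.cpp)
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M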
On Hugging Face you typically find the GGUFs for most models ready to download.
Many can be found here: bartowski (Bartowski)
He does a great job.
I usually check unsloth first, and then grab a bartowski quant if an unsloth one isn’t available.
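Once you’ve picked a quant, huggingface-cli can pull just the files you need; the repo and file pattern below are only an example:

# example only: grab a single Q4_K_M quant from one of bartowski’s repos
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF --include "*Q4_K_M*" --local-dir ./models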
BTW, vLLM 0.11.0 and above include support for gfx1151.
I managed to build from the main branch - had to apply @kyuz0’s fixes for amdsmi (which still crashes) and platform detection, but other than that, I just followed the vLLM installation instructions for ROCm and was able to compile (had to set a few extra env vars though).
Successfully loaded cpatonn/Qwen3-VL-4B-Instruct-AWQ-8bit and was able to process an image with prompt processing at 512 t/s and a generation speed of 35 t/s.
Had to use --enforce-eager, otherwise CUDA graph compilation fails with errors.
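For reference, the invocation is roughly this (model name as above; --enforce-eager is a standard vLLM flag that skips graph capture):

vllm serve cpatonn/Qwen3-VL-4B-Instruct-AWQ-8bit --enforce-eager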
No luck with the FP8 version from Qwen - loading takes forever with this warning:
(EngineCore_DP0 pid=47943) WARNING 10-19 14:44:29 [fp8_utils.py:785] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/eugr/vllm/vllm/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=4096,device_name=Radeon_8060S_Graphics,dtype=fp8_w8a8,block_shape=[128,128].json
Update:
Qwen3 VL AWQ quants work for dense models, but not for MoE.
It even works without --enforce-eager and has better prompt processing speed, but it hangs or crashes often, so it’s better to run without CUDA graphs.
Image processing takes much longer than with a comparable model in llama.cpp. Generation is about the same speed.
Wasn’t able to run gpt-oss because MXFP4 is not supported. Not sure if non-VL Qwen MoE models will work either.
EDIT: looks like I’m getting HIP-related crashes because of amd_iommu=off - turns out amd_iommu=pt is needed for correct GTT memory management. I will test this tomorrow.
So, I’ve tried amd_iommu=pt, and it still crashes when compiling CUDA graphs, so I guess that’s not it. I also get about 5% slower speeds in llama.cpp with it, so I’ve switched back to amd_iommu=off for now.