The only thing I can think of is that these kernel parameters are not set:
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
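For reference, a minimal sketch of how to add these on a GRUB-based setup (the file path and regeneration command are assumptions and vary by distro):

# Assumption: GRUB bootloader; with systemd-boot or others the kernel command line lives elsewhere.
# 1. Append the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#    GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
# 2. Regenerate the GRUB config and reboot:
sudo grub-mkconfig -o /boot/grub/grub.cfg   # or `sudo update-grub` on Debian/Ubuntu
sudo reboot
# 3. Verify the parameters are active:
cat /proc/cmdline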
struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params) {
    FILE * file = ggml_fopen(fname, "rb");
    if (!file) {
        GGML_LOG_ERROR("%s: failed to open GGUF file '%s'\n", __func__, fname);
        return nullptr;
    }

    struct gguf_context * result = gguf_init_from_file_impl(file, params);
    fclose(file);
    return result;
}
=> I would say it’s more of a path or file permissions issue.
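A quick way to check that (the model path below is a placeholder) is to verify the file exists and is readable by the user running llama.cpp:

MODEL=/path/to/model.gguf   # placeholder - use your actual model path
ls -l "$MODEL"              # does it exist, and who owns it?
test -r "$MODEL" && echo "readable" || echo "not readable by $(whoami)"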
@kyuz0 Just wanna drop a thank you for the toolboxes. Made setting things up a whole lot easier. Been using the toolboxes with Arch (Omarchy). Image-gen has been fun and LLM performance is pretty good, but I wonder if I can squeeze out more? To be fair, I’ve only been using vulkan-amdvlk.
I’ll make time to post benchmarks soon.
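Probably something along the lines of llama-bench from llama.cpp so the numbers are comparable (model path is a placeholder):

# 512-token prompt processing and 128-token generation, the usual defaults
./llama-bench -m /path/to/model.gguf -p 512 -n 128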
I’ve run into this issue once. First, make sure the specified model has been downloaded completely and is in the correct directory/path.
Thank you! I’d say there’s room for improvement. For llama.cpp, I know one of the developers is working on optimized kernels for AMD GPUs; once that’s done in a few months, you’ll see better performance.
For PyTorch-based workflows there’s also room for improvement: as soon as AMD fixes their TheRock pipelines and ships wheels with AOTriton, performance will improve.
I found out why my models weren’t loading. I had to convert them to GGUF format (not a horrible process) prior to running them. Basically I ran llama.cpp and some Python scripts it includes to do the conversion. Once converted, all the models worked fine. I loaded a 70B parameter model into memory and it worked quite well.
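For anyone hitting the same thing, the conversion goes roughly like this (script and tool names are from the llama.cpp repo; paths and the quant type are placeholders):

# from a llama.cpp checkout, with its Python requirements installed
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/hf-model-dir --outfile model-f16.gguf --outtype f16
# optionally quantize afterwards, e.g. to Q4_K_M (llama-quantize is built along with llama.cpp)
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M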
On Hugging Face you typically find the GGUFs for most models ready to download.
Many can be found here: bartowski (Bartowski)
He does a great job.
I usually check unsloth first, and then grab a bartowski quant if an unsloth one isn’t available.
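Once you’ve picked a quant, huggingface-cli can pull just the files you need; the repo and file pattern below are only an example:

# example only: grab a single Q4_K_M quant from one of bartowski’s repos
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF --include "*Q4_K_M*" --local-dir ./models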
BTW, vLLM 0.11.0 and above include support for gfx1151.
I managed to build from the main branch - had to apply @kyuz0’s fixes for amdsmi (which still crashes) and platform detection, but other than that, I just followed the vLLM installation instructions for ROCm and was able to compile (had to set a few extra env vars though).
Successfully loaded cpatonn/Qwen3-VL-4B-Instruct-AWQ-8bit and was able to process an image with prompt processing at 512 t/s and a generation speed of 35 t/s.
Had to use --enforce-eager, otherwise CUDA graph compilation fails with errors.
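For reference, the invocation is roughly this (model name as above; --enforce-eager is a standard vLLM flag that skips graph capture):

vllm serve cpatonn/Qwen3-VL-4B-Instruct-AWQ-8bit --enforce-eager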
No luck with the FP8 version from Qwen - loading takes forever with this warning:
(EngineCore_DP0 pid=47943) WARNING 10-19 14:44:29 [fp8_utils.py:785] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/eugr/vllm/vllm/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=4096,device_name=Radeon_8060S_Graphics,dtype=fp8_w8a8,block_shape=[128,128].json
Update:
Qwen3 VL AWQ quants work for dense models, but not for MoE.
It even works without --enforce-eager and has better prompt processing speed, but it hangs or crashes often, so it’s better to run without CUDA graphs.
Image processing takes much longer than with a comparable model in llama.cpp. Generation is about the same speed.
Wasn’t able to run gpt-oss because MXFP4 is not supported. Not sure if non-VL Qwen MoE models will work either.
EDIT: looks like I’m getting HIP-related crashes because of amd_iommu=off - turns out amd_iommu=pt is needed for correct GTT memory management. I will test this tomorrow.
So, I’ve tried amd_iommu=pt, and it still crashes when compiling CUDA graphs, so I guess that’s not it. I also get about 5% slower speeds in llama.cpp with it, so I’ve switched back to amd_iommu=off for now.