Any way to get MTP (multi-token prediction) working with vllm/rocm?
Haven't tried yet, so no idea.
Got MTP working. TPS dropped sharply, from 15 tps to 8 tps.
Yeah, I tried it on my Spark too and it dropped from 43 t/s to 30 t/s or so.
Not much lift with v0.11.1 .. so many paths end in amdgpu: MES failed to respond to msg=REMOVE_QUEUE
vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --dtype float16 --max-num-seqs 1 --max-model-len 32768 --enforce-eager --gpu-memory-utilization 0.8
(APIServer pid=2254) INFO 11-11 02:29:37 [api_server.py:1965] vLLM API server version 0.11.1rc6
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 59.91
Total input tokens: 1000
Total generated tokens: 1000
Request throughput (req/s): 0.02
Output token throughput (tok/s): 16.69
Peak output token throughput (tok/s): 17.00
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 33.38
---------------Time to First Token----------------
Mean TTFT (ms): 893.92
Median TTFT (ms): 893.92
P99 TTFT (ms): 893.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 59.08
Median TPOT (ms): 59.08
P99 TPOT (ms): 59.08
---------------Inter-token Latency----------------
Mean ITL (ms): 59.08
Median ITL (ms): 59.02
P99 ITL (ms): 59.62
==================================================
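For anyone wanting to reproduce numbers like these: the result block above looks like the output of vLLM's bundled benchmark client. The invocation below is my reconstruction, not taken from the post; the dataset and length flags are assumptions chosen to match the reported totals (1 request, ~1000 input tokens, 1000 output tokens, request rate 10000):

```shell
# Hypothetical benchmark invocation matching the result block above.
# Flags are guesses reconstructed from the reported numbers.
vllm bench serve \
  --model cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1 \
  --request-rate 10000
```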
I dropped a reproducible uv recipe with specific git commits for deb-based distros, which should be adaptable to any distro with a few path tweaks, if you are looking to skip all the container cruft: [Question] Is GFX1201 support planned? · Issue #900 · ROCm/aiter · GitHub
Is there support for using vLLM to run gpt-oss-120b on strix halo yet? I'm under the impression that MXFP4 isn't supported on AMD, but gguf is supported and there are gguf quants, so would those be a good idea, or do they all use MXFP4 in some way under the hood (considering the similar sizes) and wouldn't work?
Looks like some (most) Linux Python packages were not uploaded to the release (because of test failures?), but there is more on the staging path:
https://rocm.nightlies.amd.com/v2-staging/
with gfx1151 / gfx120X-all / gfx110X-all (new: gfx1103 (re)added)
More details: [Issue]: torch linux not build for new gfx110X-all · Issue #1939 · ROCm/TheRock · GitHub
Ok, what can we do about it then? No vllm for the box yet?
I haven't touched vllm on Strix Halo for a couple of weeks now, so no idea if MXFP4 works there. It's even broken for DGX Spark now - you have to roll back to the Marlin kernel for it to work, so there is that
I may try when I have time.
I encountered an issue where vllm kept trying to use aiter, even with environment variables set to not use aiter. I then found out aiter doesn't support gfx1151, but there is an unmerged PR adding support here: Add gfx11XX targets by mgehre-amd · Pull Request #1498 · ROCm/aiter · GitHub. I installed that with uv pip install --no-deps "git+https://github.com/ROCm/aiter.git@mgehre-amd/gfx11"
I managed to run gpt-oss-20b, the original model, so mxfp4 support seems to be working, though it also seems a bit slow. Had to set dtype to bfloat16 and add --trust-remote-code.
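Pieced together from the description above, the launch probably looked something like this; --dtype bfloat16 and --trust-remote-code are from the post, while the model id and remaining flags are my assumptions:

```shell
# Sketch of the gpt-oss-20b launch described above. Only --dtype and
# --trust-remote-code come from the post; the rest are guesses.
vllm serve openai/gpt-oss-20b \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```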
I tried to run Qwen3-Next the same way you did, and I got:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half [rank0]:[W1202 22:27:07.836698084 ProcessGroupNCCL.cpp:1552]
I had to replace --dtype float16 with --dtype auto, which got past that issue, but now I'm getting:
TypeError: Qwen3NextMTP.forward() missing 1 required positional argument: 'intermediate_tensors'
[rank0]:[W1202 22:40:05.091632576 ProcessGroupNCCL.cpp:1552]
and I don't know how to get around this.
Looks like the fix has been merged into AITER - trying to run Qwen3-Next with a fresh VLLM build now. Looks like Qwen did something to the model, as it is re-downloading the weights.
@Eugr did you manage to make it run? How does it compare to llama.cpp in pp and tg?
Yes, it runs, but super slow - getting 11 t/s which is worse than I was getting before (16 t/s).
But the previous one was using ROCm/PyTorch nightly outside of Docker; for this one I tried the Dockerfile from VLLM. I had to uninstall and reinstall AITER, though.
I guess I need to try to compile vllm on host using my previous method and compare.
Here is what I've done now (I'm using Podman on Fedora instead of Docker, so the parameters are a little bit different; for Docker, just follow the guidance from vLLM):
Build:
mkdir vllm-docker
cd vllm-docker
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 podman build -f docker/Dockerfile.rocm -t vllm-rocm --build-arg ARG_PYTORCH_ROCM_ARCH=gfx1151 --format docker .
Run:
podman run -it --rm \
--network=host \
--group-add=keep-groups \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--security-opt label=disable \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface:Z \
vllm-rocm
Inside, run a model:
pip uninstall aiter
pip install --no-deps "git+https://github.com/ROCm/aiter.git"
vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 8888 --host 0.0.0.0 --max-model-len 32768 --max-num-seqs 10
Yes, there is something funky going on.
A few things it could be:
Pass the render and video groups, idk if needed:
podman run -it --rm \
--network=host \
--group-add=keep-groups \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--group-add render \
Is the GPU detected? Maybe the load falls back to CPU in the container: rocminfo | grep -i gfx1151
But indeed, the build inside has to be gfx1151-compatible.
At least podman/docker offer this portability and cleanliness.
No, the docker build sees the GPU just fine, builds CUDA graphs, etc. I've just also tried to build on the host and got the same 11 t/s for this model, so there has to be some regression in VLLM itself, given that I'm using the same pytorch/triton/flash_attn as before.
I don't have time to debug this now, but at least it's good that the "official" docker build works now with just a minor change (fresh aiter build).
I'll try again in a few weeks - don't have time to spend on this, as vllm support on Strix Halo is not essential for me anymore, now that I have a cluster with dual DGX Sparks. Not that it was trouble-free there, but at least I got it working with more or less acceptable performance.
I think one of the things that happened was they broke multi-token prediction for Qwen3-Next. I seem to recall it working at one point, but now attempting to enable it gets the following error on model load:
TypeError: Qwen3NextMTP.forward() missing 1 required positional argument: 'intermediate_tensors'
[rank0]:[W1202 22:40:05.091632576 ProcessGroupNCCL.cpp:1552]
Not sure if that is the cause of your performance loss, though. I'm trying to find a fix for this right now.
I can confirm that it is running at only ~9.5 t/s for me at 16-20k context.
Also, since this thread has probably been the most helpful resource for running vLLM on strix halo that I've seen, I should note for anyone who comes across this guide:
If you are getting errors that look like one of these two:
Memory access fault by GPU node-1 (Agent handle: 0x55981f276340) on address 0x7f4812b5a000. Reason: Page not present or supervisor privilege.
HW Exception by GPU node-1 (Agent handle: 0x55a709dc3390) reason :GPU Hang
You should first check your MES version by running:
sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info | grep MES
which should get you something like
MES feature version: 1, firmware version: 0x00000080
You want it to say 80 at the end (0x00000080), not 83. Version 83 causes a memory access fault almost immediately upon trying to load a model. If you update amd-gpu-firmware and linux-firmware to 20251125, the MES firmware will update to 83, and rolling that back is annoying: even if you then downgrade amd-gpu-firmware and linux-firmware, the MES firmware may not downgrade automatically (at which point consult ChatGPT or something for a guide on how to roll back MES firmware specifically; I don't quite understand the steps, so I will not repeat them here). Stay on a version before 20251125, like 20251111 or 20251021, until AMD fixes the issue.
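If you want to script that check, here is a small sketch; the sample line is hard-coded for illustration, and on a real system you would feed it the grep output from amdgpu_firmware_info above:

```shell
# Convert the hex MES firmware version to decimal and compare (bash).
# The sample line below is hard-coded for illustration only.
line="MES feature version: 1, firmware version: 0x00000080"
ver=$((16#${line##*0x}))   # 0x80 -> 128, 0x83 -> 131
if [ "$ver" -ge "$((16#83))" ]; then
  echo "MES firmware 0x$(printf %02x "$ver") - known to fault, consider rolling back"
else
  echo "MES firmware 0x$(printf %02x "$ver") - OK"
fi
```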
Secondly, even on MES firmware version 80, GPU hangs and memory access faults still happen, but only when you start hitting memory pretty heavily or when you try to do other desktop tasks while vLLM is running in the background. To fix that, add:
amdgpu.cwsr_enable=0
to your kernel parameter by doing:
sudo nano /etc/default/grub
and paste amdgpu.cwsr_enable=0 into the line that starts with GRUB_CMDLINE_LINUX=, like this:
GRUB_CMDLINE_LINUX="rhgb quiet amdgpu.cwsr_enable=0"
This is the same line in which you set GTT memory allocation. Change this, save, and reboot.
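One step the instructions above leave implicit: on most distros, editing /etc/default/grub does nothing by itself until you regenerate the grub config. Which command applies depends on the distro (both standard variants shown below; pick the one matching your setup):

```shell
# Regenerate the grub config so the new kernel parameter takes effect.
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # Fedora / RHEL-style
sudo update-grub                              # Debian / Ubuntu-style
# After rebooting, confirm the parameter is active:
grep -o 'amdgpu.cwsr_enable=0' /proc/cmdline
```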
This fixed the GPU hangs and memory access faults, at the cost of possibly some frame drops/choppiness on the desktop, but so far I've not noticed anything.
I was able to get MTP on Qwen3-Next working. However, I'm not sure it is worth it. It seems the small prediction model performs too poorly at longer context lengths and actually worsens performance: within the first few thousand tokens of context it adds maybe 15% performance, but at the 16k-token context I was testing, it costs about 10%.
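That tradeoff matches the standard back-of-the-envelope for speculative decoding: with k speculative tokens and a per-token acceptance rate a (assumed independent), the expected tokens accepted per verification step is (1 - a^(k+1)) / (1 - a), so if acceptance drops at long context, the expected lift shrinks while the draft-model overhead stays fixed. A quick sketch; the acceptance rates here are made-up illustrative numbers, not measurements:

```shell
# Expected tokens accepted per verification step with k speculative tokens:
#   E = (1 - a^(k+1)) / (1 - a), assuming i.i.d. acceptance rate a.
# a = 0.8 and a = 0.3 are made-up stand-ins for short vs long context.
awk 'BEGIN {
  k = 2
  printf "a=0.8 -> %.2f tokens/step\n", (1 - 0.8^(k+1)) / (1 - 0.8)
  printf "a=0.3 -> %.2f tokens/step\n", (1 - 0.3^(k+1)) / (1 - 0.3)
}'
```

With those numbers the expectation drops from about 2.4 to about 1.4 tokens per step, which, once the fixed draft overhead is subtracted, is consistent with MTP helping at short context and hurting at long context.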
If anyone here wants to try, there are two ways to get MTP working for Qwen3-Next: one is by using --enforce-eager (which has its own negative performance impact), and one is by patching the qwen3_next_mtp.py file in vllm (found in /vllm/model_executor/models/), which I managed to do with the help of ChatGPT. Basically, torch compile thinks there's something wrong when there actually isn't, so we either use --enforce-eager to skip torch compile entirely, or modify the file to skip torch compile for the specific problematic part, which is done as follows:
Replace
from vllm.compilation.decorators import support_torch_compile
with:
from vllm.compilation.decorators import support_torch_compile, ignore_torch_compile
then replace
@support_torch_compile
class Qwen3NextMTP(nn.Module, SupportsPP, QwenNextMixtureOfExperts):
...
with:
@ignore_torch_compile
class Qwen3NextMTP(nn.Module, SupportsPP, QwenNextMixtureOfExperts):
...
I was also told to replace:
hidden_states = self.model(
input_ids, positions, hidden_states, intermediate_tensors, inputs_embeds
)
under def forward( in the Qwen3NextMTP class with:
hidden_states = self.model(
input_ids=input_ids,
positions=positions,
hidden_states=hidden_states,
intermediate_tensors=intermediate_tensors,
inputs_embeds=inputs_embeds,
spec_step_idx=spec_step_idx,
)
Here is how I ran the model:
vllm serve /path/to/folder/containing/downloaded/model \
--served-model-name "qwen3-next-80b-a3b-thinking (vllm 4 bit)" \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 65536 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--reasoning-parser deepseek_r1 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--host 0.0.0.0 \
--port <whatever port you want here>
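Once the server is up, a quick way to confirm the MTP path actually serves requests is hitting the standard OpenAI-compatible endpoint. The host, port, and prompt below are placeholders; the model name matches the --served-model-name above:

```shell
# Smoke test against vLLM's OpenAI-compatible API (placeholder host/port).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-next-80b-a3b-thinking (vllm 4 bit)",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 32
      }'
```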
@Daniel_H212 what tps are you getting with MTP?