Any way to get MTP (multi-token prediction) working with vllm/rocm?
Haven't tried yet, so no idea.
Got MTP working. TPS dropped sharply, from 15 tps to 8 tps.
Yeah, I tried it on my Spark too and it dropped from 43 t/s to 30 t/s or so.
Not much lift with v0.11.1 .. so many paths end in amdgpu: MES failed to respond to msg=REMOVE_QUEUE
vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --dtype float16 --max-num-seqs 1 --max-model-len 32768 --enforce-eager --gpu-memory-utilization 0.8
(APIServer pid=2254) INFO 11-11 02:29:37 [api_server.py:1965] vLLM API server version 0.11.1rc6
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 59.91
Total input tokens: 1000
Total generated tokens: 1000
Request throughput (req/s): 0.02
Output token throughput (tok/s): 16.69
Peak output token throughput (tok/s): 17.00
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 33.38
---------------Time to First Token----------------
Mean TTFT (ms): 893.92
Median TTFT (ms): 893.92
P99 TTFT (ms): 893.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 59.08
Median TPOT (ms): 59.08
P99 TPOT (ms): 59.08
---------------Inter-token Latency----------------
Mean ITL (ms): 59.08
Median ITL (ms): 59.02
P99 ITL (ms): 59.62
==================================================
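For anyone wanting to reproduce numbers like these: the result block above looks like the output of vLLM's bundled benchmark client. The invocation below is my reconstruction, not taken from the post; the dataset and length flags are assumptions chosen to match the reported totals (1 request, ~1000 input tokens, 1000 output tokens, request rate 10000):

```shell
# Hypothetical benchmark invocation matching the result block above.
# Flags are guesses reconstructed from the reported numbers.
vllm bench serve \
  --model cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1 \
  --request-rate 10000
```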
I dropped a reproducible uv recipe with specific git commits for deb-based distros, which should be adaptable to any distro with a few path tweaks, if you are looking to skip all the container cruft: [Question] Is GFX1201 support planned? · Issue #900 · ROCm/aiter · GitHub
Is there support for using vLLM to run gpt-oss-120b on strix halo yet? I'm under the impression that MXFP4 isn't supported on AMD, but gguf is supported and there are gguf quants, so would those be a good idea, or do they all use MXFP4 in some way under the hood (considering the similar sizes) and wouldn't work?
Looks like some (most) Linux Python packages were not uploaded to the release (because of test failures?), but there is more on the staging path:
https://rocm.nightlies.amd.com/v2-staging/
with gfx1151 / gfx120X-all / gfx110X-all (new: gfx1103 (re)added)
More details: [Issue]: torch linux not build for new gfx110X-all · Issue #1939 · ROCm/TheRock · GitHub
Ok, what can we do about it then? No vllm for the box yet?
I haven't touched vllm on Strix Halo for a couple of weeks now, so no idea if MXFP4 works there. It's even broken for DGX Spark now - you have to roll back to the Marlin kernel for it to work, so there is that
I may try when I have time.
I encountered an issue where vllm kept trying to use aiter, even with environment variables set to not use aiter. I then found out aiter doesn't support gfx1151, but there is an unmerged PR adding support here: Add gfx11XX targets by mgehre-amd · Pull Request #1498 · ROCm/aiter · GitHub. I installed that with uv pip install --no-deps "git+https://github.com/ROCm/aiter.git@mgehre-amd/gfx11"
I managed to run gpt-oss-20b, the original model, so mxfp4 support seems to be working, though it also seems a bit slow. Had to set dtype to bfloat16 and add --trust-remote-code.
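Pieced together from the description above, the launch probably looked something like this; --dtype bfloat16 and --trust-remote-code are from the post, while the model id and remaining flags are my assumptions:

```shell
# Sketch of the gpt-oss-20b launch described above. Only --dtype and
# --trust-remote-code come from the post; the rest are guesses.
vllm serve openai/gpt-oss-20b \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```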
I tried to run Qwen3-Next the same way you did, and I got:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half [rank0]:[W1202 22:27:07.836698084 ProcessGroupNCCL.cpp:1552]
I had to replace --dtype float16 with --dtype auto, which got past that issue, but now I'm getting:
TypeError: Qwen3NextMTP.forward() missing 1 required positional argument: 'intermediate_tensors'
[rank0]:[W1202 22:40:05.091632576 ProcessGroupNCCL.cpp:1552]
and I don't know how to get around this.
Looks like the fix has been merged into AITER - trying to run Qwen3-Next with a fresh VLLM build now. Looks like Qwen did something to the model, as it is re-downloading the weights.
@Eugr did you manage to make it run? How does it compare to llama.cpp in pp and tg?
Yes, it runs, but super slow - getting 11 t/s which is worse than I was getting before (16 t/s).
But the previous one was using ROCm/PyTorch nightly outside of Docker; for this one I tried the Dockerfile from VLLM. I had to uninstall and reinstall AITER, though.
I guess I need to try to compile vllm on host using my previous method and compare.
Here is what I've done now (I'm using Podman on Fedora instead of Docker, so the parameters are a little bit different; for Docker, just follow the guidance from vLLM):
Build:
mkdir vllm-docker
cd vllm-docker
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 podman build -f docker/Dockerfile.rocm -t vllm-rocm --build-arg ARG_PYTORCH_ROCM_ARCH=gfx1151 --format docker .
Run:
podman run -it --rm \
--network=host \
--group-add=keep-groups \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--security-opt label=disable \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface:Z \
vllm-rocm
Inside, run a model:
pip uninstall aiter
pip install --no-deps "git+https://github.com/ROCm/aiter.git"
vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 8888 --host 0.0.0.0 --max-model-len 32768 --max-num-seqs 10
Yes, there is something funky going on.
A few things it could be:
Pass the render and video groups, idk if needed:
podman run -it --rm \
--network=host \
--group-add=keep-groups \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--group-add render \
Is the GPU detected? Maybe the load falls back to CPU in the container: rocminfo | grep -i gfx1151
But indeed, the build inside has to be gfx1151-compatible.
At least podman/docker offer this portability and cleanliness.
No, the docker build sees the GPU just fine, builds CUDA graphs, etc. I've just also tried to build on the host and got the same 11 t/s for this model, so there has to be some regression in VLLM itself, given that I'm using the same pytorch/triton/flash_attn as before.
I don't have time to debug this now, but at least it's good that the "official" docker build works now with just a minor change (fresh aiter build).
I'll try again in a few weeks - don't have time to spend on this, as vllm support on Strix Halo is not essential for me anymore, now that I have a cluster with dual DGX Sparks. Not that it was trouble-free there, but at least I got it working with more or less acceptable performance.
I think one of the things that happened was they broke multi-token prediction for Qwen3-Next. I seem to recall it working at one point, but now attempting to enable it gets the following error on model load:
TypeError: Qwen3NextMTP.forward() missing 1 required positional argument: 'intermediate_tensors'
[rank0]:[W1202 22:40:05.091632576 ProcessGroupNCCL.cpp:1552]
Not sure if that is the cause of your performance loss, though. I'm trying to find a fix for this right now.
I can confirm that it is running at only ~9.5 t/s for me at 16-20k context.
Also, since this thread has probably been the most helpful resource for running vLLM on strix halo that I've seen, I should note for anyone who comes across this guide:
If you are getting errors that look like one of these two:
Memory access fault by GPU node-1 (Agent handle: 0x55981f276340) on address 0x7f4812b5a000. Reason: Page not present or supervisor privilege.
HW Exception by GPU node-1 (Agent handle: 0x55a709dc3390) reason :GPU Hang
You should first check your MES version by running:
sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info | grep MES
which should get you something like
MES feature version: 1, firmware version: 0x00000080
You want it to say 80 at the end (0x00000080), not 83. Version 83 causes a memory access fault almost immediately upon trying to load a model. If you update amd-gpu-firmware and linux-firmware to 20251125, the MES firmware will update to 83, and rolling that back is annoying: even if you then downgrade amd-gpu-firmware and linux-firmware, the MES firmware may not downgrade automatically (at which point consult ChatGPT or something for a guide on how to roll back MES firmware specifically; I don't quite understand the steps, so I will not repeat them here). Stay on a version before 20251125, like 20251111 or 20251021, until AMD fixes the issue.
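If you want to script that check, here is a small sketch; the sample line is hard-coded for illustration, and on a real system you would feed it the grep output from amdgpu_firmware_info above:

```shell
# Convert the hex MES firmware version to decimal and compare (bash).
# The sample line below is hard-coded for illustration only.
line="MES feature version: 1, firmware version: 0x00000080"
ver=$((16#${line##*0x}))   # 0x80 -> 128, 0x83 -> 131
if [ "$ver" -ge "$((16#83))" ]; then
  echo "MES firmware 0x$(printf %02x "$ver") - known to fault, consider rolling back"
else
  echo "MES firmware 0x$(printf %02x "$ver") - OK"
fi
```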
Secondly, even on MES firmware version 80, GPU hangs and memory access faults still happen, but only when you start hitting memory pretty heavily or when you try to do other desktop tasks while vLLM is running in the background. To fix that, add:
amdgpu.cwsr_enable=0
to your kernel parameter by doing:
sudo nano /etc/default/grub
and paste amdgpu.cwsr_enable=0 into the line that starts with GRUB_CMDLINE_LINUX=, like this:
GRUB_CMDLINE_LINUX="rhgb quiet amdgpu.cwsr_enable=0"
This is the same line in which you set GTT memory allocation. Change this, save, and reboot.
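One step the instructions above leave implicit: on most distros, editing /etc/default/grub does nothing by itself until you regenerate the grub config. Which command applies depends on the distro (both standard variants shown below; pick the one matching your setup):

```shell
# Regenerate the grub config so the new kernel parameter takes effect.
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # Fedora / RHEL-style
sudo update-grub                              # Debian / Ubuntu-style
# After rebooting, confirm the parameter is active:
grep -o 'amdgpu.cwsr_enable=0' /proc/cmdline
```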
This fixed the GPU hangs and memory access faults, at the cost of possibly some frame drops/choppiness on the desktop, but so far I've not noticed anything.
I was able to get MTP on Qwen3-Next working. However, I'm not sure it is worth it. It seems the small prediction model performs too poorly at longer context lengths and actually worsens performance: within the first few thousand tokens of context it adds maybe 15% performance, but at the 16k-token context I was testing, it costs about 10%.
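That tradeoff matches the standard back-of-the-envelope for speculative decoding: with k speculative tokens and a per-token acceptance rate a (assumed independent), the expected tokens accepted per verification step is (1 - a^(k+1)) / (1 - a), so if acceptance drops at long context, the expected lift shrinks while the draft-model overhead stays fixed. A quick sketch; the acceptance rates here are made-up illustrative numbers, not measurements:

```shell
# Expected tokens accepted per verification step with k speculative tokens:
#   E = (1 - a^(k+1)) / (1 - a), assuming i.i.d. acceptance rate a.
# a = 0.8 and a = 0.3 are made-up stand-ins for short vs long context.
awk 'BEGIN {
  k = 2
  printf "a=0.8 -> %.2f tokens/step\n", (1 - 0.8^(k+1)) / (1 - 0.8)
  printf "a=0.3 -> %.2f tokens/step\n", (1 - 0.3^(k+1)) / (1 - 0.3)
}'
```

With those numbers the expectation drops from about 2.4 to about 1.4 tokens per step, which, once the fixed draft overhead is subtracted, is consistent with MTP helping at short context and hurting at long context.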
If anyone here wants to try, there are two ways to get MTP working for Qwen3-Next: one is by using --enforce-eager (which has its own negative performance impact), and one is by patching the qwen3_next_mtp.py file in vllm (found in /vllm/model_executor/models/), which I managed to do with the help of ChatGPT. Basically, torch compile thinks there's something wrong when there actually isn't, so we either use --enforce-eager to skip torch compile entirely, or modify the file to skip torch compile for the specific problematic part, which is done as follows:
Replace
from vllm.compilation.decorators import support_torch_compile
with:
from vllm.compilation.decorators import support_torch_compile, ignore_torch_compile
then replace
@support_torch_compile
class Qwen3NextMTP(nn.Module, SupportsPP, QwenNextMixtureOfExperts):
...
with:
@ignore_torch_compile
class Qwen3NextMTP(nn.Module, SupportsPP, QwenNextMixtureOfExperts):
...
I was also told to replace:
hidden_states = self.model(
input_ids, positions, hidden_states, intermediate_tensors, inputs_embeds
)
under def forward( in the Qwen3NextMTP class with:
hidden_states = self.model(
input_ids=input_ids,
positions=positions,
hidden_states=hidden_states,
intermediate_tensors=intermediate_tensors,
inputs_embeds=inputs_embeds,
spec_step_idx=spec_step_idx,
)
Here is how I ran the model:
vllm serve /path/to/folder/containing/downloaded/model \
--served-model-name "qwen3-next-80b-a3b-thinking (vllm 4 bit)" \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 65536 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--reasoning-parser deepseek_r1 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--host 0.0.0.0 \
--port <whatever port you want here>
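Once the server is up, a quick way to confirm the MTP path actually serves requests is hitting the standard OpenAI-compatible endpoint. The host, port, and prompt below are placeholders; the model name matches the --served-model-name above:

```shell
# Smoke test against vLLM's OpenAI-compatible API (placeholder host/port).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-next-80b-a3b-thinking (vllm 4 bit)",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 32
      }'
```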
@Daniel_H212 what tps are you getting with MTP?