Compiling vLLM from source on Strix Halo

Finally put together a recipe for compiling vLLM from source on Strix Halo. Please note that while it worked yesterday, it may stop working tomorrow if they introduce any breaking changes. Please let me know if that happens.

Building vLLM for AMD Strix Halo

This recipe was tested on Fedora 43 beta, but it should work on any other recent Linux distro.
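
You’ll also need the usual build tools (compiler, CMake, Python headers). On Fedora, something like this should cover it - adjust the package names for your distro:

sudo dnf install -y git wget curl gcc gcc-c++ make cmake python3-devel openssl-devel libffi-devel tar gzip libatomic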

Prepare environment

First, install uv if you don’t have it on your system yet: Installation | uv

I prefer the pipx route, but you can use any method that works.
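
For example, with pipx it’s just:

pipx install uv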

After uv is installed, prepare a virtual environment for Python:

mkdir ~/vllm
cd ~/vllm
uv venv --python 3.13
source .venv/bin/activate

Install ROCm Python packages

First, we’ll install fresh nightly builds from TheRock:

uv pip install \
  --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ \
  "rocm[libraries,devel]"

uv pip install \
  --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ \
  --pre torch torchaudio torchvision

Then, download the full ROCm tarball that corresponds to the version PyTorch is compiled against:

ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}')
# check that it works correctly
# example output: Detected ROCm Version: 7.10.0a20251015
echo "Detected ROCm Version: $ROCM_VERSION"
# if you see the version number, download nightly tarball for that version
wget "https://therock-nightly-tarball.s3.amazonaws.com/therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz"

Extract the tarball. You can use any directory; I’m just using the current directory to keep everything self-contained:

mkdir rocm-${ROCM_VERSION}
tar xzf therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz -C rocm-${ROCM_VERSION}

Configure environment variables:

export ROCM_PATH=${PWD}/rocm-$ROCM_VERSION
export LD_LIBRARY_PATH=$ROCM_PATH/lib
export DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode  
export HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
export PYTORCH_ROCM_ARCH="gfx1151"

Check if ROCm works:

$ROCM_PATH/bin/amd-smi

You should be able to see something like this:

+------------------------------------------------------------------------------+
| AMD-SMI 26.1.0+c9ffff43      amdgpu version: Linuxver ROCm version: 7.10.0   |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c5:00.0  Radeon 8060S Graphics | N/A        N/A   0             N/A/0 W |
|   0       0     N/A             N/A | N/A        N/A             147/1024 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|  No running processes found                                                  |
+------------------------------------------------------------------------------+

Check out the vLLM repository:

git clone https://github.com/vllm-project/vllm.git 
cd vllm

Apply a workaround for the amdsmi Python package crash - if it’s imported before PyTorch, the crash does not happen:

echo "diff --git a/vllm/__init__.py b/vllm/__init__.py
index 19b2cdc67..efb2526fe 100644
--- a/vllm/__init__.py
+++ b/vllm/__init__.py
@@ -6,6 +6,7 @@
 # version library first.  Such assumption is critical for some customization.
 from .version import __version__, __version_tuple__  # isort:skip
 
+import amdsmi
 import typing
 
 # The environment variables override should be imported before any other" | patch -p1

Check that the patch was applied:

git diff

You should see this:

diff --git a/vllm/__init__.py b/vllm/__init__.py
index 19b2cdc67..efb2526fe 100644
--- a/vllm/__init__.py
+++ b/vllm/__init__.py
@@ -6,6 +6,7 @@
# version library first.  Such assumption is critical for some customization.
from .version import __version__, __version_tuple__  # isort:skip

+import amdsmi
import typing

# The environment variables override should be imported before any other

If everything looks good, install the dependencies and build vLLM:

# Uninstall amdsmi for now; we'll reinstall it later - crashing as of 10/24/2025
uv pip uninstall amdsmi

# use existing Torch
python use_existing_torch.py 

# Install dependencies
uv pip install --upgrade numba \
    scipy \
    huggingface-hub[cli,hf_transfer] \
    setuptools_scm
uv pip install "numpy<2"
uv pip install -r requirements/rocm.txt

# Build vLLM
python setup.py develop

# Reinstall amdsmi from the extracted ROCm tree
uv pip install ${ROCM_PATH}/share/amd_smi

It will generate some warnings, but you can ignore them as long as the command completes successfully.

To test:

vllm --version
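
You can also do a quick sanity check that PyTorch sees the GPU through ROCm (the ROCm builds expose the device via the torch.cuda API), something like:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# should print True followed by the GPU name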

Install Flash Attention

cd ..
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
git checkout main_perf

Patch setup.py with the same amdsmi workaround:

echo "diff --git a/setup.py b/setup.py
index d54e93f6..f7d282df 100644
--- a/setup.py
+++ b/setup.py
@@ -19,6 +19,7 @@ import urllib.request
 import urllib.error
 from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
 
+import amdsmi
 import torch
 from torch.utils.cpp_extension import (
     BuildExtension," | patch -p1

Build:

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
cd ..
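
As a quick check that the Triton flash-attention build imports cleanly (importing amdsmi first, same workaround as above):

python -c "import amdsmi; import flash_attn; print(flash_attn.__version__)"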

You can then run vLLM like this:

vllm serve cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit \
--dtype float16 --max-model-len 32768 \
--allowed-local-media-path / \
--limit-mm-per-prompt '{"image": 3, "video": 1}' \
-tp 1 --max-num-seqs 1 \
--port 8888 --host 0.0.0.0
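
Once the server is up, you can test it with a plain OpenAI-style request, e.g.:

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'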

Configure environment

Create a script ~/vllm/vllm_env.sh:

#!/bin/bash

source .venv/bin/activate

ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}')
export ROCM_PATH=${PWD}/rocm-$ROCM_VERSION
export LD_LIBRARY_PATH=$ROCM_PATH/lib
export DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode  
export HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
export PYTORCH_ROCM_ARCH="gfx1151"

Running vLLM

Before running vLLM, you need to activate the virtual environment and set the variables first (unless you’ve already done that in the current session). Just run:

cd ~/vllm
source vllm_env.sh

And then you can run vLLM!
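
Putting it together, a full session looks roughly like this (same model and flags as in the earlier example):

cd ~/vllm
source vllm_env.sh
vllm serve cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit \
  --dtype float16 --max-model-len 32768 \
  --port 8888 --host 0.0.0.0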

Known issues

  • FP8 models don’t work yet. Use AWQ 8-bit quants instead.
  • MXFP4 is also not supported yet (gpt-oss models).
  • BF16 models work fine.

Include --dtype float16 in your parameter list for better performance.

If you’re experiencing HIP-related crashes, try limiting --max-num-seqs, e.g.:

vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 8888 --host 0.0.0.0 --max-model-len 32768 --max-num-seqs 10

Tested Models

  • Qwen/Qwen3-VL-4B-Instruct
  • cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit
  • cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-8bit
  • cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit - crashes without a max-num-seqs limit:
    vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 8888 --host 0.0.0.0 --max-model-len 32768 --max-num-seqs 10

Why install the devel extra and not use it? Is it needed?
Did it not work with just rocm[devel]?

I don’t remember exactly, but I believe there was some missing dependency if you don’t install this package.

Now I need to find out how to build llama.cpp with the Python packages:

I may look at that too, but I don’t have much time right now.

I found it easier to just get the ROCm tarball in addition to the Python packages.

Failed to start: AssertionError: expected size 4==2048, stride 2048==2048 at dim=0

I’ve compiled your examples into a dockerfile so others can set this up.

This does work, but it’s a bit flaky; the major issue is the flash attention mechanism.

# syntax=docker/dockerfile:1.4
FROM fedora:43
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

ARG ROCM_INDEX_URL=https://rocm.nightlies.amd.com/v2/gfx1151/

WORKDIR /opt/vllm-build
RUN uv venv --python 3.13
ENV VIRTUAL_ENV=/opt/vllm-build/.venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN dnf install -y \
    wget \
    curl \
    git \
    gcc \
    gcc-c++ \
    make \
    cmake \
    python3-pip \
    python3-devel \
    openssl-devel \
    libffi-devel \
    ca-certificates \
    tar \
    gzip \
    libatomic \
    && dnf clean all

# Install ROCm dependencies from nightly builds with cache
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --index-url ${ROCM_INDEX_URL} "rocm[libraries,devel]" && \
    uv pip install --index-url ${ROCM_INDEX_URL} --pre torch torchaudio torchvision

# Download and extract ROCm tarball
RUN --mount=type=cache,target=/var/cache/rocm-downloads \
    ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}') && \
    echo "Detected ROCm Version: $ROCM_VERSION" && \
    #
    # Fetch the ROCm tarball from cache or download
    #
    TARBALL="therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz" && \
    if [ -f "/var/cache/rocm-downloads/${TARBALL}" ]; then \
        echo "Using cached ROCm tarball" && \
        ln -s "/var/cache/rocm-downloads/${TARBALL}" . ; \
    else \
        echo "Downloading ROCm tarball" && \
        curl -#LO "https://therock-nightly-tarball.s3.amazonaws.com/${TARBALL}" && \
        cp "${TARBALL}" "/var/cache/rocm-downloads/${TARBALL}" ; \
    fi && \
    #
    # Extract tarball
    #
    echo "Extracting ROCm from ${TARBALL}" && \
    mkdir -p rocm-${ROCM_VERSION} && \
    tar xzf ${TARBALL} -C rocm-${ROCM_VERSION} && \
    rm ${TARBALL} && \
    #
    # Link ROCm version
    #
    echo "${ROCM_VERSION}" > /opt/rocm_version.txt && \
    ln -s /opt/vllm-build/rocm-${ROCM_VERSION} /opt/rocm-current


ENV ROCM_PATH=/opt/rocm-current
ENV LD_LIBRARY_PATH=$ROCM_PATH/lib
ENV DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
ENV HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
ENV PYTORCH_ROCM_ARCH="gfx1151"
ENV CUDA_HOME=/opt/rocm-current
ENV VLLM_TORCH_COMPILE_LEVEL=0

# Validate ROCm installation
RUN AMD_SMI_VALIDATION=$($ROCM_PATH/bin/amd-smi) && \
    echo "AMD SMI Validation Output:" && \
    echo "$AMD_SMI_VALIDATION"

# Clone vllm
RUN --mount=type=cache,target=/root/.cache/git \
    echo "Cloning vllm repository..." && \
    git clone https://github.com/vllm-project/vllm.git

WORKDIR /opt/vllm-build/vllm

# Patch vllm
RUN sed -i '/from \.version import __version__/a import amdsmi' vllm/__init__.py

# Build vllm
RUN --mount=type=cache,target=/root/.cache/uv \
    echo "Building vllm..." && \
    export ROCM_PATH=/opt/rocm-current && \
    export LD_LIBRARY_PATH=/opt/rocm-current/lib && \
    export DEVICE_LIB_PATH=/opt/rocm-current/llvm/amdgcn/bitcode && \
    export HIP_DEVICE_LIB_PATH=/opt/rocm-current/llvm/amdgcn/bitcode && \
    export HIP_VISIBLE_DEVICES=-1 && \
    export ROCR_VISIBLE_DEVICES=-1 && \
    # Remove amdsmi if installed to avoid crash
    uv pip uninstall amdsmi && \
    uv pip install "numpy<2" && \
    python use_existing_torch.py && \
    uv pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm && \
    uv pip install -r requirements/rocm.txt && \
    python setup.py develop && \
    # Reinstall amdsmi from latest ROCm
    uv pip install ${ROCM_PATH}/share/amd_smi && \
    echo "vllm build complete."

# Clone and install flash-attention
RUN --mount=type=cache,target=/root/.cache/git \
    echo "Cloning flash-attention repository..." && \
    cd /opt/vllm-build && \
    git clone https://github.com/ROCm/flash-attention.git && \
    cd flash-attention && \
    git checkout main_perf

WORKDIR /opt/vllm-build/flash-attention

# Patch flash-attention for ROCm compatibility - v3 cache bust
RUN echo "Patching flash-attention for ROCm..." && \
    sed -i '/from wheel.bdist_wheel import bdist_wheel/a import amdsmi' setup.py && \
    sed -i '1i import amdsmi' flash_attn/__init__.py && \
    echo "Verifying patch applied:" && \
    head -3 flash_attn/__init__.py

ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
RUN echo "Installing flash-attention with ROCm support..." && \
    python setup.py develop


# build
docker build -t vllm-server-strix ./vllm -f ./vllm/Dockerfile.vllm
docker run --rm -it \
  -p 8000:8000 \
  -v /home/USER_PATH/models:/workspace/models:Z \
  --device=/dev/kfd \
  --device=/dev/dri \
  --ipc=host \
  --security-opt seccomp=unconfined \
  --cap-add SYS_PTRACE \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864:67108864 \
  --group-add video \
  --group-add render \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  -e HF_HOME=/workspace/models \
  -e VLLM_TORCH_COMPILE_LEVEL=3 \
  -e VLLM_ATTENTION_BACKEND=XFORMERS \
  -e HIP_VISIBLE_DEVICES=0 \
  -e GPU_DEVICE_ORDINAL=0 \
  services-vllm-server \
  vllm serve /workspace/models/data/cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit \
    --dtype float16 \
    --max-num-seqs 1 \
    --max-model-len 32768 \
    --enforce-eager \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.8 \
    --quantization compressed-tensors

However, I’m getting terrible performance: 4-5 tok/s.

(APIServer pid=1) INFO 11-01 19:20:00 [loggers.py:215] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%


What OS/kernel are you running? I only tested on Fedora 43 with the 6.17.5 kernel.
If you are running a kernel below 6.16, it may be missing some important bits.

But it could be that vLLM introduced some breaking changes between then and now, too. I’ll try to recompile tomorrow.

Something is definitely wrong. I was getting >30 t/s on that model (I don’t remember the exact number).


Just throwing in some of my experience.

Setup

I just cloned the vllm repo and followed the description in the AMD ROCm tab of the vLLM docs (GPU - vLLM).

It basically needs these commands:

DOCKER_BUILDKIT=1 docker build --build-arg ARG_PYTORCH_ROCM_ARCH=gfx1151 -f docker/Dockerfile.rocm -t vllm-rocm .

That takes a little while to build. Then I can start the container via (change <path/to/model>):

docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/models \
-e HF_HOME="/app/models" \
vllm-rocm

Within it I can then run vLLM, e.g.:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8888 --host 0.0.0.0

The newer ROCm base build seems to support gfx1151, alias Strix Halo, as can be seen in Dockerfile.rocm_base in the vLLM repository. (I stupidly built that one first until I realised that it is already the base for Dockerfile.rocm - it took some hours. Don’t do it, it is not necessary :slight_smile: )


Issue

Unfortunately, I get a black screen when prompting, and it leads to this error (even with other docker run options):

(APIServer pid=9) INFO:     10.0.0.3:48718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
HW Exception by GPU node-1 (Agent handle: 0x2095d450) reason :GPU Hang
HW Exception by GPU node-1 (Agent handle: 0x33efec90) reason :GPU Hang
Aborted (core dumped)

Someone seems to be on it already. It might be related to the newest version. I read that ROCm 6.4.4 would still be the better way to go.

Besides that, I am wondering how much of an improvement vLLM is for a local setup if models need to be switched more often. Since GGUF on llama.cpp seems to be much faster at loading a model, I might just go with it, especially since there has been a Qwen3-VL GGUF since yesterday (haven’t tried it though). What are your motivations for using vLLM?

The main motivation is being able to run models that are not supported by llama.cpp yet. Until Friday it was mainly Qwen3-VL, but there is also Qwen3-Next, and all new models get vLLM support first anyway.

The second reason is serving concurrent users - vLLM handles that better than llama.cpp.

There are some other reasons too, but generally, especially for solo use, llama.cpp is much faster.


OK, I just checked, and I’m getting 16 t/s on my Strix Halo. Try getting rid of --enforce-eager; maybe that will help. If not, then it could be related to your kernel.
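
That is, roughly the same serve command, just without that flag:

vllm serve /workspace/models/data/cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit \
  --dtype float16 \
  --max-num-seqs 1 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --quantization compressed-tensors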

I was getting GPU hang errors without it. I’ll give this another try next week, I think.

Your guide was super great. Glad to know there was indeed an issue with the token generation; it appeared to only use about 20% of the GPU at peak.

6.17.5-200.fc42.x86_64 on Fedora 42

You use Docker on Fedora? Not Podman?

OK, the same kernel, so that’s good. I don’t use Docker or even Podman - maybe that contributes as well.

Qwen3-Next-80b, awq-4bit, Avg generation throughput: 15.5 tokens/s

Yep, that’s about right.

Same as yours.

pytorch/ao: PyTorch native quantization and sparsity for training and inference

You need this library to make it work.
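
If so, it should just be a matter of installing it into the same venv before serving - presumably something like:

uv pip install torchao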