I finally put together a recipe for compiling vLLM from source on Strix Halo. Please note that while it worked yesterday, it may stop working tomorrow if breaking changes land upstream. Please let me know if it stops working for you.
Building vLLM for AMD Strix Halo
This recipe was tested on Fedora 43 beta, but it should work on any other recent Linux distro.
Prepare environment
First, install uv if you don’t have it on your system yet (see the Installation page in the uv docs).
I prefer the pipx route, but you can use any method that works.
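For example, with pipx (assuming pipx is already on your system; any of the documented install methods works just as well):
pipx install uv
uv --version   # confirm uv is on your PATH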
After uv is installed, prepare a Python virtual environment:
mkdir ~/vllm
cd ~/vllm
uv venv --python 3.13
source .venv/bin/activate
Install ROCm Python packages
First, we’ll install fresh nightly builds from TheRock:
uv pip install \
--index-url https://rocm.nightlies.amd.com/v2/gfx1151/ \
"rocm[libraries,devel]"
uv pip install \
--index-url https://rocm.nightlies.amd.com/v2/gfx1151/ \
--pre torch torchaudio torchvision
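At this point you can optionally confirm that the nightly PyTorch build actually sees the GPU (a quick sanity check; on ROCm builds PyTorch reuses the CUDA API names, so torch.cuda is correct here):
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"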
Then, download the full ROCm tarball that corresponds to the version PyTorch is compiled against:
ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}')
# check that it works correctly
# example output: Detected ROCm Version: 7.10.0a20251015
echo "Detected ROCm Version: $ROCM_VERSION"
# if you see the version number, download nightly tarball for that version
wget "https://therock-nightly-tarball.s3.amazonaws.com/therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz"
Extract the tarball. You can use any directory; I’m just using the current directory to keep everything self-contained:
mkdir rocm-${ROCM_VERSION}
tar xzf therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz -C rocm-${ROCM_VERSION}
Configure environment variables:
export ROCM_PATH=${PWD}/rocm-$ROCM_VERSION
export LD_LIBRARY_PATH=$ROCM_PATH/lib
export DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
export HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
export PYTORCH_ROCM_ARCH="gfx1151"
Check if ROCm works:
$ROCM_PATH/bin/amd-smi
You should be able to see something like this:
+------------------------------------------------------------------------------+
| AMD-SMI 26.1.0+c9ffff43 amdgpu version: Linuxver ROCm version: 7.10.0 |
| VBIOS version: 023.011.000.039.000001 |
| Platform: Linux Baremetal |
|-------------------------------------+----------------------------------------|
| BDF GPU-Name | Mem-Uti Temp UEC Power-Usage |
| GPU HIP-ID OAM-ID Partition-Mode | GFX-Uti Fan Mem-Usage |
|=====================================+========================================|
| 0000:c5:00.0 Radeon 8060S Graphics | N/A N/A 0 N/A/0 W |
| 0 0 N/A N/A | N/A N/A 147/1024 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes: |
| GPU PID Process Name GTT_MEM VRAM_MEM MEM_USAGE CU % |
|==============================================================================|
| No running processes found |
+------------------------------------------------------------------------------+
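Optionally, rocminfo from the same tarball should also report the gfx1151 agent (assuming it is shipped in the nightly dist, which it normally is):
$ROCM_PATH/bin/rocminfo | grep -m1 gfx1151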
Check out the vLLM repository:
git clone https://github.com/vllm-project/vllm.git
cd vllm
Apply a workaround for the amdsmi Python package crash: if amdsmi is imported before PyTorch, the crash does not happen:
echo "diff --git a/vllm/__init__.py b/vllm/__init__.py
index 19b2cdc67..efb2526fe 100644
--- a/vllm/__init__.py
+++ b/vllm/__init__.py
@@ -6,6 +6,7 @@
# version library first. Such assumption is critical for some customization.
from .version import __version__, __version_tuple__ # isort:skip
+import amdsmi
import typing
# The environment variables override should be imported before any other" | patch -p1
Check that the patch was applied:
git diff
You should see this:
diff --git a/vllm/__init__.py b/vllm/__init__.py
index 19b2cdc67..efb2526fe 100644
--- a/vllm/__init__.py
+++ b/vllm/__init__.py
@@ -6,6 +6,7 @@
 # version library first. Such assumption is critical for some customization.
 from .version import __version__, __version_tuple__ # isort:skip
 
+import amdsmi
 import typing
 
 # The environment variables override should be imported before any other
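If the patch refuses to apply (diff whitespace is easy to mangle when copy-pasting), you can make the same one-line edit by hand or with sed. This is just an equivalent shortcut, not part of the original recipe:
# insert "import amdsmi" immediately before the first "import typing" line
sed -i '0,/^import typing$/s//import amdsmi\nimport typing/' vllm/__init__.py
git diff   # verify the result matches the diff shown above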
If everything looks right, install the dependencies and build vLLM:
# Uninstall amdsmi; we'll reinstall it later - crashing as of 10/24/2025
uv pip uninstall amdsmi
# use existing Torch
python use_existing_torch.py
# Install dependencies
uv pip install --upgrade numba \
scipy \
huggingface-hub[cli,hf_transfer] \
setuptools_scm
uv pip install "numpy<2"
uv pip install -r requirements/rocm.txt
# Build vLLM
python setup.py develop
# Reinstall amd_smi from the ROCm tree
# (if you don't have a system ROCm under /opt/rocm, the same package should
# also be available in the extracted tarball at $ROCM_PATH/share/amd_smi)
uv pip install /opt/rocm/share/amd_smi
It will generate some warnings, but you can ignore them as long as the command completes successfully.
To test:
vllm --version
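You can also confirm that amdsmi now imports cleanly alongside PyTorch, which is exactly what the earlier patch relies on (amdsmi before torch):
python -c "import amdsmi, torch; print('amdsmi OK, torch', torch.__version__)"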
Install Flash Attention
cd ..
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
git checkout main_perf
Patch setup.py with the same amdsmi workaround (again, keep the blank context line inside the diff):
echo "diff --git a/setup.py b/setup.py
index d54e93f6..f7d282df 100644
--- a/setup.py
+++ b/setup.py
@@ -19,6 +19,7 @@ import urllib.request
import urllib.error
from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
+import amdsmi
import torch
from torch.utils.cpp_extension import (
BuildExtension," | patch -p1
Build:
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
cd ..
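A quick way to confirm the Flash Attention build is to import the module with the Triton backend flag set (the installed package name is flash_attn):
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python -c "import flash_attn; print(flash_attn.__version__)"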
You can then run vLLM like this:
vllm serve cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit \
--dtype float16 --max-model-len 32768 \
--allowed-local-media-path / \
--limit-mm-per-prompt '{"image": 3, "video": 1}' \
-tp 1 --max-num-seqs 1 \
--port 8888 --host 0.0.0.0
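Once the server is up, you can exercise the OpenAI-compatible API it exposes, for example with curl (adjust the port and model name to match your command line):
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'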
Configure environment
Create a script ~/vllm/vllm_env.sh:
#!/bin/bash
source .venv/bin/activate
ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}')
export ROCM_PATH=${PWD}/rocm-$ROCM_VERSION
export LD_LIBRARY_PATH=$ROCM_PATH/lib
export DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
export HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
export PYTORCH_ROCM_ARCH="gfx1151"
Running vLLM
Before running vLLM you need to activate the virtual environment and set the environment variables (unless you’ve already done that in the current session). Just run:
cd ~/vllm
source vllm_env.sh
And then you can run vllm!
Known issues
- FP8 models don’t work yet. Use AWQ 8-bit quants instead.
- MXFP4 is also not supported yet (gpt-oss models).
- BF16 models work fine.
- Include --dtype float16 in your parameter list for better performance.
- If you experience HIP-related crashes, try limiting --max-num-seqs, e.g.:
vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 8888 --host 0.0.0.0 --max-model-len 32768 --max-num-seqs 10
Tested Models
- Qwen/Qwen3-VL-4B-Instruct
- cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit
- cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-8bit
- cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit - crashes without a --max-num-seqs limit, e.g.:
vllm serve cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 8888 --host 0.0.0.0 --max-model-len 32768 --max-num-seqs 10