vLLM on Framework Desktop + 4-month AI Performance Updates

I’ve released a new video on running vLLM on Strix Halo (Ryzen AI MAX 395). It also covers a 4-month progress update on software support and general AI performance for the platform.


Thanks for all your great work man!!!

Amazing work! I spent way too long with ChatGPT trying to get a Podman container of vLLM working on my box, which runs the Framework Bazzite image as its OS.

bazzite:stable 
Bazzite
Linux 6.17.7-ba20.fc43.x86_64
$ sudo rpm-ostree kargs --append=ttm.pages_limit=27648000 --append=ttm.page_pool_size=27648000 --append=amd_iommu=off

I went a bit conservative there, as I wanted Wayland to have some headroom too.
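For anyone tuning these values, a quick sanity check helps: ttm.pages_limit counts 4 KiB pages, so the number above works out to roughly 105 GiB. (The 128 GiB total below is an assumption based on the common Strix Halo configuration; adjust for your machine.)

```shell
# ttm.pages_limit is in 4 KiB pages: 27648000 pages ~= 105 GiB of
# GTT-addressable memory, leaving headroom out of an assumed 128 GiB total.
echo $(( 27648000 * 4096 / 1024 / 1024 / 1024 ))   # prints 105

# After rebooting, confirm the kernel arguments actually took effect:
cat /proc/cmdline
```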

Manually pulling key bits from your repo into a script for Podman was what finally got it up and running. Hopefully this is useful for others:

podman run --rm \
  --name kyuz0-vllm-sysadmin \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  --security-opt label=disable \
  -v /var/mnt/mdata/podman/vllm:/model:Z \
  -p 8000:8000 \
  --ipc=host \
  -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" \
  -e VLLM_TARGET_DEVICE=rocm \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_USE_MMAP=0 \
  -e VLLM_ROCM_USE_MMAP_FOR_TRITON=0 \
  -e ROCM_ENABLE_COMGR_DEBUG=0 \
  -e ROCM_DISABLE_LRZ_KERNEL=1 \
  -e VLLM_ROCM_USE_AITER=0 \
  -e VLLM_ROCM_USE_AITER_MOE=0 \
  -e VLLM_USE_TRITON_AWQ=1 \
  docker.io/kyuz0/vllm-therock-gfx1151:latest \
  vllm serve openai/gpt-oss-120b \
      --host 0.0.0.0 \
      --port 8000 \
      --download-dir /model \
      --gpu-memory-utilization 0.95 \
      --max-model-len 131072 \
      --max-num-seqs 1 \
      --dtype=auto \
      --trust-remote-code \
      --tensor-parallel-size=1 
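Once the container is up, a quick smoke test against the OpenAI-compatible endpoint confirms it's serving (this assumes curl is installed and the server is reachable on localhost:8000):

```shell
# List the loaded models:
curl -s http://localhost:8000/v1/models

# Send a minimal chat completion request:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```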

Also, if I have missed anything or screwed any of it up, please yell out!

It’s currently being served to Void IDE running on the same desktop. Not especially performant, but I think these results are from before I turned the Triton mmap off (VLLM_ROCM_USE_MMAP_FOR_TRITON=0 was a late addition):

(APIServer pid=1) INFO 12-28 08:15:48 [loggers.py:257] Engine 000: Avg prompt throughput: 153.4 tokens/s, Avg generation throughput: 2.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     x.x.x.x:35972 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     x.x.x.x:35976 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO 12-28 08:15:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     x.x.x.x:44864 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     x.x.x.x:44870 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO 12-28 08:16:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     x.x.x.x:57316 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     x.x.x.x:56850 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     x.x.x.x:56858 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO 12-28 08:16:18 [loggers.py:257] Engine 000: Avg prompt throughput: 160.3 tokens/s, Avg generation throughput: 7.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 29.1%
(APIServer pid=1) INFO:     x.x.x.x:45090 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     x.x.x.x:45106 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO 12-28 08:16:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 29.1%
(APIServer pid=1) INFO:     x.x.x.x:35522 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     x.x.x.x:35530 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO 12-28 08:16:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 29.1%
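To put the log averages above in perspective (using 500 tokens as an arbitrary example reply length):

```shell
# At ~8 tokens/s generation, a 500-token reply takes on the order of a
# minute, ignoring prefill time (~160 tokens/s in the logs above).
echo $(( 500 / 8 ))   # prints 62 (seconds)
```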

Nice video, love the technical details and deep dives. Definitely what I was missing in more consumer-oriented reviews!

As a side note, it would be really interesting to see how the Burn framework performs on Strix Halo, since it supports ROCm as a backend and, notably, can JIT compile and autotune kernels to fit the architecture — not just for hot spots, but for the whole data pipeline. They have even reported state-of-the-art matmul performance for certain workloads.

It’s still a relatively young project, so it’s not immediately obvious how to make use of it, but I know people are already implementing LLMs in Burn, not to mention the Burn-LM project from its authors.