hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (including Strix Halo)

lhl · May 29, 2026, 8:53pm

Hey everyone, I wanted to share shisa-ai/hipEngine (on GitHub), a new open source (from-scratch) inference engine for RDNA3 GPUs. It is a pure AMD ROCm implementation: all hot-path code is HIP/C++, it uses only AMD libraries, and PyTorch is expressly not a dependency.

It’s been mostly tuned on a spare gfx1100 (W7900, 7900 XTX) GPU, but I did an initial pass for gfx1151 (Strix Halo) support, and it ends up being faster than llama.cpp (HIP or Vulkan) almost across the board for Qwen 3.6 35B-A3B. That is basically the only model supported atm, so you can think of hipEngine currently as closer to something like antirez’s DS4 than llama.cpp.

If you’re running a llama.cpp Vulkan variant, you should expect roughly 8% to 16% faster decode/token generation and about 1.5X to 2.3X faster prefill/prompt processing, with the gap widening as context grows (much faster agentic/coding performance). Versus llama.cpp HIP, it’s up to 10% faster prefill/prompt processing (faster as context gets longer, though slightly slower at 512 tokens) and up to 30% faster decode/token generation (roughly even at 128K).

Prefill tok/s

Workload	hipEngine PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	983.206	1058.738	638.008
4K/128	1029.402	1004.220	595.400
32K/128	792.296	735.534	407.984
128K/128	413.489	376.070	181.453

Decode tok/s

Workload	hipEngine PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	62.060	50.537	57.615
4K/128	63.605	49.379	55.027
32K/128	50.629	43.435	44.576
128K/128	30.245	31.286	26.935
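
For anyone who wants to check the relative numbers against the two tables above, here is a minimal Python sketch. The throughput values are copied from the tables; the dictionary layout and the `speedup` helper are illustrative names only and are not part of hipEngine or llama.cpp.

```python
# Throughput in tok/s, copied from the prefill and decode tables above.
# Tuple order: (hipEngine PARO, llama.cpp HIP, llama.cpp Vulkan).
PREFILL = {
    "512/128": (983.206, 1058.738, 638.008),
    "4K/128": (1029.402, 1004.220, 595.400),
    "32K/128": (792.296, 735.534, 407.984),
    "128K/128": (413.489, 376.070, 181.453),
}
DECODE = {
    "512/128": (62.060, 50.537, 57.615),
    "4K/128": (63.605, 49.379, 55.027),
    "32K/128": (50.629, 43.435, 44.576),
    "128K/128": (30.245, 31.286, 26.935),
}


def speedup(table, workload, baseline):
    """hipEngine tok/s divided by the chosen llama.cpp backend's tok/s.

    `baseline` is "hip" or "vulkan" (hypothetical labels for this sketch).
    """
    hip_engine, llama_hip, llama_vulkan = table[workload]
    base = {"hip": llama_hip, "vulkan": llama_vulkan}[baseline]
    return hip_engine / base


if __name__ == "__main__":
    for name, table in (("prefill", PREFILL), ("decode", DECODE)):
        for workload in table:
            vs_hip = speedup(table, workload, "hip")
            vs_vk = speedup(table, workload, "vulkan")
            print(f"{name:8s} {workload:>9s}  vs HIP {vs_hip:.2f}x  vs Vulkan {vs_vk:.2f}x")
```

A ratio above 1.0 means hipEngine is faster for that row. The two rows where llama.cpp HIP stays ahead (512/128 prefill and 128K/128 decode) come out below 1.0.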

I announced this on Reddit last week to get some testers/eyeballs/feedback, so it should be in good shape. If anyone tries it out, feel free to drop feedback or file an issue on GitHub if you run into any problems.

There has been almost no Strix Halo specific tuning. I have some dedicated test hardware coming soon from Framework though and expect that I can squeeze more out of gfx1151 with some dedicated compute time.
There is decently fast Qwen 3.6 dense support, although MTP/DFlash is still forthcoming, so you may be better off with llama.cpp for that.
I’m cooking up some new model support (StepFun 3.7, Gemma 4) and am open to requests.
This started off as a sort of thought experiment, but turned out to be worth sharing. I am working on c>1 perf now, and I will be porting some of my kvcache work, etc in my spare time. There’s not really a roadmap or anything, this is just for fun, but maybe it’ll be useful for some people!
The code is AGPLv3 (share-alike for reals), but I’ve also published a fair amount of docs/ that should be useful for anyone interested in RDNA3 GPU development, including extensive details on the AI-assisted kernel optimization approach used.

Thomas_Munn · June 3, 2026, 10:43pm

I did try it, but MTP beat it in my local setup (with nim coding etc) by about 30%. It seemed faster for the default, but couldn’t handle MTP. Sadness.

lhl · June 6, 2026, 11:53pm

MTP/DFlash are WIP atm, although it’s more of a grind than expected (verification is a bottleneck). Also, c>1 and a few other things (StepFun 3.7) are grinding away.

lhl · June 11, 2026, 2:13am

I’ll do a proper update soon now that I’m back from some out of town travel but a few good things:

I have a dedicated gfx1151 board courtesy of Framework now to do ongoing kernel grinding specifically for Strix Halo. Expect to see some big performance gains soon.
MTP/DFlash is rapidly improving. On W7900/gfx11 we are now positive on 27B dense DFlash by 1.23X (32.6 tok/s → 40.1 tok/s); see hipEngine/benchmarks at main in shisa-ai/hipEngine on GitHub.
Concurrency, while not amazing, is also now running. I expect gfx1151 to benefit more than gfx1100. llama.cpp really falls apart at c>1, so this should be a durable improvement for anyone looking for fast multiuser/agent support; see shisa-ai/hipEngine on GitHub.

Squiggler · June 11, 2026, 7:05pm

A basic question: by “board” here, do you mean a standard Framework Desktop motherboard, or is this something that plugs into the PCI slot?

lhl · June 13, 2026, 12:03am

Standard bare motherboard. I’ve just plugged it in, so I’ll have it running in the corner, just grinding.

BTW @Thomas_Munn I just finished my MTP optimization pass. Beating plain autoregressive (AR) decode is very hard for the Qwen 3.5 MoE (combinations of experts, attention, linear layers), but hipEngine is outperforming llama.cpp now, so I assume that once I dig into gfx1151 there will be a benefit.

Reproducible sweep script:

  hipEngine:scripts/llamacpp_vulkan_mtp_sweep.py

llama.cpp build:

  b9600 (263cc04a5)
  git describe: b9596-4-g263cc04a5

W7900 / Vulkan0, D32 prompt suite

engine / mode	mean decode tok/s	vs llama base	MTP/AR speedup
llama.cpp Vulkan base	54.23	1.000x	1.000x
llama.cpp Vulkan B1	43.20	0.797x	0.797x
llama.cpp Vulkan B2	47.65	0.879x	0.879x
llama.cpp Vulkan B3	50.21	0.926x	0.926x
llama.cpp Vulkan B4	50.31	0.928x	0.928x
hipEngine current B1	113.39	2.091x	1.023x prompt-mean, 1.014x total-time

Best llama.cpp MTP: B4 at 50.31 tok/s.
hipEngine vs best llama.cpp MTP: about 2.25x.

Repro commands

llama.cpp Vulkan W7900:

  cd /home/lhl/hipEngine
  python3 scripts/llamacpp_vulkan_mtp_sweep.py \
    --gpu 0 \
    --max-tokens 32 \
    --draft-max-values 1,2,3,4 \
    --out-dir /tmp/llamacpp-mtp35-sweep-full-32

Also, here’s what c>1 looks like:

c	hipEngine agg tok/s	hipEngine per-seq	llama.cpp Vulkan agg tok/s	llama.cpp per-seq	hipEngine / llama.cpp
1	116.68	116.68	106.47	106.47	1.10x
2	113.45	56.73	159.19	79.59	0.71x
4	156.03	39.01	70.44	17.61	2.21x
8	188.69	23.59	26.26	3.28	7.19x

llama.cpp’s c=2 is great actually, but it dies above that - at c=8, llama.cpp’s aggregate throughput is basically equal to hipEngine’s per-sequence performance (hipEngine is 7.2X faster total throughput).
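
The per-sequence and ratio columns in the concurrency table follow directly from the aggregate numbers, and the same arithmetic shows how each engine scales relative to its own c=1 result. A minimal Python sketch, assuming only the aggregate values copied from the table; `per_sequence` and `scaling` are illustrative helper names, not anything from hipEngine or llama.cpp:

```python
# Aggregate decode tok/s per concurrency level c, copied from the table above.
# Tuple order: (hipEngine, llama.cpp Vulkan).
AGGREGATE = {
    1: (116.68, 106.47),
    2: (113.45, 159.19),
    4: (156.03, 70.44),
    8: (188.69, 26.26),
}


def per_sequence(aggregate, c):
    """Per-sequence tok/s: aggregate throughput split evenly over c sequences."""
    return aggregate / c


def scaling(engine_index, c):
    """Aggregate tok/s at concurrency c relative to the same engine at c=1."""
    return AGGREGATE[c][engine_index] / AGGREGATE[1][engine_index]


if __name__ == "__main__":
    for c, (hip_engine, llama) in AGGREGATE.items():
        print(
            f"c={c}: hipEngine {per_sequence(hip_engine, c):.2f}/seq, "
            f"llama.cpp {per_sequence(llama, c):.2f}/seq, "
            f"ratio {hip_engine / llama:.2f}x, "
            f"scaling vs c=1: {scaling(0, c):.2f}x / {scaling(1, c):.2f}x"
        )
```

Read this way, hipEngine’s aggregate throughput at c=8 is about 1.6x its own c=1 figure, while llama.cpp Vulkan drops to roughly a quarter of its c=1 figure.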
