hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (including Strix Halo)

Hey everyone, I wanted to share shisa-ai/hipEngine on GitHub, a new open source, from-scratch inference engine for RDNA3 GPUs. It is a pure AMD ROCm implementation: all hot-path code is HIP/C++, it uses only AMD libraries, and PyTorch is expressly not a dependency.

It’s been mostly tuned on a spare gfx1100 (W7900, 7900 XTX) GPU, but I did an initial pass for gfx1151 (Strix Halo) support, and it ends up being faster than llama.cpp (HIP or Vulkan) nearly across the board for Qwen 3.6 35B-A3B. That is basically the only model supported atm, so you can think of hipEngine currently as closer to something like antirez’s DS4 than llama.cpp.

If you’re running a llama.cpp Vulkan variant, you should expect about 10% faster decode/token generation and roughly 1.5X to 2.3X faster prefill/prompt processing in the runs below, with the gap widening as context grows (much faster agentic/coding performance). Versus llama.cpp HIP, it’s up to 10% faster prefill/prompt processing (faster as context gets longer, though slightly slower at 512 tokens), and up to 30% faster decode/token generation (slightly slower at 128K).

Prefill tok/s

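| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|----------|----------------|---------------|------------------|
| 512/128  | 983.206        | 1058.738      | 638.008          |
| 4K/128   | 1029.402       | 1004.220      | 595.400          |
| 32K/128  | 792.296        | 735.534       | 407.984          |
| 128K/128 | 413.489        | 376.070       | 181.453          |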

Decode tok/s

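| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|----------|----------------|---------------|------------------|
| 512/128  | 62.060         | 50.537        | 57.615           |
| 4K/128   | 63.605         | 49.379        | 55.027           |
| 32K/128  | 50.629         | 43.435        | 44.576           |
| 128K/128 | 30.245         | 31.286        | 26.935           |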
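For anyone who wants to check the headline percentages against the raw numbers, here is a minimal sketch that turns the two tables above into hipEngine / llama.cpp ratios. The figures are copied from the tables; the dictionary layout and the `speedups` helper are just illustrative names, not part of hipEngine.

```python
# Throughput figures (tok/s) copied from the prefill and decode tables above.
WORKLOADS = ["512/128", "4K/128", "32K/128", "128K/128"]

PREFILL = {
    "hipEngine PARO": [983.206, 1029.402, 792.296, 413.489],
    "llama.cpp HIP": [1058.738, 1004.220, 735.534, 376.070],
    "llama.cpp Vulkan": [638.008, 595.400, 407.984, 181.453],
}
DECODE = {
    "hipEngine PARO": [62.060, 63.605, 50.629, 30.245],
    "llama.cpp HIP": [50.537, 49.379, 43.435, 31.286],
    "llama.cpp Vulkan": [57.615, 55.027, 44.576, 26.935],
}


def speedups(table, baseline):
    """Return hipEngine / baseline throughput ratios, one per workload."""
    return [h / b for h, b in zip(table["hipEngine PARO"], table[baseline])]


if __name__ == "__main__":
    for phase, table in (("prefill", PREFILL), ("decode", DECODE)):
        for baseline in ("llama.cpp HIP", "llama.cpp Vulkan"):
            ratios = speedups(table, baseline)
            cells = ", ".join(
                f"{w}: {r:.2f}x" for w, r in zip(WORKLOADS, ratios)
            )
            print(f"{phase} vs {baseline}: {cells}")
```

A ratio above 1.00x means hipEngine is faster for that workload; a ratio below 1.00x means the llama.cpp backend is ahead.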

I announced this on Reddit last week to get some testers/eyeballs/feedback, so it should be in good shape. If anyone tries it out, feel free to drop feedback or file an issue on GitHub if you run into any problems.

  • There has been almost no Strix Halo specific tuning. I have some dedicated test hardware coming soon from Framework though and expect that I can squeeze more out of gfx1151 with some dedicated compute time.
  • There is decently fast Qwen 3.6 dense support, although MTP/DFlash is still forthcoming, so you may be better off with llama.cpp for that.
  • I’m cooking up some new model support (StepFun 3.7, Gemma 4) and am open to requests.
  • This started off as a sort of thought experiment, but turned out to be worth sharing. I am working on c>1 perf now, and I will be porting some of my kvcache work, etc in my spare time. There’s not really a roadmap or anything, this is just for fun, but maybe it’ll be useful for some people!
  • The code is AGPLv3 (share-alike for reals), but I’ve also published a fair amount of docs/ that should be useful for anyone interested in RDNA3 GPU development, including extensive details on the AI-assisted kernel optimization approach used.

I did try it, but MTP beat it in my local setup (with nim coding etc.) by about 30%. It seemed faster for the default, but couldn’t handle MTP. Sadness.

MTP/DFlash are WIP atm, although it’s more of a grind than expected (verification is a bottleneck). Also, c>1 and a few other things (StepFun 3.7) are grinding away.

I’ll do a proper update soon now that I’m back from some out of town travel but a few good things:

  • I have a dedicated gfx1151 board courtesy of Framework now to do ongoing kernel grinding specifically for Strix Halo. Expect to see some big performance gains soon.

  • MTP/DFlash is rapidly improving. On W7900/gfx11 we are now positive on 27B dense DFlash by 1.23X (32.6 tok/s → 40.1 tok/s); see hipEngine/benchmarks at main in shisa-ai/hipEngine on GitHub.

  • Concurrency, while not amazing, is also now running. I expect gfx1151 to benefit more than gfx1100. llama.cpp really falls apart at c>1, so this should be a durable improvement for anyone looking for fast multiuser/agent support: shisa-ai/hipEngine on GitHub.


A basic question: by “board” here, do you mean a standard Framework Desktop motherboard, or is this something that plugs into the PCI slot?

Standard bare motherboard, I’ve just plugged it in, so I’ll have this running in the corner and just grinding.

BTW @Thomas_Munn I just finished my MTP optimization pass. Beating plain autoregressive (AR) decode is very hard for the Qwen 3.5 MoE (combinations of experts, attention, linear layers), but hipEngine is outperforming llama.cpp now, so I assume that once I dig into the gfx1151, there will be a benefit.
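To give a feel for why beating AR is hard when verification is expensive, here is a rough back-of-the-envelope cost model. It is not how hipEngine measures anything: it assumes each drafted token is accepted independently with the same probability (the usual textbook simplification for speculative decoding), and `draft_cost` and `verify_overhead` are hypothetical knobs expressed in units of one plain AR decode step.

```python
def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verify step under i.i.d. acceptance.

    Counts the accepted draft prefix plus the one token the target model
    always contributes: sum of accept_rate**i for i in 0..draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be >= 0")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def speedup_vs_ar(accept_rate: float, draft_len: int,
                  draft_cost: float, verify_overhead: float) -> float:
    """Estimated speedup over plain AR decode (1.0 means break-even).

    Assumed cost of one speculative step, in AR-step units:
      1 (the verify pass itself)
      + draft_len * draft_cost       (producing the drafts)
      + draft_len * verify_overhead  (extra verify work per drafted token,
                                      e.g. more experts touched in an MoE)
    """
    step_cost = 1.0 + draft_len * (draft_cost + verify_overhead)
    return expected_tokens(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    # Illustrative inputs only, not measurements.
    for overhead in (0.0, 0.3, 0.6):
        for k in (1, 2, 3, 4):
            s = speedup_vs_ar(0.8, k, draft_cost=0.05, verify_overhead=overhead)
            print(f"overhead={overhead:.1f} draft_len={k} speedup={s:.2f}x")
```

In this toy model, a high acceptance rate only pays off while the per-token verify overhead stays small; once verifying each extra drafted token costs a large fraction of a full AR step, the estimate drops below 1.0x even with good drafts.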

Reproducible sweep script:

  hipEngine:scripts/llamacpp_vulkan_mtp_sweep.py

llama.cpp build:

  b9600 (263cc04a5)
  git describe: b9596-4-g263cc04a5

W7900 / Vulkan0, D32 prompt suite

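| engine / mode         | mean decode tok/s | vs llama base | MTP/AR speedup                         |
|-----------------------|-------------------|---------------|----------------------------------------|
| llama.cpp Vulkan base | 54.23             | 1.000x        | 1.000x                                 |
| llama.cpp Vulkan B1   | 43.20             | 0.797x        | 0.797x                                 |
| llama.cpp Vulkan B2   | 47.65             | 0.879x        | 0.879x                                 |
| llama.cpp Vulkan B3   | 50.21             | 0.926x        | 0.926x                                 |
| llama.cpp Vulkan B4   | 50.31             | 0.928x        | 0.928x                                 |
| hipEngine current B1  | 113.39            | 2.091x        | 1.023x prompt-mean, 1.014x total-time  |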

Best llama.cpp MTP: B4 at 50.31 tok/s.
hipEngine vs best llama.cpp MTP: about 2.25x.

Repro commands

llama.cpp Vulkan W7900:

  cd /home/lhl/hipEngine
  python3 scripts/llamacpp_vulkan_mtp_sweep.py \
    --gpu 0 \
    --max-tokens 32 \
    --draft-max-values 1,2,3,4 \
    --out-dir /tmp/llamacpp-mtp35-sweep-full-32

Also, here’s what c>1 looks like:

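| c | hipEngine agg tok/s | hipEngine per-seq | llama.cpp Vulkan agg tok/s | llama.cpp per-seq | hipEngine / llama.cpp |
|---|---------------------|-------------------|----------------------------|-------------------|-----------------------|
| 1 | 116.68              | 116.68            | 106.47                     | 106.47            | 1.10x                 |
| 2 | 113.45              | 56.73             | 159.19                     | 79.59             | 0.71x                 |
| 4 | 156.03              | 39.01             | 70.44                      | 17.61             | 2.21x                 |
| 8 | 188.69              | 23.59             | 26.26                      | 3.28              | 7.19x                 |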

llama.cpp’s c=2 is great actually, but it dies above that - at c=8, llama.cpp’s aggregate throughput is basically equal to hipEngine’s per-sequence performance (hipEngine is 7.2X faster total throughput).
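For reference, the derived columns in the concurrency table can be recomputed from the aggregate numbers alone. This is a small sketch using the figures above; the "scaling vs c=1" value is an extra metric I am adding for illustration (aggregate throughput divided by the same engine's single-stream throughput), not something reported by either engine.

```python
# (concurrency, hipEngine aggregate tok/s, llama.cpp Vulkan aggregate tok/s),
# copied from the c>1 table above.
ROWS = [
    (1, 116.68, 106.47),
    (2, 113.45, 159.19),
    (4, 156.03, 70.44),
    (8, 188.69, 26.26),
]


def per_sequence(aggregate: float, concurrency: int) -> float:
    """Average tok/s each concurrent sequence sees."""
    return aggregate / concurrency


def scaling_vs_single(aggregate: float, single_stream: float) -> float:
    """Aggregate throughput relative to the same engine at c=1."""
    return aggregate / single_stream


if __name__ == "__main__":
    hip_c1, llama_c1 = ROWS[0][1], ROWS[0][2]
    for c, hip_agg, llama_agg in ROWS:
        print(
            f"c={c}: "
            f"hipEngine {per_sequence(hip_agg, c):.2f} tok/s per seq "
            f"({scaling_vs_single(hip_agg, hip_c1):.2f}x of c=1), "
            f"llama.cpp {per_sequence(llama_agg, c):.2f} tok/s per seq "
            f"({scaling_vs_single(llama_agg, llama_c1):.2f}x of c=1), "
            f"ratio {hip_agg / llama_agg:.2f}x"
        )
```

A scaling value above 1.00x means adding concurrent sequences increased total throughput over a single stream; below 1.00x means total throughput got worse than running one request at a time.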
