Hey everyone, I wanted to share hipEngine (shisa-ai/hipEngine on GitHub), a new open source, from-scratch inference engine for RDNA3 GPUs. It is a pure AMD ROCm implementation: all hot-path code is HIP/C++, it uses only AMD libraries, and PyTorch is expressly not a dependency.
It’s been mostly tuned on a spare gfx1100 (W7900, 7900 XTX) GPU, but I did an initial pass for gfx1151 (Strix Halo) support, and it ends up being faster than llama.cpp (HIP or Vulkan) basically across the board for Qwen 3.6 35B-A3B. That is basically the only model supported atm, so you can think of hipEngine currently as closer to something like antirez’s DS4 than llama.cpp.
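If you’re running a llama.cpp Vulkan variant, you should expect about 10% faster decode/token generation, and prefill/prompt processing that is 1.5X faster at short context and >2X faster at 128K (much faster agentic/coding performance). Versus llama.cpp HIP, it’s up to 10% faster prefill/prompt processing (the gap grows as context gets longer) and up to 30% faster decode/token generation. The two exceptions in the numbers below are 512-token prefill and 128K decode, where llama.cpp HIP is slightly ahead.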
Prefill tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
| --- | --- | --- | --- |
| 512/128 | 983.206 | 1058.738 | 638.008 |
| 4K/128 | 1029.402 | 1004.220 | 595.400 |
| 32K/128 | 792.296 | 735.534 | 407.984 |
| 128K/128 | 413.489 | 376.070 | 181.453 |

Decode tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
| --- | --- | --- | --- |
| 512/128 | 62.060 | 50.537 | 57.615 |
| 4K/128 | 63.605 | 49.379 | 55.027 |
| 32K/128 | 50.629 | 43.435 | 44.576 |
| 128K/128 | 30.245 | 31.286 | 26.935 |

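
For anyone who wants to check the relative numbers themselves, here is a minimal sketch that turns the prefill and decode figures above into speedup ratios. The values are copied from this post; the names `speedups` and `report` are just illustrative helpers and are not part of hipEngine or llama.cpp.

```python
# Throughput figures (tok/s) copied from the prefill and decode numbers in this post.
WORKLOADS = ["512/128", "4K/128", "32K/128", "128K/128"]

PREFILL = {
    "hipEngine PARO": [983.206, 1029.402, 792.296, 413.489],
    "llama.cpp HIP": [1058.738, 1004.220, 735.534, 376.070],
    "llama.cpp Vulkan": [638.008, 595.400, 407.984, 181.453],
}
DECODE = {
    "hipEngine PARO": [62.060, 63.605, 50.629, 30.245],
    "llama.cpp HIP": [50.537, 49.379, 43.435, 31.286],
    "llama.cpp Vulkan": [57.615, 55.027, 44.576, 26.935],
}


def speedups(table, baseline, target="hipEngine PARO"):
    """Return target/baseline throughput ratios, one per workload."""
    return [t / b for t, b in zip(table[target], table[baseline])]


def report(name, table):
    """Print one line of per-workload ratios for each llama.cpp backend."""
    for baseline in ("llama.cpp HIP", "llama.cpp Vulkan"):
        ratios = speedups(table, baseline)
        cells = ", ".join(f"{w}: {r:.2f}x" for w, r in zip(WORKLOADS, ratios))
        print(f"{name} vs {baseline}: {cells}")


report("prefill", PREFILL)
report("decode", DECODE)
```

A ratio above 1.0 means hipEngine is ahead for that workload. The ratio drops below 1.0 in two places, both against llama.cpp HIP: 512/128 prefill and 128K/128 decode.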
I announced this on Reddit last week to get some testers/eyeballs/feedback, so it should be in good shape. If anyone tries it out, feel free to drop feedback or file an issue on GitHub if you run into any problems.
There has been almost no Strix Halo specific tuning. I have some dedicated test hardware coming soon from Framework though and expect that I can squeeze more out of gfx1151 with some dedicated compute time.
There is decently fast Qwen 3.6 dense support, although MTP/DFlash is still forthcoming, so you may be better off with llama.cpp for that.
I’m cooking up some new model support (StepFun 3.7, Gemma 4) and am open to requests.
This started off as a sort of thought experiment, but turned out to be worth sharing. I am working on c>1 perf now, and I will be porting some of my kvcache work, etc. in my spare time. There’s not really a roadmap or anything, since this is just for fun, but maybe it’ll be useful for some people!
The code is AGPLv3 (share-alike for reals), but I’ve also published a fair amount of docs/ that should be useful for anyone interested in RDNA3 GPU development. They include extensive details on the AI-assisted kernel optimization approach used.
I did try it, but MTP beat it in my local setup (with nim coding etc) by about 30%. It seemed faster for the default, but couldn’t handle MTP. Sadness.
MTP/DFlash are WIP atm, although it’s more of a grind than expected (verification is a bottleneck). Also, c>1 and a few other things (StepFun 3.7) are grinding away.
I’ll do a proper update soon now that I’m back from some out of town travel but a few good things:
I have a dedicated gfx1151 board courtesy of Framework now to do ongoing kernel grinding specifically for Strix Halo. Expect to see some big performance gains soon.
Concurrency, while not amazing, is also now running. I expect gfx1151 to benefit more than gfx1100. llama.cpp really falls apart at c>1, so this should be a durable improvement for anyone looking for fast multiuser/agent support (see shisa-ai/hipEngine on GitHub).
Standard bare motherboard, I’ve just plugged it in, so I’ll have this running in the corner and just grinding.
BTW @Thomas_Munn I just finished my MTP optimization pass. Beating plain autoregressive (AR) decode is very hard for the Qwen 3.5 MoE (combinations of experts, attention, linear layers), but it is outperforming llama.cpp now, so I assume that once I dig into gfx1151, there will be a benefit.
llama.cpp’s c=2 is great actually, but it dies above that - at c=8, llama.cpp’s aggregate throughput is basically equal to hipEngine’s per-sequence performance (hipEngine is 7.2X faster total throughput).
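
To make the c=8 comparison concrete, here is a tiny sketch of the arithmetic. It assumes that "7.2X" means hipEngine's aggregate throughput is 7.2 times llama.cpp's aggregate, and that aggregate throughput is split evenly across the 8 sequences. The units are normalized (llama.cpp aggregate = 1.0) because no absolute c=8 numbers are given here, and `per_sequence` is a hypothetical helper, not a hipEngine API.

```python
def per_sequence(aggregate_tok_s, concurrency):
    """Average per-sequence throughput, assuming an even split across sequences."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return aggregate_tok_s / concurrency


# Normalized units: llama.cpp aggregate throughput at c=8 is defined as 1.0.
LLAMA_AGGREGATE = 1.0
# Assumption: the quoted 7.2X figure is a ratio of aggregate throughputs.
HIPENGINE_AGGREGATE = 7.2 * LLAMA_AGGREGATE
CONCURRENCY = 8

hip_per_seq = per_sequence(HIPENGINE_AGGREGATE, CONCURRENCY)
print(f"hipEngine per-sequence: {hip_per_seq:.2f} (llama.cpp aggregate = 1.00)")
```

Under those assumptions each hipEngine sequence runs at about 0.9 of llama.cpp's entire aggregate, which is what "basically equal" is getting at.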