On and off over the past couple months I’ve been doing ML/AI/LLM testing on Strix Halo (Ryzen AI Max+ 395), specifically with the gfx1151 GPU, in my spare time.
I was given the go-ahead a while back to mention that this has been on pre-production Framework Desktop hardware. Since I only just finished this latest set of sweeps, and release is presumably getting pretty close (nope, I don’t know anything, don’t ask, lol), I figured I’d share the most detailed/definitive LLM inference performance testing that’s been done on the Ryzen AI Max so far.
A few things differentiate this from any prior testing I’ve seen:
- Run on the latest software: Linux 6.15.5+ w/ the latest linux-firmware, BIOS/EC, TheRock ROCm nightly releases w/ gfx1151-targeted kernels, and recent llama.cpp builds compiled directly from source (rough build sketch after this list)
- Testing of multiple backends and flags (HIP w/ rocBLAS and hipBLASLt, Vulkan w/ 2^n batch-size tests for MoEs) across multiple MoE and dense model architectures and quants
- Full sweeps of pp (compute bound), tg (memory-bandwidth bound), and memory usage (w/ and w/o FA)
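For anyone wanting to replicate the setup, here’s a minimal build sketch. This is an assumption of the usual llama.cpp CMake options rather than the exact configuration used for these runs, and it assumes a TheRock/ROCm 7.0 nightly with gfx1151 kernels is already on the system:

```bash
# Hedged sketch, not the exact build commands used for these sweeps.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# HIP backend targeting gfx1151 (rocBLAS GEMMs; hipBLASLt can be toggled at runtime)
cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip --config Release -j

# Vulkan backend (used for most of the MoE batch-size sweeps)
cmake -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan --config Release -j
```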
These tests use llama-bench so that they are repeatable and statistically valid (default 5 runs).
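As a rough sketch, an invocation along these lines produces the pp512/tg128 numbers reported below (the model path and batch-size list are placeholders; the full sweeps in the repo cover more flag combinations and context depths, and the hipBLASLt env var is my assumption of the usual rocBLAS toggle rather than something taken from the test logs):

```bash
# pp512/tg128 measurement; 5 repetitions is llama-bench's default (-r 5).
# -fa 0,1 covers the with/without FlashAttention cases, -b sweeps 2^n batch sizes for MoEs.
./build-vulkan/bin/llama-bench \
    -m /path/to/model.gguf \
    -fa 0,1 \
    -b 256,512,1024,2048 \
    -p 512 -n 128 -r 5

# HIP build, optionally asking rocBLAS to dispatch GEMMs to hipBLASLt
# (env var name is an assumption; check your ROCm version's rocBLAS docs):
ROCBLAS_USE_HIPBLASLT=1 ./build-hip/bin/llama-bench -m /path/to/model.gguf -fa 0,1 -p 512 -n 128
```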
For those interested in specific models, the raw data and individual sweeps for each model are available here (in chart and graph form): https://github.com/lhl/strix-halo-testing/tree/main/llm-bench
For everyone else, here are the topline results:
Strix Halo LLM Benchmark Results
All testing was done on pre-production Framework Desktop systems with an AMD Ryzen AI Max+ 395 (Strix Halo)/128GB LPDDR5x-8000 configuration. (Thanks Nirav, Alexandru, and co!)
Exact testing/system details are in the results folders, but roughly these are running:
- Close to production BIOS/EC
- Relatively up-to-date kernels: 6.15.5-arch1-1/6.15.6-arch1-1
- Recent TheRock/ROCm-7.0 nightly builds with Strix Halo (gfx1151) kernels
- Recent llama.cpp builds (e.g. b5863 from 2025-07-10)
Just to get a ballpark on the hardware:
- ~215 GB/s max GPU MBW out of a 256 GB/s theoretical (256-bit 8000 MT/s)
- theoretical 59 FP16 TFLOPS (VOPD/WMMA) on RDNA 3.5 (gfx11); effective throughput is much lower (back-of-envelope math below)
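The arithmetic behind those two numbers, assuming the iGPU’s 40 RDNA 3.5 CUs at a ~2.9 GHz boost clock (those specifics are my assumption, not pulled from the test logs):

```bash
# 256-bit bus * 8000 MT/s / 8 bits-per-byte = theoretical memory bandwidth in GB/s
echo "256 * 8000 / 8 / 1000" | bc -l          # -> 256 GB/s

# 40 CU * 64 lanes * 2 (dual-issue VOPD) * 2 (packed FP16) * 2 (FMA) * 2.9 GHz
echo "40 * 64 * 2 * 2 * 2 * 2.9 / 1000" | bc -l   # -> ~59.4 FP16 TFLOPS peak
```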
Results
Prompt Processing (pp) Performance
Model Name | Architecture | Weights (B) | Active (B) | Backend | Flags | pp512 (t/s) | tg128 (t/s) | Memory (Max MiB) |
---|---|---|---|---|---|---|---|---|
Llama 2 7B Q4_0 | Llama 2 | 7 | 7 | Vulkan | | 998.0 | 46.5 | 4237
Llama 2 7B Q4_K_M | Llama 2 | 7 | 7 | HIP | hipBLASLt | 906.1 | 40.8 | 4720 |
Shisa V2 8B i1-Q4_K_M | Llama 3 | 8 | 8 | HIP | hipBLASLt | 878.2 | 37.2 | 5308 |
Qwen 3 30B-A3B UD-Q4_K_XL | Qwen 3 MoE | 30 | 3 | Vulkan | fa=1 | 604.8 | 66.3 | 17527 |
Mistral Small 3.1 UD-Q4_K_XL | Mistral 3 | 24 | 24 | HIP | hipBLASLt | 316.9 | 13.6 | 14638 |
Hunyuan-A13B UD-Q6_K_XL | Hunyuan MoE | 80 | 13 | Vulkan | fa=1 | 270.5 | 17.1 | 68785 |
Llama 4 Scout UD-Q4_K_XL | Llama 4 MoE | 109 | 17 | HIP | hipBLASLt | 264.1 | 17.2 | 59720 |
Qwen 3 32B Q8_0 | Qwen 3 | 32 | 32 | HIP | hipBLASLt | 226.1 | 6.4 | 33683 |
Shisa V2 70B i1-Q4_K_M | Llama 3 | 70 | 70 | HIP | rocWMMA | 94.7 | 4.5 | 41522
dots1 UD-Q4_K_XL | dots1 MoE | 142 | 14 | Vulkan | fa=1 b=256 | 63.1 | 20.6 | 84077 |
Text Generation (tg) Performance
Model Name | Architecture | Weights (B) | Active (B) | Backend | Flags | pp512 (t/s) | tg128 (t/s) | Memory (Max MiB) |
---|---|---|---|---|---|---|---|---|
Qwen 3 30B-A3B UD-Q4_K_XL | Qwen 3 MoE | 30 | 3 | Vulkan | b=256 | 591.1 | 72.0 | 17377 |
Llama 2 7B Q4_K_M | Llama 2 | 7 | 7 | Vulkan | fa=1 | 620.9 | 47.9 | 4463 |
Llama 2 7B Q4_0 | Llama 2 | 7 | 7 | Vulkan | fa=1 | 1014.1 | 45.8 | 4219 |
Shisa V2 8B i1-Q4_K_M | Llama 3 | 8 | 8 | Vulkan | fa=1 | 614.2 | 42.0 | 5333 |
dots1 UD-Q4_K_XL | dots1 MoE | 142 | 14 | Vulkan | fa=1 b=256 | 63.1 | 20.6 | 84077 |
Llama 4 Scout UD-Q4_K_XL | Llama 4 MoE | 109 | 17 | Vulkan | fa=1 b=256 | 146.1 | 19.3 | 59917 |
Hunyuan-A13B UD-Q6_K_XL | Hunyuan MoE | 80 | 13 | Vulkan | fa=1 b=256 | 223.9 | 17.1 | 68608 |
Mistral Small 3.1 UD-Q4_K_XL | Mistral 3 | 24 | 24 | Vulkan | fa=1 | 119.6 | 14.3 | 14540 |
Qwen 3 32B Q8_0 | Qwen 3 | 32 | 32 | Vulkan | fa=1 | 101.8 | 6.4 | 33886 |
Shisa V2 70B i1-Q4_K_M | Llama 3 | 70 | 70 | Vulkan | fa=1 | 26.4 | 5.0 | 41456 |
Testing Notes
The best overall backend and flags were chosen for each model family tested. You can see that the best backend for prefill vs token generation often differs. Full results for each model (including pp/tg graphs across context lengths for all tested backend variations) are available for review in their respective folders, since which backend performs best will depend on your exact use case.
There’s still a lot of performance on the table, especially for pp. These results should be close to optimal as of when they were tested, so I might add dates to the tables (adding kernel, ROCm, and llama.cpp build #'s might be a bit much).
For additional discussion/reference:
- Posted for discussion on r/LocalLlama
- Also, initial testing I did a while back (perf has actually improved a fair amount since then).
- Also WIP docs and other AI/ML notes for Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo