AMD Strix Halo (Ryzen AI Max+ 395) GPU LLM Performance Tests

On and off over the past couple of months I’ve been doing ML/AI/LLM testing on Strix Halo (Ryzen AI Max+ 395), specifically with the gfx1151 GPU, in my spare time.

I was given the go-ahead a while back to mention that this has been on pre-production Framework Desktop hardware. Since I only just recently finished this latest set of sweeps, and since release is, I assume, getting pretty close (nope, I don’t know anything, don’t ask, lol), I figured I’d share the most detailed/definitive LLM inference performance testing that’s been done on the Ryzen AI Max so far.

A few things that differentiate this from any prior testing I’ve seen:

  • Run on the latest software - Linux 6.15.5+ w/ the latest linux-firmware, BIOS/EC, TheRock ROCm nightly releases w/ gfx1151-targeted kernels, and recent llama.cpp builds compiled directly from source
  • Testing of multiple backends and flags including HIP w/ rocBLAS and hipBLASLt, Vulkan w/ 2^n batching tests for MoEs, multiple MoE and dense model architectures and quants
  • Full sweeps of pp (compute bound), tg (memory-bandwidth bound), and memory usage (w/ and w/o FA)

These tests all use llama-bench, so they are repeatable and statistically meaningful (the default of 5 runs per data point).
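As a rough illustration, here’s a minimal sketch of how a sweep like this can be driven. This is not my actual harness: the per-backend binary names and the model path are placeholders, and the real sweeps cover more flag combinations.

```python
# Minimal sweep driver sketch. Assumes llama-bench was built separately per backend
# and the binaries are on PATH under the (hypothetical) names below, and that MODEL
# points at a local GGUF file.
import itertools
import subprocess

MODEL = "/models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf"        # placeholder path

backends = ["llama-bench-hip", "llama-bench-vulkan"]   # one build per backend (hypothetical names)
flash_attn = ["0", "1"]                                # with and without FA
batch_sizes = [2**n for n in range(5, 10)]             # 32..512, the 2^n batch sweep for MoEs

for binary, fa, b in itertools.product(backends, flash_attn, batch_sizes):
    cmd = [
        binary,
        "-m", MODEL,
        "-p", "512",        # pp512 (prompt processing)
        "-n", "128",        # tg128 (token generation)
        "-fa", fa,
        "-b", str(b),
        "-r", "5",          # llama-bench's default 5 repetitions, stated explicitly
        "-o", "md",         # markdown output for easy table building
    ]
    print(">>>", " ".join(cmd))
    subprocess.run(cmd, check=True)
```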

For those interested in specific models, the raw data and individual sweeps for each model are available (in chart and graph form) here: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench

For everyone else, here are the topline results:

Strix Halo LLM Benchmark Results

All testing was done on pre-production Framework Desktop systems with an AMD Ryzen AI Max+ 395 (Strix Halo) / 128GB LPDDR5x-8000 configuration. (Thanks Nirav, Alexandru, and co!)

Exact testing/system details are in the results folders, but roughly these are running:

  • Close to production BIOS/EC
  • Relatively up-to-date kernels: 6.15.5-arch1-1/6.15.6-arch1-1
  • Recent TheRock/ROCm-7.0 nightly builds with Strix Halo (gfx1151) kernels
  • Recent llama.cpp builds (e.g. b5863 from 2025-07-10)

Just to get a ballpark on the hardware:

  • ~215 GB/s max GPU MBW out of a 256 GB/s theoretical (256-bit 8000 MT/s)
  • theoretical 59 FP16 TFLOPS (VOPD/WMMA) on RDNA 3.5 (gfx11); effective throughput is much lower (rough math sketched below)
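The rough math behind those two numbers is below. The CU count, boost clock, and per-CU FP16 rate are my assumptions about the gfx1151 iGPU, not measured values; they just reproduce the ballpark figures above.

```python
# Back-of-the-envelope for the two ballpark figures above.
bus_width_bits = 256          # 256-bit LPDDR5x bus
transfer_rate_mts = 8000      # LPDDR5x-8000
mbw_gbs = bus_width_bits / 8 * transfer_rate_mts / 1000
print(f"theoretical memory bandwidth: {mbw_gbs:.0f} GB/s")    # -> 256 GB/s

cus = 40                      # assumed CU count for the gfx1151 iGPU
clock_ghz = 2.9               # assumed boost clock
fp16_per_cu_per_clk = 512     # assumed FP16 FLOPS/CU/clock via VOPD/WMMA
tflops = cus * fp16_per_cu_per_clk * clock_ghz / 1000
print(f"theoretical FP16 compute: {tflops:.0f} TFLOPS")       # -> ~59 TFLOPS
```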

Results

Prompt Processing (pp) Performance

| Model Name | Architecture | Weights (B) | Active (B) | Backend | Flags | pp512 | tg128 | Memory (Max MiB) |
|---|---|---|---|---|---|---|---|---|
| Llama 2 7B Q4_0 | Llama 2 | 7 | 7 | Vulkan | | 998.0 | 46.5 | 4237 |
| Llama 2 7B Q4_K_M | Llama 2 | 7 | 7 | HIP | hipBLASLt | 906.1 | 40.8 | 4720 |
| Shisa V2 8B i1-Q4_K_M | Llama 3 | 8 | 8 | HIP | hipBLASLt | 878.2 | 37.2 | 5308 |
| Qwen 3 30B-A3B UD-Q4_K_XL | Qwen 3 MoE | 30 | 3 | Vulkan | fa=1 | 604.8 | 66.3 | 17527 |
| Mistral Small 3.1 UD-Q4_K_XL | Mistral 3 | 24 | 24 | HIP | hipBLASLt | 316.9 | 13.6 | 14638 |
| Hunyuan-A13B UD-Q6_K_XL | Hunyuan MoE | 80 | 13 | Vulkan | fa=1 | 270.5 | 17.1 | 68785 |
| Llama 4 Scout UD-Q4_K_XL | Llama 4 MoE | 109 | 17 | HIP | hipBLASLt | 264.1 | 17.2 | 59720 |
| Qwen 3 32B Q8_0 | Qwen 3 | 32 | 32 | HIP | hipBLASLt | 226.1 | 6.4 | 33683 |
| Shisa V2 70B i1-Q4_K_M | Llama 3 | 70 | 70 | HIP | rocWMMA | 94.7 | 4.5 | 41522 |
| dots1 UD-Q4_K_XL | dots1 MoE | 142 | 14 | Vulkan | fa=1 b=256 | 63.1 | 20.6 | 84077 |

Text Generation (tg) Performance

| Model Name | Architecture | Weights (B) | Active (B) | Backend | Flags | pp512 | tg128 | Memory (Max MiB) |
|---|---|---|---|---|---|---|---|---|
| Qwen 3 30B-A3B UD-Q4_K_XL | Qwen 3 MoE | 30 | 3 | Vulkan | b=256 | 591.1 | 72.0 | 17377 |
| Llama 2 7B Q4_K_M | Llama 2 | 7 | 7 | Vulkan | fa=1 | 620.9 | 47.9 | 4463 |
| Llama 2 7B Q4_0 | Llama 2 | 7 | 7 | Vulkan | fa=1 | 1014.1 | 45.8 | 4219 |
| Shisa V2 8B i1-Q4_K_M | Llama 3 | 8 | 8 | Vulkan | fa=1 | 614.2 | 42.0 | 5333 |
| dots1 UD-Q4_K_XL | dots1 MoE | 142 | 14 | Vulkan | fa=1 b=256 | 63.1 | 20.6 | 84077 |
| Llama 4 Scout UD-Q4_K_XL | Llama 4 MoE | 109 | 17 | Vulkan | fa=1 b=256 | 146.1 | 19.3 | 59917 |
| Hunyuan-A13B UD-Q6_K_XL | Hunyuan MoE | 80 | 13 | Vulkan | fa=1 b=256 | 223.9 | 17.1 | 68608 |
| Mistral Small 3.1 UD-Q4_K_XL | Mistral 3 | 24 | 24 | Vulkan | fa=1 | 119.6 | 14.3 | 14540 |
| Qwen 3 32B Q8_0 | Qwen 3 | 32 | 32 | Vulkan | fa=1 | 101.8 | 6.4 | 33886 |
| Shisa V2 70B i1-Q4_K_M | Llama 3 | 70 | 70 | Vulkan | fa=1 | 26.4 | 5.0 | 41456 |
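Since tg is memory-bandwidth bound, you can sanity-check these numbers by multiplying tokens/s by the bytes of weights read per token. Using the GGUF file size as a proxy for bytes/token is a simplification (it ignores the KV cache and any overlap), and the ~3.6 GB figure below is an assumed file size, but it lands in the right neighborhood of the ~215 GB/s measured peak:

```python
# Rough effective-bandwidth estimate from a tg result (simplified: weights only).
def effective_bandwidth_gbs(tg_tokens_per_s: float, weights_gb: float) -> float:
    # Every generated token reads (roughly) the full set of active weights once.
    return tg_tokens_per_s * weights_gb

# Llama 2 7B Q4_0: ~3.6 GB of weights (assumed), 45.8 t/s tg128 from the table above.
print(f"{effective_bandwidth_gbs(45.8, 3.6):.0f} GB/s")  # ~165 GB/s of ~215 GB/s peak
```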

Testing Notes

The best overall backend and flags were chosen for each model family tested. You can see that oftentimes the best backend for prefill vs. token generation differs. Full results for each model (including pp/tg graphs across context lengths for all tested backend variations) are available in their respective folders, since which backend performs best will depend on your exact use case.

There’s still a lot of performance left on the table, especially for pp. Since these results should be close to optimal as of when they were tested, I might add dates to the table (adding kernel, ROCm, and llama.cpp build numbers might be a bit much).

For additional discussion/reference:


I’m curious about this comment, @lhl. Can you expand a little more, please?

And, thank you! A great post and wonderfully helpful information. Much appreciated!

You can see from my mamf-finder and hgemm results that perf can be extremely low for different shapes: Strix Halo

But actually, if you take a look at this issue I’ve filed: [Issue]: gfx1151 rocBLAS/hipBLAS performance regression vs gfx1100 code path · Issue #4748 · ROCm/ROCm · GitHub

In some tests you can see that using the gfx1100 kernels can be many times faster than the gfx1151 kernels:

  • gfx1100 rocBLAS has 2.5-6X the performance of gfx1151 rocBLAS
  • gfx1100 rocBLAS is 1.5-3X faster than gfx1151 hipBLASLt

That’s free real estate.
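If you want to try the gfx1100 code path yourself, one approach is to re-run the same benchmark with HSA_OVERRIDE_GFX_VERSION set so ROCm loads the gfx1100 kernels. Whether that's representative on your particular stack is an assumption on my part; the actual measured numbers are in the linked issue, and the model path below is a placeholder.

```python
# Sketch: run the same llama-bench command twice, once on the native gfx1151 path
# and once with HSA_OVERRIDE_GFX_VERSION forcing the gfx1100 kernels, then compare.
import os
import subprocess

cmd = ["llama-bench", "-m", "/models/llama-2-7b.Q4_0.gguf", "-r", "5"]  # placeholder model path

subprocess.run(cmd, check=True)                            # native gfx1151 code path

env = dict(os.environ, HSA_OVERRIDE_GFX_VERSION="11.0.0")  # pretend to be gfx1100
subprocess.run(cmd, env=env, check=True)                   # forced gfx1100 code path
```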


That’s very helpful. Thanks again @lhl
