AMD Strix Halo (Ryzen AI Max+ 395) GPU LLM Performance Tests

On and off over the past couple months I’ve been doing ML/AI/LLM testing on Strix Halo (Ryzen AI Max+ 395), specifically with the gfx1151 GPU, in my spare time.

I was given the go-ahead a while back to mention that this has been on pre-production Framework Desktop hardware. Since I only just recently finished this latest set of sweeps, and release is (I assume) getting pretty close (nope, I don’t know anything, don’t ask, lol), I figured I’d share the most detailed/definitive LLM inference performance testing that’s been done on the Ryzen AI Max so far.

A few things that differentiate this from any prior testing I’ve seen:

  • Run on the latest software - Linux 6.15.5+ w/ the latest linux-firmware, BIOS/EC, TheRock ROCm nightly releases w/ gfx1151 targeted kernels and also recent llama.cpp builds, built directly from source
  • Testing of multiple backends and flags, including HIP w/ rocBLAS and hipBLASLt, and Vulkan w/ 2^n batching tests for MoEs, across multiple MoE and dense model architectures and quants
  • Full sweeps of pp (compute bound), tg (memory-bandwidth bound), and memory usage (w/ and w/o FA)

These tests of course use llama-bench so that they are repeatable and statistically meaningful (the default of 5 runs per test).
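For reference, a single run in these sweeps looks roughly like the following (illustrative only — the model path is a placeholder, and the exact per-sweep flag combinations are in the linked repo):

    # One llama-bench data point: all layers on GPU, flash attention on,
    # default 5 repetitions, results emitted as a markdown table
    ./build/bin/llama-bench -m Llama-2-7B-Q4_0.gguf -ngl 99 -fa 1 -r 5 -o md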

For those interested in the details for specific models, the raw data and individual sweeps for each model are available here (in chart and graph form): https://github.com/lhl/strix-halo-testing/tree/main/llm-bench

For everyone else, here are the topline results:

Strix Halo LLM Benchmark Results

All testing was done on pre-production Framework Desktop systems with an AMD Ryzen AI Max+ 395 (Strix Halo)/128GB LPDDR5x-8000 configuration. (Thanks Nirav, Alexandru, and co!)

Exact testing/system details are in the results folders, but roughly these are running:

  • Close to production BIOS/EC
  • Relatively up-to-date kernels: 6.15.5-arch1-1/6.15.6-arch1-1
  • Recent TheRock/ROCm-7.0 nightly builds with Strix Halo (gfx1151) kernels
  • Recent llama.cpp builds (e.g., b5863 from 2025-07-10)

Just to get a ballpark on the hardware:

  • ~215 GB/s max GPU memory bandwidth (MBW) out of a theoretical 256 GB/s (256-bit bus at 8000 MT/s)
  • Theoretical 59 FP16 TFLOPS (VOPD/WMMA) on RDNA 3.5 (gfx11); effective throughput is much lower
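A quick sanity check on where those numbers come from (the 40 CU count and ~2.9 GHz boost clock for the Radeon 8060S are assumptions, not measurements from these tests):

    MBW:  256-bit bus = 32 bytes/transfer × 8000 MT/s ≈ 256 GB/s theoretical
    FP16: 40 CUs × 512 FP16 ops/clock/CU (packed FP16 FMA, dual-issue) × ~2.9 GHz ≈ 59 TFLOPS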

Results

Prompt Processing (pp) Performance

| Model Name | Architecture | Weights (B) | Active (B) | Backend | Flags | pp512 | tg128 | Memory (Max MiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B Q4_0 | Llama 2 | 7 | 7 | Vulkan | | 998.0 | 46.5 | 4237 |
| Llama 2 7B Q4_K_M | Llama 2 | 7 | 7 | HIP | hipBLASLt | 906.1 | 40.8 | 4720 |
| Shisa V2 8B i1-Q4_K_M | Llama 3 | 8 | 8 | HIP | hipBLASLt | 878.2 | 37.2 | 5308 |
| Qwen 3 30B-A3B UD-Q4_K_XL | Qwen 3 MoE | 30 | 3 | Vulkan | fa=1 | 604.8 | 66.3 | 17527 |
| Mistral Small 3.1 UD-Q4_K_XL | Mistral 3 | 24 | 24 | HIP | hipBLASLt | 316.9 | 13.6 | 14638 |
| Hunyuan-A13B UD-Q6_K_XL | Hunyuan MoE | 80 | 13 | Vulkan | fa=1 | 270.5 | 17.1 | 68785 |
| Llama 4 Scout UD-Q4_K_XL | Llama 4 MoE | 109 | 17 | HIP | hipBLASLt | 264.1 | 17.2 | 59720 |
| Qwen 3 32B Q8_0 | Qwen 3 | 32 | 32 | HIP | hipBLASLt | 226.1 | 6.4 | 33683 |
| Shisa V2 70B i1-Q4_K_M | Llama 3 | 70 | 70 | HIP | rocWMMA | 94.7 | 4.5 | 41522 |
| dots1 UD-Q4_K_XL | dots1 MoE | 142 | 14 | Vulkan | fa=1 b=256 | 63.1 | 20.6 | 84077 |

Text Generation (tg) Performance

| Model Name | Architecture | Weights (B) | Active (B) | Backend | Flags | pp512 | tg128 | Memory (Max MiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3 30B-A3B UD-Q4_K_XL | Qwen 3 MoE | 30 | 3 | Vulkan | b=256 | 591.1 | 72.0 | 17377 |
| Llama 2 7B Q4_K_M | Llama 2 | 7 | 7 | Vulkan | fa=1 | 620.9 | 47.9 | 4463 |
| Llama 2 7B Q4_0 | Llama 2 | 7 | 7 | Vulkan | fa=1 | 1014.1 | 45.8 | 4219 |
| Shisa V2 8B i1-Q4_K_M | Llama 3 | 8 | 8 | Vulkan | fa=1 | 614.2 | 42.0 | 5333 |
| dots1 UD-Q4_K_XL | dots1 MoE | 142 | 14 | Vulkan | fa=1 b=256 | 63.1 | 20.6 | 84077 |
| Llama 4 Scout UD-Q4_K_XL | Llama 4 MoE | 109 | 17 | Vulkan | fa=1 b=256 | 146.1 | 19.3 | 59917 |
| Hunyuan-A13B UD-Q6_K_XL | Hunyuan MoE | 80 | 13 | Vulkan | fa=1 b=256 | 223.9 | 17.1 | 68608 |
| Mistral Small 3.1 UD-Q4_K_XL | Mistral 3 | 24 | 24 | Vulkan | fa=1 | 119.6 | 14.3 | 14540 |
| Qwen 3 32B Q8_0 | Qwen 3 | 32 | 32 | Vulkan | fa=1 | 101.8 | 6.4 | 33886 |
| Shisa V2 70B i1-Q4_K_M | Llama 3 | 70 | 70 | Vulkan | fa=1 | 26.4 | 5.0 | 41456 |

Testing Notes

The best overall backend and flags were chosen for each model family tested. You can see that the best backend for prefill vs. token generation often differs. Full results for each model (including pp/tg graphs across context lengths for all tested backend variations) are available for review in their respective folders, since which backend performs best will depend on your exact use case.

There’s still a lot of performance on the table, especially for pp. Since these results should be close to optimal as of when they were tested, I might add dates to the table (adding kernel, ROCm, and llama.cpp build numbers might be a bit much).

For additional discussion/reference:

16 Likes

I’m curious about this comment, @lhl. Can you expand a little more, please?

And, thank you! A great post and wonderfully helpful information. Much appreciated!

You can see from my mamf-finder and hgemm results that perf can be extremely low for different shapes: Strix Halo

But actually, if you take a look at this issue I’ve filed: [Issue]: gfx1151 rocBLAS/hipBLAS performance regression vs gfx1100 code path · Issue #4748 · ROCm/ROCm · GitHub

In some tests you can see that using the gfx1100 kernels can be many times faster than the gfx1151 kernels:

gfx1100 rocBLAS has 2.5-6X the performance of gfx1151 rocBLAS
gfx1100 rocBLAS is 1.5-3X faster than gfx1151 hipBLASLt

That’s free real estate.
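One common way to experiment with the gfx1100 code path on gfx1151 is the HSA gfx version override; a minimal sketch (the model path is a placeholder, and stability/results aren’t guaranteed):

    # Make the ROCm runtime report the GPU as gfx1100 (11.0.0) so rocBLAS/hipBLASLt
    # load their gfx1100 kernels instead of the gfx1151 ones
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    ./build/bin/llama-bench -m model.gguf -ngl 99 -fa 1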

2 Likes

That’s very helpful. Thanks again @lhl

1 Like

This is incredibly good!

This looks promising enough that I have to consider the Framework Desktop again.

Using ROCm 7 RC + WMMA FA with llama.cpp:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp1 | 45.96 ± 0.14 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp1 | 46.09 ± 0.04 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2 | 57.97 ± 1.40 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp4 | 91.34 ± 2.48 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp8 | 129.47 ± 7.63 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp16 | 208.04 ± 4.68 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp32 | 244.28 ± 9.17 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp48 | 223.12 ± 12.43 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp64 | 315.89 ± 8.22 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp96 | 390.09 ± 6.75 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp128 | 451.03 ± 4.94 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp192 | 517.76 ± 2.36 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp256 | 597.15 ± 10.53 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp384 | 709.08 ± 7.95 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp512 | 775.60 ± 3.08 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp768 | 852.64 ± 4.50 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp1024 | 932.50 ± 6.08 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp1536 | 992.20 ± 1.94 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 | 1020.32 ± 10.73 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp3072 | 939.97 ± 12.84 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp4096 | 962.18 ± 1.20 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg16 | 46.08 ± 0.01 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp512+tg64 | 270.98 ± 0.78 |
2 Likes

Is this faster than Vulkan? Any Vulkan tests to compare?

More benchmarks here:

Just to note: the default ubatch is 512; you can get better results on long contexts with a ubatch of 2048.
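For example, something along these lines with llama-bench (the model path is a placeholder):

    # Default ubatch is 512; -ub 2048 generally helps pp at longer prompt lengths
    ./build/bin/llama-bench -m model.gguf -ngl 99 -fa 1 -ub 2048 -p 4096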

1 Like

No change for ROCm 7 over 6. Vulkan has more wins on the token generation side too. It will probably take version 8 before ROCm equals or surpasses Vulkan. AMD is also discontinuing its Vulkan driver and supporting Mesa RADV on Linux; hopefully this means we’ll get ROCm improvements sooner.

Vulkan doesn’t like large BF16 models :wink:

Some more benchmarks with Mistral Nemo (12.25 B):

  • Vulkan with Mesa 25.1.9 (fc42)
  • ROCm: 7.0 RC + WMMA
model size backend threads type_k type_v test t/s
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp1 4.95 ± 0.00
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp1 4.95 ± 0.00
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp2 9.27 ± 0.01
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp4 18.25 ± 0.00
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp8 35.12 ± 0.14
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp16 66.50 ± 0.10
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp32 114.29 ± 0.07
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp48 151.03 ± 0.09
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp64 168.75 ± 0.04
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp128 176.05 ± 0.42
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp192 175.46 ± 0.13
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp256 179.09 ± 0.04
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp384 183.79 ± 0.24
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp512 181.87 ± 0.24
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp768 177.31 ± 0.20
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 pp1024 180.00 ± 0.04
Nemo BF16 22.81 GiB CPU 16 bf16 bf16 tg16 4.97 ± 0.00
model size backend ngl mmap test t/s
Nemo BF16 22.81 GiB Vulkan 999 0 pp1 9.10 ± 0.00
Nemo BF16 22.81 GiB Vulkan 999 0 pp1 9.11 ± 0.01
Nemo BF16 22.81 GiB Vulkan 999 0 pp2 17.31 ± 0.01
Nemo BF16 22.81 GiB Vulkan 999 0 pp4 33.07 ± 0.05
Nemo BF16 22.81 GiB Vulkan 999 0 pp8 60.65 ± 0.14
Nemo BF16 22.81 GiB Vulkan 999 0 pp16 60.51 ± 0.21
Nemo BF16 22.81 GiB Vulkan 999 0 pp32 120.04 ± 0.21
Nemo BF16 22.81 GiB Vulkan 999 0 pp48 138.00 ± 0.34
Nemo BF16 22.81 GiB Vulkan 999 0 pp64 188.94 ± 1.23
Nemo BF16 22.81 GiB Vulkan 999 0 pp128 216.68 ± 1.22
Nemo BF16 22.81 GiB Vulkan 999 0 pp192 180.85 ± 2.07
Nemo BF16 22.81 GiB Vulkan 999 0 pp256 198.99 ± 1.43
Nemo BF16 22.81 GiB Vulkan 999 0 pp384 226.39 ± 1.15
Nemo BF16 22.81 GiB Vulkan 999 0 pp512 233.00 ± 1.73
Nemo BF16 22.81 GiB Vulkan 999 0 pp768 219.10 ± 1.23
Nemo BF16 22.81 GiB Vulkan 999 0 pp1024 222.80 ± 1.32
Nemo BF16 22.81 GiB Vulkan 999 0 pp1536 222.22 ± 0.06
Nemo BF16 22.81 GiB Vulkan 999 0 pp2048 217.11 ± 0.20
Nemo BF16 22.81 GiB Vulkan 999 0 pp3072 215.13 ± 0.65
Nemo BF16 22.81 GiB Vulkan 999 0 pp4096 211.14 ± 0.20
Nemo BF16 22.81 GiB Vulkan 999 0 tg16 9.11 ± 0.00
model size backend ngl n_ubatch fa mmap test t/s
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp1 9.05 ± 0.00
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp1 9.06 ± 0.00
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp2 17.44 ± 0.01
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp4 33.31 ± 0.02
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp8 60.42 ± 0.12
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp16 60.37 ± 0.05
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp32 120.18 ± 0.40
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp48 137.04 ± 0.30
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp64 188.16 ± 0.89
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp128 217.18 ± 0.89
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp192 181.06 ± 1.79
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp256 200.64 ± 1.98
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp384 231.35 ± 1.20
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp512 240.47 ± 0.46
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp768 254.99 ± 0.13
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp1024 267.15 ± 0.42
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp1536 270.17 ± 0.37
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp2048 270.73 ± 0.15
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp3072 263.42 ± 0.55
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 pp4096 259.99 ± 0.67
Nemo BF16 22.81 GiB Vulkan 999 4096 1 0 tg16 9.06 ± 0.00
model size backend ngl n_ubatch mmap test t/s
Nemo BF16 22.81 GiB ROCm WMMA 999 4096 0 pp1 9.12 ± 0.01
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp1 9.12 ± 0.00
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp2 18.34 ± 0.01
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp4 26.14 ± 0.03
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp8 51.41 ± 0.07
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp16 100.74 ± 0.14
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp32 188.17 ± 0.29
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp48 254.40 ± 1.25
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp64 289.35 ± 1.97
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp128 502.12 ± 2.10
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp192 438.20 ± 0.86
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp256 554.11 ± 1.67
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp384 674.47 ± 3.94
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp512 750.69 ± 2.51
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp768 676.79 ± 1.31
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp1024 701.14 ± 1.74
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp1536 657.02 ± 1.08
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp2048 608.92 ± 1.32
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp3072 571.98 ± 0.76
Nemo BF16 22.81 GiB ROCm 999 4096 0 pp4096 536.20 ± 3.82
Nemo BF16 22.81 GiB ROCm 999 4096 0 tg16 9.12 ± 0.00
model size backend ngl n_ubatch fa mmap test t/s
Nemo BF16 22.81 GiB ROCm WMMA 999 4096 1 0 pp1 9.12 ± 0.01
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp1 9.12 ± 0.01
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp2 18.16 ± 0.01
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp4 25.68 ± 0.02
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp8 50.07 ± 0.04
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp16 100.46 ± 0.25
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp32 188.77 ± 0.14
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp48 250.75 ± 1.11
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp64 285.20 ± 1.65
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp128 497.93 ± 3.36
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp192 440.71 ± 1.45
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp256 568.84 ± 0.66
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp384 706.95 ± 1.18
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp512 833.50 ± 3.12
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp768 758.99 ± 1.01
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp1024 857.05 ± 4.48
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp1536 852.12 ± 1.30
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp2048 856.53 ± 2.11
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp3072 789.06 ± 1.45
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 pp4096 739.43 ± 0.81
Nemo BF16 22.81 GiB ROCm 999 4096 1 0 tg16 9.13 ± 0.00
2 Likes

Hi! Thanks a lot. Do you have a link to a tutorial for how you got it running with ROCm 7 and Vulkan?

Thanks in advance!

For those who looked at the benchmarks published here: https://www.phoronix.com/review/amd-rocm-7-strix-halo/3 — as reported in the comments there, something is wrong with the ROCm benchmarks. The results I get:

Qwen_Qwen3-8B (params = 8.19 B)

TheRock 7.0 RC, using the “rocm-7rc-rocwmma” toolbox from https://github.com/kyuz0/amd-strix-halo-toolboxes
n_ubatch=2048, fa=1, mmap=0
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: d304f459 (6502)

model size test HIPBLASLT OFF t/s HIPBLASLT ON t/s
qwen3 8B BF16 15.26 GiB pp1 12.93 ± 0.02 12.95 ± 0.02
qwen3 8B BF16 15.26 GiB pp2 26.01 ± 0.03 26.02 ± 0.02
qwen3 8B BF16 15.26 GiB pp4 28.40 ± 0.04 20.37 ± 0.13
qwen3 8B BF16 15.26 GiB pp6 41.74 ± 0.04 29.80 ± 0.12
qwen3 8B BF16 15.26 GiB pp8 55.33 ± 0.08 39.67 ± 0.11
qwen3 8B BF16 15.26 GiB pp16 112.70 ± 0.24 81.55 ± 0.24
qwen3 8B BF16 15.26 GiB pp24 165.72 ± 0.30 120.44 ± 0.41
qwen3 8B BF16 15.26 GiB pp32 215.53 ± 0.44 159.88 ± 0.85
qwen3 8B BF16 15.26 GiB pp48 298.57 ± 0.94 223.21 ± 0.39
qwen3 8B BF16 15.26 GiB pp64 362.05 ± 2.13 290.74 ± 1.04
qwen3 8B BF16 15.26 GiB pp96 442.30 ± 1.39 407.15 ± 2.10
qwen3 8B BF16 15.26 GiB pp128 571.41 ± 2.80 388.00 ± 1.34
qwen3 8B BF16 15.26 GiB pp192 526.04 ± 1.16 518.43 ± 2.56
qwen3 8B BF16 15.26 GiB pp256 665.36 ± 1.33 570.58 ± 2.91
qwen3 8B BF16 15.26 GiB pp384 884.04 ± 3.37 844.08 ± 4.72
qwen3 8B BF16 15.26 GiB pp512 1085.20 ± 2.81 1040.50 ± 3.02
qwen3 8B BF16 15.26 GiB pp768 1050.38 ± 6.90 1210.20 ± 1.64
qwen3 8B BF16 15.26 GiB pp1024 1132.21 ± 0.96 1116.82 ± 2.52
qwen3 8B BF16 15.26 GiB pp1536 1149.04 ± 0.78 1269.52 ± 1.70
qwen3 8B BF16 15.26 GiB pp2048 1143.40 ± 0.89 1213.20 ± 1.84
qwen3 8B BF16 15.26 GiB pp3072 1037.59 ± 0.66 1081.71 ± 2.06
qwen3 8B BF16 15.26 GiB pp4096 965.97 ± 2.21 1033.37 ± 1.18
qwen3 8B BF16 15.26 GiB tg16 12.93 ± 0.01 12.94 ± 0.00
qwen3 8B BF16 15.26 GiB pp512+tg64 105.00 ± 0.01 104.66 ± 0.03
qwen3 8B F16 15.26 GiB pp1 13.00 ± 0.01 12.98 ± 0.01
qwen3 8B F16 15.26 GiB pp2 26.07 ± 0.01 26.04 ± 0.03
qwen3 8B F16 15.26 GiB pp4 49.20 ± 0.03 49.20 ± 0.06
qwen3 8B F16 15.26 GiB pp6 41.93 ± 0.04 35.49 ± 0.08
qwen3 8B F16 15.26 GiB pp8 55.68 ± 0.01 46.91 ± 0.19
qwen3 8B F16 15.26 GiB pp16 112.61 ± 0.29 94.67 ± 0.31
qwen3 8B F16 15.26 GiB pp24 166.39 ± 0.39 141.30 ± 0.76
qwen3 8B F16 15.26 GiB pp32 216.75 ± 0.73 187.52 ± 0.78
qwen3 8B F16 15.26 GiB pp48 299.06 ± 0.82 263.33 ± 0.96
qwen3 8B F16 15.26 GiB pp64 362.20 ± 1.82 344.77 ± 0.97
qwen3 8B F16 15.26 GiB pp96 505.39 ± 2.44 475.28 ± 2.43
qwen3 8B F16 15.26 GiB pp128 621.02 ± 3.09 469.88 ± 2.35
qwen3 8B F16 15.26 GiB pp192 736.27 ± 3.05 592.13 ± 2.54
qwen3 8B F16 15.26 GiB pp256 879.50 ± 3.10 591.46 ± 3.75
qwen3 8B F16 15.26 GiB pp384 876.07 ± 2.75 937.11 ± 8.14
qwen3 8B F16 15.26 GiB pp512 1063.39 ± 3.14 1159.14 ± 3.20
qwen3 8B F16 15.26 GiB pp768 990.83 ± 2.54 1222.91 ± 1.98
qwen3 8B F16 15.26 GiB pp1024 1123.21 ± 2.67 1158.03 ± 4.19
qwen3 8B F16 15.26 GiB pp1536 1130.37 ± 2.98 1257.22 ± 1.80
qwen3 8B F16 15.26 GiB pp2048 1134.58 ± 2.64 1200.06 ± 1.42
qwen3 8B F16 15.26 GiB pp3072 1029.10 ± 1.27 1084.31 ± 1.60
qwen3 8B F16 15.26 GiB pp4096 951.39 ± 1.43 1026.12 ± 1.27
qwen3 8B F16 15.26 GiB tg16 13.00 ± 0.00 12.98 ± 0.00
qwen3 8B F16 15.26 GiB pp512+tg64 105.19 ± 0.03 105.93 ± 0.03
qwen3 8B Q8_0 8.11 GiB pp1 25.55 ± 0.01 25.52 ± 0.01
qwen3 8B Q8_0 8.11 GiB pp2 50.09 ± 0.02 50.15 ± 0.02
qwen3 8B Q8_0 8.11 GiB pp4 94.66 ± 0.07 94.73 ± 0.04
qwen3 8B Q8_0 8.11 GiB pp6 131.90 ± 0.12 131.79 ± 0.07
qwen3 8B Q8_0 8.11 GiB pp8 171.87 ± 0.21 171.79 ± 0.17
qwen3 8B Q8_0 8.11 GiB pp16 331.09 ± 0.44 331.19 ± 0.44
qwen3 8B Q8_0 8.11 GiB pp24 452.01 ± 0.78 451.89 ± 0.81
qwen3 8B Q8_0 8.11 GiB pp32 546.79 ± 0.40 547.36 ± 0.30
qwen3 8B Q8_0 8.11 GiB pp48 636.67 ± 0.80 637.28 ± 0.86
qwen3 8B Q8_0 8.11 GiB pp64 226.32 ± 0.55 219.86 ± 0.68
qwen3 8B Q8_0 8.11 GiB pp96 321.10 ± 0.66 314.47 ± 0.41
qwen3 8B Q8_0 8.11 GiB pp128 414.82 ± 0.88 332.79 ± 1.33
qwen3 8B Q8_0 8.11 GiB pp192 479.20 ± 2.61 417.33 ± 4.13
qwen3 8B Q8_0 8.11 GiB pp256 596.42 ± 2.24 432.78 ± 1.38
qwen3 8B Q8_0 8.11 GiB pp384 624.29 ± 4.09 690.14 ± 2.43
qwen3 8B Q8_0 8.11 GiB pp512 797.25 ± 2.31 898.02 ± 2.00
qwen3 8B Q8_0 8.11 GiB pp768 819.61 ± 1.67 1026.54 ± 2.85
qwen3 8B Q8_0 8.11 GiB pp1024 944.47 ± 2.79 1011.68 ± 2.15
qwen3 8B Q8_0 8.11 GiB pp1536 970.48 ± 3.37 1131.98 ± 0.92
qwen3 8B Q8_0 8.11 GiB pp2048 994.78 ± 2.03 1089.52 ± 1.08
qwen3 8B Q8_0 8.11 GiB pp3072 897.04 ± 1.53 984.06 ± 0.83
qwen3 8B Q8_0 8.11 GiB pp4096 846.70 ± 0.89 940.87 ± 1.67
qwen3 8B Q8_0 8.11 GiB tg16 25.55 ± 0.00 25.55 ± 0.00
qwen3 8B Q8_0 8.11 GiB pp512+tg64 179.15 ± 0.09 183.12 ± 0.05
qwen3 8B Q6_K 6.54 GiB pp1 30.36 ± 0.03 30.30 ± 0.03
qwen3 8B Q6_K 6.54 GiB pp2 59.28 ± 0.04 59.27 ± 0.06
qwen3 8B Q6_K 6.54 GiB pp4 109.74 ± 0.23 109.72 ± 0.24
qwen3 8B Q6_K 6.54 GiB pp6 147.07 ± 0.10 146.93 ± 0.08
qwen3 8B Q6_K 6.54 GiB pp8 174.82 ± 0.11 174.76 ± 0.12
qwen3 8B Q6_K 6.54 GiB pp16 326.67 ± 0.34 326.77 ± 0.50
qwen3 8B Q6_K 6.54 GiB pp24 406.59 ± 0.72 407.66 ± 0.57
qwen3 8B Q6_K 6.54 GiB pp32 433.06 ± 0.65 432.86 ± 0.53
qwen3 8B Q6_K 6.54 GiB pp48 486.84 ± 0.67 486.43 ± 0.99
qwen3 8B Q6_K 6.54 GiB pp64 236.55 ± 0.60 230.69 ± 0.69
qwen3 8B Q6_K 6.54 GiB pp96 335.34 ± 0.26 327.46 ± 0.66
qwen3 8B Q6_K 6.54 GiB pp128 430.09 ± 1.09 341.29 ± 1.52
qwen3 8B Q6_K 6.54 GiB pp192 498.83 ± 2.61 432.89 ± 1.25
qwen3 8B Q6_K 6.54 GiB pp256 627.50 ± 1.53 438.21 ± 2.41
qwen3 8B Q6_K 6.54 GiB pp384 628.97 ± 3.51 720.54 ± 3.39
qwen3 8B Q6_K 6.54 GiB pp512 809.68 ± 2.35 919.29 ± 2.92
qwen3 8B Q6_K 6.54 GiB pp768 829.90 ± 2.82 1046.32 ± 1.71
qwen3 8B Q6_K 6.54 GiB pp1024 954.67 ± 1.05 1025.66 ± 1.80
qwen3 8B Q6_K 6.54 GiB pp1536 978.48 ± 2.23 1139.07 ± 1.35
qwen3 8B Q6_K 6.54 GiB pp2048 987.67 ± 1.64 1103.25 ± 0.64
qwen3 8B Q6_K 6.54 GiB pp3072 905.59 ± 3.98 995.22 ± 1.31
qwen3 8B Q6_K 6.54 GiB pp4096 849.35 ± 1.31 957.84 ± 0.85
qwen3 8B Q6_K 6.54 GiB tg16 30.38 ± 0.01 30.30 ± 0.00
qwen3 8B Q6_K 6.54 GiB pp512+tg64 205.66 ± 0.17 210.31 ± 0.16
qwen3 8B Q5_K_M 5.80 GiB pp1 33.41 ± 0.07 33.45 ± 0.08
qwen3 8B Q5_K_M 5.80 GiB pp2 60.18 ± 0.05 60.28 ± 0.06
qwen3 8B Q5_K_M 5.80 GiB pp4 98.72 ± 0.09 98.82 ± 0.09
qwen3 8B Q5_K_M 5.80 GiB pp6 117.95 ± 0.09 117.70 ± 0.04
qwen3 8B Q5_K_M 5.80 GiB pp8 133.76 ± 0.24 133.63 ± 0.09
qwen3 8B Q5_K_M 5.80 GiB pp16 361.90 ± 0.44 361.43 ± 0.47
qwen3 8B Q5_K_M 5.80 GiB pp24 452.90 ± 0.41 452.23 ± 0.28
qwen3 8B Q5_K_M 5.80 GiB pp32 469.04 ± 0.62 468.04 ± 0.22
qwen3 8B Q5_K_M 5.80 GiB pp48 332.68 ± 0.64 332.23 ± 0.44
qwen3 8B Q5_K_M 5.80 GiB pp64 232.55 ± 1.14 224.87 ± 0.30
qwen3 8B Q5_K_M 5.80 GiB pp96 330.34 ± 1.25 322.97 ± 0.67
qwen3 8B Q5_K_M 5.80 GiB pp128 423.84 ± 0.72 348.91 ± 1.43
qwen3 8B Q5_K_M 5.80 GiB pp192 485.42 ± 2.80 431.48 ± 3.15
qwen3 8B Q5_K_M 5.80 GiB pp256 617.08 ± 2.59 437.79 ± 2.02
qwen3 8B Q5_K_M 5.80 GiB pp384 639.67 ± 2.64 706.98 ± 0.91
qwen3 8B Q5_K_M 5.80 GiB pp512 810.94 ± 2.23 911.49 ± 2.45
qwen3 8B Q5_K_M 5.80 GiB pp768 827.59 ± 2.78 1027.47 ± 2.00
qwen3 8B Q5_K_M 5.80 GiB pp1024 961.61 ± 2.44 1026.27 ± 3.06
qwen3 8B Q5_K_M 5.80 GiB pp1536 974.74 ± 2.51 1122.00 ± 1.15
qwen3 8B Q5_K_M 5.80 GiB pp2048 987.61 ± 1.28 1095.68 ± 1.27
qwen3 8B Q5_K_M 5.80 GiB pp3072 894.33 ± 2.65 992.42 ± 0.88
qwen3 8B Q5_K_M 5.80 GiB pp4096 856.71 ± 0.84 953.10 ± 1.15
qwen3 8B Q5_K_M 5.80 GiB tg16 33.49 ± 0.01 33.49 ± 0.01
qwen3 8B Q5_K_M 5.80 GiB pp512+tg64 220.58 ± 0.18 226.53 ± 0.14

I have been bringing up docs in an AI section of the Strix Halo HomeLab wiki: AI-Capabilities-Overview – Strix Halo HomeLab (and when I get less busy I will be updating all my other docs to point to that and my GitHub lhl/strix-halo-testing repo as the latest “sources of truth”).

Vulkan is easy; just follow the llama.cpp build instructions. I made a doc a while back on proper llama.cpp compiles w/ ROCm: https://strixhalo-homelab.d7.wtf/AI/llamacpp-with-ROCm
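Roughly, the two builds look like this (a sketch following the upstream llama.cpp build docs; see the linked doc for the exact, current flags and the gfx1151 specifics):

    # HIP/ROCm backend targeting gfx1151, with rocWMMA-based flash attention
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
      cmake -S . -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
            -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build-rocm --config Release -j

    # Vulkan backend
    cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build-vulkan --config Release -j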

2 Likes

Thank you - I’ll give it a try

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 8B BF16 | 15.26 GiB | 8.19 B | ROCm 7.0.1 | 999 | 4096 | 1 | 0 | pp512 | 325.95 ± 0.22 |
| qwen3 8B BF16 | 15.26 GiB | 8.19 B | ROCm 6.4.4 | 999 | 4096 | 1 | 0 | pp512 | 1132.26 ± 2.42 |
1 Like

That’s quite a difference. What numbers are you getting with the Vulkan backend?
Also, what is your token generation speed?

@kyuz0 did a great job comparing many cases:

The only thing missing is benchmarks with the new official ROCm 7.0.1. There are some reviews/benchmarks (like https://www.phoronix.com/review/amd-rocm-7-strix-halo/3) that use the new official ROCm 7 and show really bad results. What I didn’t know was whether the problem was with the llama.cpp build or with ROCm. I get the same results on ROCm 7, so the “problem” is with the official ROCm 7.0.1.

As kyuz0 says, if you want good performance with llama.cpp, use a TheRock build or the latest 6.4.4 series.

Note: 6.4.3 did not have bad results, but it may have more stability bugs.

2 Likes