DGX Spark vs. Strix Halo - Initial Impressions

Getting 20 t/s on dual Sparks using vLLM in tensor-parallel mode over InfiniBand with RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4.

Same workflow running over Ethernet was giving me 16 t/s.

Same physical port and cable.
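
For anyone wanting to reproduce this, here is a rough sketch of a two-node vLLM tensor-parallel launch. The interface name, head-node IP, and context length are placeholders rather than the exact values used: vLLM spans both Sparks via a Ray cluster, and NCCL environment variables decide whether traffic goes over IB verbs or plain TCP on the same port.

```bash
# Node 1 (head): start the Ray cluster that vLLM uses for multi-node tensor parallelism
ray start --head --port=6379

# Node 2 (worker): join over the point-to-point link (placeholder IP)
ray start --address=192.168.100.1:6379

# Head node: serve the model split across both Sparks.
# NCCL_SOCKET_IFNAME (placeholder interface name) and NCCL_IB_DISABLE
# choose between RDMA and plain TCP over the same physical port.
NCCL_SOCKET_IFNAME=enp1s0f0 NCCL_IB_DISABLE=0 \
vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 32768
```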


It turned out GB10 is not yet well optimized for FP4 quants, so AWQ gave me 25 t/s on the same model.

Also, 40 t/s on Minimax M2 in AWQ 4-bit is very usable for coding.

Wow, I was able to run GLM-4.6 in 4-bit AWQ on my dual Sparks and the performance was acceptable. 16 t/s is not fast by any measure, but usable. Prompt processing speeds were pretty decent too.

Could only fit 50K context. I guess if I optimized my memory footprint, I could ramp it up to 64K.
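
The usual vLLM knobs for trading memory headroom for context are sketched below; the model path is a placeholder and the flag values are illustrative, not the settings from the run above.

```bash
# GLM_AWQ is a placeholder: point it at whatever 4-bit AWQ checkpoint is being served.
GLM_AWQ=/path/to/glm-4.6-awq
vllm serve "$GLM_AWQ" \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
# --max-model-len            target 64K context instead of 50K
# --gpu-memory-utilization   hand vLLM a larger fraction of the unified memory
# --kv-cache-dtype fp8       roughly halves KV-cache size vs. 16-bit
```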

The latest llama.cpp improvements for Blackwell brought a noticeable bump in performance on DGX Spark for gpt-oss:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |

build: f5acfb2ff (7535)
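
The test names in the table (pp2048, tg32, and the d depths) match the llama-bench invocation shown later in the thread for the ROCm runs, so the CUDA numbers should be reproducible with the same sweep; the build directory name below is just a guess.

```bash
# Same sweep as the ROCm runs further down, against the CUDA build
# (build.cuda is an assumed directory name).
build.cuda/bin/llama-bench \
  -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```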

Meanwhile, there was a big performance regression on my Strix Halo with ROCm. I finally solved it by using ROCm 6.4.4 from the Fedora 43 packages instead of the nightly builds from TheRock, which had worked just fine all this time.

Also, the most recent Fedora 43 update broke ROCm altogether: nothing worked until I rolled back to the 6.17.8 kernel. Even 6.17.11, which worked before, is now borked. I suspect this is related to recent AMD GPU firmware changes.
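
For reference, building llama.cpp against whichever ROCm is installed goes roughly like this (flag names per llama.cpp's HIP build docs; gfx1151 is the Strix Halo GPU target, and the build directory name is arbitrary):

```bash
# Build llama.cpp with the HIP/ROCm backend; the toolchain picked up here is
# whatever hipconfig resolves to (distro packages or /opt/rocm).
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build.rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build.rocm --config Release -j
```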

ROCm 6.4.4 / Linux 6.17.8

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1037.00 ± 3.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 51.20 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 842.89 ± 2.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.03 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 703.82 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.32 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 522.96 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 44.02 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 344.39 ± 0.91 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 38.87 ± 0.01 |

build: f5acfb2ff (7535)

ROCm 7.11.0a20251222

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 558.11 ± 2.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 52.41 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 499.19 ± 1.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.91 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 445.21 ± 1.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.68 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 363.47 ± 0.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 43.07 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 265.62 ± 1.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 36.48 ± 0.01 |

build: f5acfb2ff (7535)


Lots of interesting information in this comment: Misc. bug: Performance regression using ROCm on Strix Halo · Issue #17917 · ggml-org/llama.cpp · GitHub

So, essentially, to get ROCm 7 performance back, one needs to set the runtime environment variable: ROCBLAS_USE_HIPBLASLT_BATCHED=0

I’m now getting identical performance to ROCm 6.4.4:

ROCm 7.11.0a20251222 after setting ROCBLAS_USE_HIPBLASLT_BATCHED=0:

```bash
ROCBLAS_USE_HIPBLASLT_BATCHED=0 build.rocm7/bin/llama-bench \
  -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1035.20 ± 5.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 51.24 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 841.99 ± 3.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.04 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 706.15 ± 0.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.39 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 523.87 ± 0.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 44.06 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 346.71 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 38.91 ± 0.00 |

build: f5acfb2ff (7535)


Well, actually, I just mixed up my build folders. ROCm 7 performance is still degraded with the latest llama.cpp builds, and ROCBLAS_USE_HIPBLASLT_BATCHED=0 doesn't make any difference :frowning: Back to 6.4.4 it is.

For me, ROCm 7.1 / 7.9 work, but there is a regression with ROCm 7.10+, and only on quantized models, not BF16/FP16. So it makes sense that ROCBLAS_USE_HIPBLASLT_BATCHED, which isn't used in that case, didn't change anything.

@Eugr thanks for all your work !!!