Getting 20 t/s on dual Sparks using vLLM in tensor-parallel mode over InfiniBand with RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4.
The same workflow running over Ethernet was giving me 16 t/s.
Same physical port and cable.
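For anyone wanting to A/B this themselves, here's a rough sketch of the NCCL knobs that steer traffic onto one transport or the other; the HCA and interface names are placeholders, not necessarily what my Sparks use:

```bash
# Hedged sketch: forcing NCCL onto the RDMA (InfiniBand) path or the plain TCP path
# when testing the same ConnectX port in both link modes. Check `ibv_devices` and
# `ip link` for the real device/interface names on your machines.

# InfiniBand / RDMA path:
export NCCL_IB_DISABLE=0        # allow the IB transport
export NCCL_IB_HCA=mlx5_0       # pin NCCL to the ConnectX HCA (placeholder name)

# Plain Ethernet (TCP sockets) path, for comparison:
# export NCCL_IB_DISABLE=1            # fall back to the socket transport
# export NCCL_SOCKET_IFNAME=enp1s0f0  # interface for TCP traffic (placeholder name)

export NCCL_DEBUG=INFO          # logs which transport NCCL actually picked
```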
It turned out the GB10 is not yet optimized for FP4 quants, so AWQ gave me 25 t/s on the same model.
Also, 40 t/s on MiniMax M2 in AWQ 4-bit is very usable for coding.
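For reference, the two-Spark setup is just the standard Ray-based multi-node vLLM launch; a rough sketch, with the head-node IP and the AWQ model repo as placeholders rather than the exact ones I used:

```bash
# Hedged sketch of a dual-node tensor-parallel launch (one GPU per Spark).

# On the first Spark (Ray head node):
ray start --head --port=6379

# On the second Spark (Ray worker), pointing at the head node's IP (placeholder):
ray start --address='192.168.1.10:6379'

# Back on the head node, serve with TP=2 spanning both machines:
vllm serve <your-AWQ-quantized-model-repo> \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 --port 8000
```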
Wow, I was able to run GLM-4.6 in 4-bit AWQ on my dual Sparks and the performance was acceptable. 16 t/s is not fast by any measure, but usable. Prompt processing speeds were pretty decent too.
I could only fit a 50K context. I guess if I optimized my memory footprint, I could ramp it up to 64K.
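The memory-footprint knobs I'd look at are the usual vLLM ones; a hedged sketch below, with the GLM-4.6 AWQ repo name as a placeholder and no guarantee every option helps on this particular model:

```bash
# Hedged sketch of options for squeezing a 64K context into the same memory budget.
#   --gpu-memory-utilization 0.95 -> give vLLM a larger share of GPU memory (default 0.9)
#   --kv-cache-dtype fp8          -> roughly halves KV-cache size vs FP16/BF16
#   --enforce-eager               -> skip CUDA graph capture, saving some extra VRAM
vllm serve <your-GLM-4.6-AWQ-repo> \
    --tensor-parallel-size 2 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.95 \
    --kv-cache-dtype fp8 \
    --enforce-eager
```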
The latest llama.cpp improvements for Blackwell brought a noticeable bump in performance on DGX Spark for gpt-oss:
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |
build: f5acfb2ff (7535)
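For reference, the llama-bench flag set that corresponds to the test columns above (and that I use for the ROCm runs below); the CUDA build directory name here is just illustrative:

```bash
# pp2048 comes from -p 2048, tg32 from -n 32, and the "@ dN" rows from -d.
build.cuda/bin/llama-bench \
    -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```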
Meanwhile, there was a big performance regression on my Strix Halo with ROCm. I finally solved it by using ROCm 6.4.4 from the Fedora 43 packages instead of the nightly build from TheRock, which had worked just fine all this time.
Also, the most recent Fedora 43 update broke ROCm altogether - nothing worked until I rolled back to the 6.17.8 kernel. Even 6.17.11, which worked before, is borked - I suspect this is related to the recent AMD GPU firmware changes.
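Rebuilding llama.cpp against the distro ROCm is the standard GGML_HIP CMake path; a minimal sketch, assuming the Fedora ROCm packages are installed and using gfx1151 (the Strix Halo iGPU) as the target:

```bash
# Hedged sketch of a llama.cpp HIP build against the system ROCm 6.4.4 install.
cmake -S . -B build.rocm644 \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build.rocm644 --config Release -j
```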
ROCm 6.4.4 / Linux 6.17.8
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1037.00 ± 3.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 51.20 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 842.89 ± 2.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.03 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 703.82 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.32 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 522.96 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 44.02 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 344.39 ± 0.91 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 38.87 ± 0.01 |
build: f5acfb2ff (7535)
ROCm 7.11.0a20251222
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 558.11 ± 2.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 52.41 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 499.19 ± 1.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.91 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 445.21 ± 1.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.68 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 363.47 ± 0.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 43.07 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 265.62 ± 1.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 36.48 ± 0.01 |
build: f5acfb2ff (7535)
Lots of interesting information in this comment: [Misc. bug: Performance regression using ROCm on Strix Halo · Issue #17917 · ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/issues/17917)
So, essentially, to get ROCm 7 performance back, one needs to set the runtime environment variable `ROCBLAS_USE_HIPBLASLT_BATCHED=0`.
I’m now getting identical performance to ROCm 6.4.4:
ROCm 7.11.0a20251222 after setting `ROCBLAS_USE_HIPBLASLT_BATCHED=0`:
`ROCBLAS_USE_HIPBLASLT_BATCHED=0 build.rocm7/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0`
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1035.20 ± 5.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 51.24 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 841.99 ± 3.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.04 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 706.15 ± 0.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.39 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 523.87 ± 0.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 44.06 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 346.71 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 38.91 ± 0.00 |
build: f5acfb2ff (7535)
Well, actually, I just messed up my folders. ROCm 7 performance is still degraded with the latest llama.cpp builds, and `ROCBLAS_USE_HIPBLASLT_BATCHED=0` doesn't make any difference.
Back to 6.4.4 it is.
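To avoid mixing up folders again, a hedged sketch of how the two stacks can be kept clearly apart when A/B testing; the paths are placeholders (the Fedora packages live under /usr, while a ROCm 7 nightly typically sits under its own prefix):

```bash
# One llama.cpp build directory per ROCm install, plus a quick check of which
# runtime each binary actually resolves before trusting any numbers.
ldd build.rocm644/bin/llama-bench | grep -E 'rocblas|hipblaslt|amdhip'
ldd build.rocm7/bin/llama-bench   | grep -E 'rocblas|hipblaslt|amdhip'

# If the ROCm 7 install isn't on the default loader path, point at it explicitly:
LD_LIBRARY_PATH=/opt/rocm-7/lib build.rocm7/bin/llama-bench \
    -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    -fa 1 -p 2048 -n 32 -ub 2048 -mmp 0
```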
For me, ROCm 7.1 / 7.9 work, but there's a regression with ROCm 7.10+ …
but only on quantized models, not BF16/FP16. So it makes sense that `ROCBLAS_USE_HIPBLASLT_BATCHED`, which isn't used in that case, didn't change anything.
@Eugr thanks for all your work !!!