Getting 20 t/s on dual Sparks using vLLM in tensor-parallel mode over InfiniBand with RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4.
The same workflow running over Ethernet was giving me 16 t/s.
Same physical port and cable.
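For anyone wanting to A/B this themselves, here's a rough sketch of the NCCL knobs that steer traffic onto one transport or the other; the HCA and interface names are placeholders, not necessarily what my Sparks use:

```bash
# Hedged sketch: forcing NCCL onto the RDMA (InfiniBand) path or the plain TCP path
# when testing the same ConnectX port in both link modes. Check `ibv_devices` and
# `ip link` for the real device/interface names on your machines.

# InfiniBand / RDMA path:
export NCCL_IB_DISABLE=0        # allow the IB transport
export NCCL_IB_HCA=mlx5_0       # pin NCCL to the ConnectX HCA (placeholder name)

# Plain Ethernet (TCP sockets) path, for comparison:
# export NCCL_IB_DISABLE=1            # fall back to the socket transport
# export NCCL_SOCKET_IFNAME=enp1s0f0  # interface for TCP traffic (placeholder name)

export NCCL_DEBUG=INFO          # logs which transport NCCL actually picked
```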
It turned out the GB10 is not yet optimized for FP4 quants, so AWQ gave me 25 t/s on the same model.
Also, 40 t/s on MiniMax M2 in AWQ 4-bit is very usable for coding.
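For reference, the two-Spark setup is just the standard Ray-based multi-node vLLM launch; a rough sketch, with the head-node IP and the AWQ model repo as placeholders rather than the exact ones I used:

```bash
# Hedged sketch of a dual-node tensor-parallel launch (one GPU per Spark).

# On the first Spark (Ray head node):
ray start --head --port=6379

# On the second Spark (Ray worker), pointing at the head node's IP (placeholder):
ray start --address='192.168.1.10:6379'

# Back on the head node, serve with TP=2 spanning both machines:
vllm serve <your-AWQ-quantized-model-repo> \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 --port 8000
```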
Wow, I was able to run GLM-4.6 in 4-bit AWQ on my dual Sparks and the performance was acceptable. 16 t/s is not fast by any measure, but usable. Prompt processing speeds were pretty decent too.
I could only fit a 50K context. I guess if I optimized my memory footprint, I could ramp it up to 64K.
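The memory-footprint knobs I'd look at are the usual vLLM ones; a hedged sketch below, with the GLM-4.6 AWQ repo name as a placeholder and no guarantee every option helps on this particular model:

```bash
# Hedged sketch of options for squeezing a 64K context into the same memory budget.
#   --gpu-memory-utilization 0.95 -> give vLLM a larger share of GPU memory (default 0.9)
#   --kv-cache-dtype fp8          -> roughly halves KV-cache size vs FP16/BF16
#   --enforce-eager               -> skip CUDA graph capture, saving some extra VRAM
vllm serve <your-GLM-4.6-AWQ-repo> \
    --tensor-parallel-size 2 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.95 \
    --kv-cache-dtype fp8 \
    --enforce-eager
```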
The latest llama.cpp improvements for Blackwell brought a noticeable bump in performance on DGX Spark for gpt-oss:
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |
build: f5acfb2ff (7535)
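For reference, the llama-bench flag set that corresponds to the test columns above (and that I use for the ROCm runs below); the CUDA build directory name here is just illustrative:

```bash
# pp2048 comes from -p 2048, tg32 from -n 32, and the "@ dN" rows from -d.
build.cuda/bin/llama-bench \
    -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```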
Meanwhile, there was a big performance regression on my Strix Halo with ROCm. I finally solved it by using ROCm 6.4.4 from the Fedora 43 packages instead of the nightly build from TheRock, which had worked just fine all this time.
Also, the most recent Fedora 43 update broke ROCm altogether - nothing worked until I rolled back to the 6.17.8 kernel. Even 6.17.11, which worked before, is borked - I suspect this is related to the recent AMD GPU firmware changes.
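Rebuilding llama.cpp against the distro ROCm is the standard GGML_HIP CMake path; a minimal sketch, assuming the Fedora ROCm packages are installed and using gfx1151 (the Strix Halo iGPU) as the target:

```bash
# Hedged sketch of a llama.cpp HIP build against the system ROCm 6.4.4 install.
cmake -S . -B build.rocm644 \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build.rocm644 --config Release -j
```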
ROCm 6.4.4 / Linux 6.17.8
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1037.00 ± 3.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 51.20 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 842.89 ± 2.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.03 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 703.82 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.32 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 522.96 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 44.02 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 344.39 ± 0.91 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 38.87 ± 0.01 |
build: f5acfb2ff (7535)
ROCm 7.11.0a20251222
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 558.11 ± 2.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 52.41 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 499.19 ± 1.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.91 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 445.21 ± 1.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.68 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 363.47 ± 0.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 43.07 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 265.62 ± 1.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 36.48 ± 0.01 |
build: f5acfb2ff (7535)
Lots of interesting information in this comment: [Misc. bug: Performance regression using ROCm on Strix Halo · Issue #17917 · ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/issues/17917)
So, essentially, to get ROCm 7 performance back, one needs to set the runtime environment variable `ROCBLAS_USE_HIPBLASLT_BATCHED=0`.
I’m now getting identical performance to ROCm 6.4.4:
ROCm 7.11.0a20251222 after setting `ROCBLAS_USE_HIPBLASLT_BATCHED=0`:
`ROCBLAS_USE_HIPBLASLT_BATCHED=0 build.rocm7/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0`
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1035.20 ± 5.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 51.24 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 841.99 ± 3.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 48.04 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 706.15 ± 0.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 46.39 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 523.87 ± 0.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 44.06 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 346.71 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 38.91 ± 0.00 |
build: f5acfb2ff (7535)
Well, actually, I just messed up my folders. ROCm 7 performance is still degraded with the latest llama.cpp builds, and `ROCBLAS_USE_HIPBLASLT_BATCHED=0` doesn't make any difference.
Back to 6.4.4 it is.
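To avoid mixing up folders again, a hedged sketch of how the two stacks can be kept clearly apart when A/B testing; the paths are placeholders (the Fedora packages live under /usr, while a ROCm 7 nightly typically sits under its own prefix):

```bash
# One llama.cpp build directory per ROCm install, plus a quick check of which
# runtime each binary actually resolves before trusting any numbers.
ldd build.rocm644/bin/llama-bench | grep -E 'rocblas|hipblaslt|amdhip'
ldd build.rocm7/bin/llama-bench   | grep -E 'rocblas|hipblaslt|amdhip'

# If the ROCm 7 install isn't on the default loader path, point at it explicitly:
LD_LIBRARY_PATH=/opt/rocm-7/lib build.rocm7/bin/llama-bench \
    -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    -fa 1 -p 2048 -n 32 -ub 2048 -mmp 0
```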
For me, ROCm 7.1 / 7.9 work, but there's a regression with ROCm 7.10+ …
but only on quantized models, not BF16/FP16. So it makes sense that `ROCBLAS_USE_HIPBLASLT_BATCHED`, which isn't used in that case, didn't change anything.
@Eugr thanks for all your work !!!