DGX Spark vs. Strix Halo - Initial Impressions

Heh, it turned out vLLM compiles and runs just fine without the NVIDIA-provided container. I just needed to set an environment variable specifying the arch.
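For anyone else trying this, the arch is normally pinned through the standard PyTorch arch-list variable; something along these lines should do it (the 12.1 value is my reading of GB10’s compute capability, so verify it first):

# check what the GPU actually reports
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# then pin the arch for the build (assumed variable/value) and build vLLM from a source checkout
export TORCH_CUDA_ARCH_LIST="12.1"
pip install -e .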

In other news, there is some new activity in the amd-dev branch of the vLLM project, so hopefully some improvements are coming in the 0.11.1 release. But the amdsmi Python package is still crashing, so there is that.

What’s strange is that the one from Fedora 42 works.

$ amd-smi version
AMDSMI Tool: 24.7.1+unknown | AMDSMI Library version: 24.7.1.0 | ROCm version: N/A

No, not that one. I mean the amdsmi Python module, and only on cleanup. The amd-smi command-line tool works.
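For anyone who wants to reproduce it, a minimal round-trip through the documented Python API looks roughly like this (my sketch; the crash happens at the shutdown/cleanup end):

# init, enumerate GPUs, then clean up - the last step is where it falls over
python3 -c 'import amdsmi; amdsmi.amdsmi_init(); print(amdsmi.amdsmi_get_processor_handles()); amdsmi.amdsmi_shut_down()'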

Look, there is something wrong with amd-smi on ROCm 7+:

  • Fedora 42 / ROCm 6.3:
$ amd-smi list
GPU: 0
    BDF: 0000:c1:00.0
    UUID: 00ff1586-0000-1000-8000-000000000000
    KFD_ID: 29672
    NODE_ID: 1
    PARTITION_ID: 0
  • Fedora 43 / ROCm 6.4:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
    BDF: 0000:c1:00.0
    UUID: 00ff1586-0000-1000-8000-000000000000
    KFD_ID: 29672
    NODE_ID: 1
    PARTITION_ID: 0
  • Fedora 44 / ROCm 7.0:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
    BDF: N/A
    UUID: N/A
    KFD_ID: 29672
    NODE_ID: 1
    PARTITION_ID: 0

Note: all of this was done in a toolbox running on Silverblue 42 …

Strix Halo on Framework MB:

  • FA: on
  • mmap: off
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
  • ngl: 999
  • n_ubatch=4096
  • backend: rocm
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2 | 9.19 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3 | 11.23 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4 | 12.86 ± 0.01 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp8 | 25.37 ± 0.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp12 | 37.53 ± 0.06 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp16 | 49.17 ± 0.08 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp24 | 70.87 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp32 | 89.94 ± 0.45 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp48 | 122.01 ± 0.61 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp64 | 145.84 ± 0.60 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp96 | 207.52 ± 0.55 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp128 | 269.40 ± 0.95 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp192 | 229.28 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp256 | 291.95 ± 0.70 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp384 | 358.48 ± 0.89 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512 | 418.56 ± 0.65 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp768 | 401.40 ± 1.40 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1024 | 438.28 ± 1.35 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1536 | 439.35 ± 0.80 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2048 | 438.40 ± 1.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3072 | 432.32 ± 0.48 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4096 | 423.00 ± 0.47 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | tg16 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512+tg64 | 38.69 ± 0.01 |
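For reference, those settings correspond to a llama-bench invocation roughly like this (a reconstruction, not the exact command; the model path is a placeholder and the -p/-n/-pg sweep is abbreviated):

# unified memory comes from the environment, the rest are llama-bench flags
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-bench -m Mistral-Small-2506.gguf \
  -fa 1 -mmp 0 -ngl 999 -ub 4096 -p 128,512,2048,4096 -n 16 -pg 512,64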

The user in the toolbox is not a member of the required groups.

:~$ ll /dev/kfd
crw-rw-rw-. 1 root render 235, 0 25 oct.  11:21 /dev/kfd
:~$ ll /dev/dri/renderD128 
crw-rw-rw-. 1 root render 226, 128 25 oct.  11:21 /dev/dri/renderD128

It is not needed with Fedora; there is rw access for all users, so the check is “wrong”. I have never needed it on this OS (maybe it is needed on the Server / CoreOS releases?), and ROCm works fine without it.

⬢ [zzzzzz@toolbx ~]$ getfacl /dev/dri/card1 
getfacl: Removing leading '/' from absolute path names
# file: dev/dri/card1
# owner: nobody
# group: nobody
user::rw-
user:4294967295:rw-
group::rw-
mask::rw-
other::---

And there are user ACL rights too, on cardN.

I tried to add the groups:

sudo usermod -a -G video,render $LOGNAME

But it did not work; it did not add the user to the groups.

Edit: I found how to add the user to these groups on the host, but not how to get them in the toolbox. I have to look into what the “good” way to do that is (and then whether it is really needed…)

BTW, ran some ComfyUI tests:

  • Default Flux.1 dev workflow:
    • On Spark: 34 seconds
    • On Strix Halo: 98 seconds
    • On 4090: 12 seconds

I need to find the time to see how to install and use/benchmark ComfyUI :wink:

The instructions included in the Spark documentation work for Strix Halo too (but you need to install PyTorch from TheRock): Try NVIDIA NIM APIs :slight_smile:
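I have not checked that playbook line by line, but the generic upstream ComfyUI install it presumably wraps is just this (with PyTorch for your GPU installed first):

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
python main.py   # UI comes up on http://127.0.0.1:8188 by default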

Thanks.

For others: you need to replace step 3 with

python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/

I wanted to try it on the iGPU of the FW16, but only ROCm has been built, not PyTorch (looks like a bug).

# may work later for the 780M
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/
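Once the install goes through, a quick check that PyTorch actually sees the GPU (the ROCm wheels expose it through the regular torch.cuda API):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"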

Finally installed Fedora 43 Server on the DGX Spark.

I had to compile the Linux kernel from the NVIDIA repository to add a proper NIC driver and enable some GPU optimizations. Model loading now takes the same time as on Strix Halo (as opposed to 5x slower), but generation performance stays the same as with the stock DGX OS. So, a big win.

However, it turned out there is something wrong with mmap on this platform. When using mmap, model loading slows down significantly: 1 minute 30 seconds to load gpt-oss-120b vs. 22 seconds without mmap, or 8 minutes 44 seconds for qwen3-next on vLLM with mmap vs. 1 minute 30 seconds without it.
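For clarity, the toggle on the llama.cpp side is the standard one; the timings above compare something like these two (model path shortened):

# default: load via mmap - the slow path on Spark
build/bin/llama-server -m gpt-oss-120b.gguf
# disable mmap - the fast path here (llama-bench equivalent: -mmp 0)
build/bin/llama-server -m gpt-oss-120b.gguf --no-mmap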

Decided to try MiniMax M2 on both. The only quant I could fit with any kind of usable context is Q3_K_XL from Unsloth - it allows running up to 64K context, but consumes pretty much all of the memory.

The difference in performance is even more noticeable than with gpt-oss-120b, although I have a feeling it could improve over time as ROCm support matures - especially the performance degradation at longer contexts.

build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q3_K_XL_MiniMax-M2-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

DGX Spark (stock OS):

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 | 892.63 ± 1.17 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 | 29.72 ± 0.04 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d4096 | 814.83 ± 1.29 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d4096 | 25.81 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d8192 | 750.01 ± 2.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d8192 | 21.98 ± 0.06 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d16384 | 639.73 ± 0.73 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d16384 | 17.69 ± 0.03 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d32768 | 436.44 ± 12.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d32768 | 12.54 ± 0.11 |

build: c4abcb245 (7053)

Strix Halo:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 | 286.33 ± 0.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 | 30.22 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d4096 | 229.23 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d4096 | 23.52 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d8192 | 190.70 ± 0.26 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d8192 | 19.18 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d16384 | 128.27 ± 0.52 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d16384 | 13.31 ± 0.02 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d32768 | 58.44 ± 0.33 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d32768 | 7.72 ± 0.01 |

build: 45c6ef730 (7058)

Wow!

My post has to be 5 characters so …

Wow!!

I’d say the difference in PP performance makes the Spark experience much more enjoyable, especially on long contexts. It’s also dead silent. An RTX Pro 6000 would be even better speed-wise, and is probably the better deal overall, but for my workloads, having slightly more VRAM is worth it. And so is the fact that I can keep it at home and run it 24/7 without making any noise :slight_smile:

I still have my Strix Halo, but that one is my personal machine. I think I’ll install Proxmox on it so it can join my cluster and be more useful during idle times.

Maybe a dumb question, but I’m running Omarchy (which Nirav seems to run to some extent, as he has commits in Omarchy), and I can’t get LM Studio to find the GPU on the Framework Desktop. Meanwhile, if I use Fedora + LM Studio, it works without any issue. I started a thread in this forum, but it’s been a week with 0 hits. So anyway, any chance you’ve used the FW Desktop with Arch?

With these reports, and the fantastic job you’ve done in this thread, Eugr … Thank You! … The incessant perma-delays Framework is putting its US mainboard customers through have me one frustrated urge away from clicking the Buy button on the product page for the ASUS Ascent GX10 and cancelling my order for the FW mainboard.

Experimenting with dual Sparks connected at 200 Gbps.

Running llama.cpp with the RPC backend. Not optimal, as it uses the regular TCP/IP stack instead of ultra-low-latency InfiniBand, but still. I wonder how this would change if llama.cpp supported NCCL - I’m able to get 1-2 microsecond (!) latencies measured by ib_send_lat vs. ~0.9 ms reported by ping. That’s up to a 900x difference!
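For context, the dual-node setup is just llama.cpp’s stock RPC pair: rpc-server on the second Spark plus the --rpc flag on the first (sketch from memory, check rpc-server --help for the exact options on your build):

# on the second Spark (192.168.177.12 here), listening on the 200 Gbps link
build/bin/rpc-server -H 0.0.0.0 -p 15001

The first Spark then just adds --rpc 192.168.177.12:15001 to llama-bench / llama-server, as in the command below.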

Anyway, Qwen3-235B at the Q4_K_XL quant fits on dual Sparks with full context and room to spare (after all, I was able to fit Q3_K_XL on a single Spark, but with 64K context, I believe). I could even go to Q6_K_XL.

build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-VL-235B-A22B-Instruct-GGUF_UD-Q4_K_XL_Qwen3-VL-235B-A22B-Instruct-UD-Q4_K_XL-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 | 545.20 ± 1.19 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 | 13.39 ± 0.08 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d4096 | 496.61 ± 0.44 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d4096 | 12.04 ± 0.05 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d8192 | 448.86 ± 1.04 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d8192 | 11.64 ± 0.11 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d16384 | 373.40 ± 0.44 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d16384 | 10.31 ± 0.06 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d32768 | 280.64 ± 0.57 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d32768 | 8.58 ± 0.04 |

build: 21d31e081 (7122)

MiniMax M2 Q4_K_XL on dual Sparks:

build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q4_K_XL_MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 | 906.42 ± 1.27 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 | 25.32 ± 0.28 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d4096 | 822.09 ± 4.14 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d4096 | 21.47 ± 0.16 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d8192 | 736.49 ± 6.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d8192 | 19.03 ± 0.12 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d16384 | 615.61 ± 5.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d16384 | 15.49 ± 0.22 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d32768 | 460.02 ± 5.47 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d32768 | 11.14 ± 0.07 |

build: 21d31e081 (7122)