DGX Spark vs. Strix Halo - Initial Impressions

Heh, it turned out vLLM compiles and runs just fine without the NVIDIA-provided container. I just needed to set an environment variable specifying the arch.
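A minimal sketch of the idea, assuming TORCH_CUDA_ARCH_LIST is the variable in question (it is the standard one vLLM's source build reads) and that the Spark's GB10 reports compute capability 12.1 - check yours first:

# Check the compute capability of the GPU:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Pin the build to that arch, then build from a vllm source checkout:
export TORCH_CUDA_ARCH_LIST="12.1"
pip install -v -e .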


In other news, there is some new activity in the amd-dev branch of the vLLM project, so hopefully some improvements are coming in the 0.11.1 release. But the amdsmi Python package is still crashing, so there is that.


What's strange is that the one from Fedora 42 works:

$ amd-smi version
AMDSMI Tool: 24.7.1+unknown | AMDSMI Library version: 24.7.1.0 | ROCm version: N/A

No, not that one: the amdsmi Python module, and only on cleanup. The amd-smi command-line tool works.
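A minimal repro sketch, assuming the crash sits in the library's shutdown path (init/query/shutdown below are the documented amdsmi Python API calls):

# Hypothetical minimal repro for the cleanup crash:
python3 - <<'EOF'
import amdsmi
amdsmi.amdsmi_init()                          # initialize the library
print(amdsmi.amdsmi_get_processor_handles())  # enumerate GPUs
amdsmi.amdsmi_shut_down()                     # the crash reportedly happens at cleanup
EOF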

Look, there is something wrong with amd-smi for ROCm 7+:

  • Fedora 42 / ROCm 6.3:
$ amd-smi list
GPU: 0
    BDF: 0000:c1:00.0
    UUID: 00ff1586-0000-1000-8000-000000000000
    KFD_ID: 29672
    NODE_ID: 1
    PARTITION_ID: 0
  • Fedora 43 / ROCm 6.4:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
    BDF: 0000:c1:00.0
    UUID: 00ff1586-0000-1000-8000-000000000000
    KFD_ID: 29672
    NODE_ID: 1
    PARTITION_ID: 0
  • Fedora 44 / ROCm 7.0:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
    BDF: N/A
    UUID: N/A
    KFD_ID: 29672
    NODE_ID: 1
    PARTITION_ID: 0

Note: all done in a toolbox running on Silverblue 42 …

Strix Halo on Framework MB (a matching llama-bench invocation is sketched after the table):

  • FA: on
  • mmap: off
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
  • ngl: 999
  • n_ubatch=4096
  • backend: rocm
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2 | 9.19 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3 | 11.23 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4 | 12.86 ± 0.01 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp8 | 25.37 ± 0.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp12 | 37.53 ± 0.06 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp16 | 49.17 ± 0.08 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp24 | 70.87 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp32 | 89.94 ± 0.45 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp48 | 122.01 ± 0.61 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp64 | 145.84 ± 0.60 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp96 | 207.52 ± 0.55 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp128 | 269.40 ± 0.95 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp192 | 229.28 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp256 | 291.95 ± 0.70 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp384 | 358.48 ± 0.89 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512 | 418.56 ± 0.65 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp768 | 401.40 ± 1.40 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1024 | 438.28 ± 1.35 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1536 | 439.35 ± 0.80 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2048 | 438.40 ± 1.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3072 | 432.32 ± 0.48 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4096 | 423.00 ± 0.47 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | tg16 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512+tg64 | 38.69 ± 0.01 |
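A llama-bench invocation approximating the settings above, as a sketch (the model filename is a placeholder and the batch-size list is abbreviated; GGML_CUDA_ENABLE_UNIFIED_MEMORY also applies to the ROCm/HIP build of llama.cpp):

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-bench \
    -m Mistral-Small-2506.gguf -fa 1 -mmp 0 -ngl 999 -ub 4096 \
    -p 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -n 16 -pg 512,64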

The user in the toolbox is not a member of the required groups.

:~$ ll /dev/kfd
crw-rw-rw-. 1 root render 235, 0 25 oct.  11:21 /dev/kfd
:~$ ll /dev/dri/renderD128 
crw-rw-rw-. 1 root render 226, 128 25 oct.  11:21 /dev/dri/renderD128

It is not needed with Fedora; there is rw for all users, so the check is “wrong”. I never needed it on this OS (maybe it is needed on the Server / CoreOS releases?), and ROCm works fine without it.

⬢ [zzzzzz@toolbx ~]$ getfacl /dev/dri/card1 
getfacl: Removing leading '/' from absolute path names
# file: dev/dri/card1
# owner: nobody
# group: nobody
user::rw-
user:4294967295:rw-
group::rw-
mask::rw-
other::---

and there are user ACL rights too, on cardN.

I tried to add the groups:

sudo usermod -a -G video,render $LOGNAME

But it did not work; it did not add the user to the groups.

Edit: I found how to add the user to these groups on the host, but not how to get them in the toolbox. I have to look into what the “good” way is for that (and then whether it is really needed…)
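One approach that should work, as an untested sketch: a toolbox container has its own /etc/group, so the change has to be made inside the container, and group changes only apply to new sessions:

# Inside the toolbox -- its /etc/group is separate from the host's:
sudo usermod -aG render,video "$USER"
exit            # leave the container...
toolbox enter   # ...and re-enter so the new groups apply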

BTW, I ran some ComfyUI tests:

  • Default Flux.1 dev workflow:
    • On Spark: 34 seconds
    • On Strix Halo: 98 seconds
    • On 4090: 12 seconds

I need to find the time to see how to install and use/bench ComfyUI :wink:

The instructions included in the Spark documentation work for Strix Halo too (but you need to install PyTorch from TheRock): Try NVIDIA NIM APIs :slight_smile:


Thanks.

For others: you need to replace step 3 with

python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/

I wanted to try it on the iGPU of the FW16, but only ROCm has been built, not PyTorch (looks like a bug):

# may work later for the 780M
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/
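Putting the gfx1151 path together, a rough end-to-end sketch (untested; the clone URL is ComfyUI's upstream repository, and the index URL is the one from above):

# Rough install sketch for ComfyUI on Strix Halo (gfx1151):
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
# TheRock's ROCm nightly wheels replace the documented PyTorch step:
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/
python -m pip install -r requirements.txt
python main.py   # the web UI listens on http://127.0.0.1:8188 by default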

Finally, I installed Fedora 43 Server on the DGX Spark.

I had to compile the Linux kernel from NVIDIA's repository to add a proper NIC driver and enable some GPU optimizations. Model loading now takes the same time as on Strix Halo (as opposed to 5x slower), while generation performance stays the same as with the stock DGX OS. So, a big win.
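For anyone attempting the same, the generic Fedora flow looks roughly like this, as a sketch (<nvidia-kernel-repo> is a placeholder, and the NVIDIA-specific config changes are not shown):

# Generic out-of-distro kernel build on Fedora; the repo URL is a placeholder:
sudo dnf install -y gcc make flex bison bc openssl-devel elfutils-libelf-devel dwarves
git clone <nvidia-kernel-repo> linux && cd linux
cp "/boot/config-$(uname -r)" .config   # start from the running kernel's config
make olddefconfig                       # fill in new options with defaults
make -j"$(nproc)"
sudo make modules_install install       # installs the kernel and updates the bootloader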

However, it turned out there is something wrong with mmap on this platform. When using mmap, model loading slows down significantly: 1 minute 30 seconds to load gpt-oss-120b vs. 22 seconds without mmap, or 8 minutes 44 seconds for qwen3-next on vLLM with mmap vs. 1 minute 30 seconds without it.
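For reference, mmap can be turned off explicitly in the llama.cpp tools (both flags exist upstream; vLLM's equivalent knob is not shown here):

llama-server -m model.gguf --no-mmap   # llama-cli accepts the same flag
llama-bench -m model.gguf -mmp 0       # llama-bench's spelling, used below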


I decided to try MiniMax M2 on both. The only quant I could fit with any kind of usable context is Q3_K_XL from Unsloth; it allows running up to 64K context, but consumes pretty much all memory.
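A hypothetical serving command matching those constraints (the model path is the one cached by llama.cpp, as in the benchmark command below; 64K context, with mmap off per my previous post):

llama-server --no-mmap -c 65536 -ngl 999 \
    -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q3_K_XL_MiniMax-M2-UD-Q3_K_XL-00001-of-00003.gguf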

The difference in performance is even more noticeable than with gpt-oss-120b, although I have a feeling it could improve over time as ROCm support matures, especially the performance degradation at depth.

build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q3_K_XL_MiniMax-M2-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

DGX Spark (stock OS):

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 | 892.63 ± 1.17 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 | 29.72 ± 0.04 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d4096 | 814.83 ± 1.29 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d4096 | 25.81 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d8192 | 750.01 ± 2.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d8192 | 21.98 ± 0.06 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d16384 | 639.73 ± 0.73 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d16384 | 17.69 ± 0.03 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d32768 | 436.44 ± 12.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d32768 | 12.54 ± 0.11 |

build: c4abcb245 (7053)

Strix Halo:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 | 286.33 ± 0.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 | 30.22 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d4096 | 229.23 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d4096 | 23.52 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d8192 | 190.70 ± 0.26 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d8192 | 19.18 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d16384 | 128.27 ± 0.52 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d16384 | 13.31 ± 0.02 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d32768 | 58.44 ± 0.33 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d32768 | 7.72 ± 0.01 |

build: 45c6ef730 (7058)


Wow!

My post has to be 5 characters so …

Wow!!

I’d say the difference in PP performance makes the Spark experience much more enjoyable, especially on long contexts. It’s also dead silent. An RTX Pro 6000 would be even better speed-wise, and is probably the better deal overall, but for my workloads the slightly larger VRAM is worth it, as is the fact that I can keep it at home running 24/7 without making any noise :slight_smile:

I still have my Strix Halo, but that one is my personal machine. I think I’ll install Proxmox on it, so it can join my cluster and be more useful during idle time.

Maybe a dumb question, but I’m running Omarchy (which Nirav seems to run to some extent, as he has commits in Omarchy), and I can’t get LM Studio to find the GPU on the Framework Desktop. Meanwhile, if I use Fedora + LM Studio, it works without any issue. I started a thread in this forum, but it’s been a week with 0 hits. So anyway: any chance you’ve used the FW Desktop with Arch?