Heh, turned out vLLM compiles and runs just fine without the NVIDIA-provided container. I just needed to set an environment variable specifying the arch.
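The post doesn't name the variable, but a build sketch might look like this (TORCH_CUDA_ARCH_LIST is the usual arch variable for CUDA builds of PyTorch/vLLM; the value for the Spark's GB10 is an assumption here):

```shell
# Hypothetical sketch: replace the value with your GPU's compute capability.
export TORCH_CUDA_ARCH_LIST="12.1"   # assumed value for GB10
pip install -e . --no-build-isolation
```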
In other news, there is some new activity in the amd-dev branch of the vLLM project, so hopefully some improvements are coming in the 0.11.1 release. But the amdsmi Python package is still crashing, so there is that.
What's strange is that the one from Fedora 42 works.
$ amd-smi version
AMDSMI Tool: 24.7.1+unknown | AMDSMI Library version: 24.7.1.0 | ROCm version: N/A
No, not this one: the amdsmi Python module, and only on cleanup. The amd-smi command-line tool works.
Look, there is something wrong with amd-smi on ROCm 7+:
- fedora 42 / rocm 6.3:
$ amd-smi list
GPU: 0
BDF: 0000:c1:00.0
UUID: 00ff1586-0000-1000-8000-000000000000
KFD_ID: 29672
NODE_ID: 1
PARTITION_ID: 0
- fedora 43 / rocm 6.4:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
BDF: 0000:c1:00.0
UUID: 00ff1586-0000-1000-8000-000000000000
KFD_ID: 29672
NODE_ID: 1
PARTITION_ID: 0
- fedora 44 / rocm 7.0:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
BDF: N/A
UUID: N/A
KFD_ID: 29672
NODE_ID: 1
PARTITION_ID: 0
Note: all done in a toolbox running on Silverblue 42 …
Strix Halo on Framework MB:
- FA: on
- mmap: off
- GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
- ngl: 999
- n_ubatch=4096
- backend: rocm
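A hedged reconstruction of the llama-bench invocation those settings imply (the model path and the exact -p sweep are placeholders, not taken from the post):

```shell
# FA on -> -fa 1; mmap off -> -mmp 0; unified memory via env var.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-bench \
  -m Mistral-Small-2506.gguf \
  -fa 1 -mmp 0 -ngl 999 -ub 4096 \
  -p 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -n 16
```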
| model | size | params | test | t/s |
|---|---|---|---|---|
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2 | 9.19 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3 | 11.23 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4 | 12.86 ± 0.01 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp8 | 25.37 ± 0.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp12 | 37.53 ± 0.06 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp16 | 49.17 ± 0.08 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp24 | 70.87 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp32 | 89.94 ± 0.45 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp48 | 122.01 ± 0.61 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp64 | 145.84 ± 0.60 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp96 | 207.52 ± 0.55 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp128 | 269.40 ± 0.95 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp192 | 229.28 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp256 | 291.95 ± 0.70 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp384 | 358.48 ± 0.89 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512 | 418.56 ± 0.65 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp768 | 401.40 ± 1.40 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1024 | 438.28 ± 1.35 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1536 | 439.35 ± 0.80 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2048 | 438.40 ± 1.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3072 | 432.32 ± 0.48 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4096 | 423.00 ± 0.47 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | tg16 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512+tg64 | 38.69 ± 0.01 |
The user in the toolbox is not a member of the required groups.
:~$ ll /dev/kfd
crw-rw-rw-. 1 root render 235, 0 25 oct. 11:21 /dev/kfd
:~$ ll /dev/dri/renderD128
crw-rw-rw-. 1 root render 226, 128 25 oct. 11:21 /dev/dri/renderD128
It is not needed on Fedora; the device nodes are rw for all users, so the check is “wrong”. I have never needed it on this OS. (It may be needed on Server / CoreOS releases?)
And ROCm works fine without it.
⬢ [zzzzzz@toolbx ~]$ getfacl /dev/dri/card1
getfacl: Removing leading '/' from absolute path names
# file: dev/dri/card1
# owner: nobody
# group: nobody
user::rw-
user:4294967295:rw-
group::rw-
mask::rw-
other::---
And there are user ACL rights too, on cardN.
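Given those permissions, a quick way to confirm the group warning is cosmetic is to check whether the device nodes are actually accessible (a sketch; device paths as shown above):

```python
import os

# Sketch: what matters for ROCm is whether the process can open the
# device nodes, not group membership per se. On this Fedora setup the
# nodes are world-rw (crw-rw-rw-), so the group check is only a warning.
for dev in ("/dev/kfd", "/dev/dri/renderD128"):
    ok = os.path.exists(dev) and os.access(dev, os.R_OK | os.W_OK)
    print(f"{dev}: {'accessible' if ok else 'not accessible'}")
```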
I tried to add the groups:
sudo usermod -a -G video,render $LOGNAME
But it did not work; it did not add the user to the groups.
Edit: I found how to add the user to these groups on the host, but not how to have them in the toolbox. I have to look into the “good” way to do that (and then whether it is really needed…)
BTW, ran some ComfyUI tests:
- Default Flux.1 dev workflow:
- On Spark: 34 seconds
- On Strix Halo: 98 seconds
- On 4090: 12 seconds
I need to find the time to see how to install and use/bench ComfyUI.
The instructions included in the Spark documentation work for Strix Halo too (but you need to install PyTorch from TheRock): Try NVIDIA NIM APIs
Thanks.
For other platforms, you need to replace step 3 with:
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/
I wanted to try it on the iGPU of the FW16, but only ROCm has been built, not PyTorch (looks like a bug).
# may work later for the 780M
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/
Finally, installed Fedora 43 Server on DGX Spark.
Had to compile the Linux kernel from the NVIDIA repository to add a proper NIC driver and enable some GPU optimizations. Model loading now takes the same time as on Strix Halo (as opposed to 5x slower), and generation performance stays the same as with the stock DGX OS. So, a big win.
However, it turned out there is something wrong with mmap on this platform. With mmap, model loading slows down significantly: 1 minute 30 seconds to load gpt-oss-120b vs. 22 seconds without mmap, or 8 minutes 44 seconds for qwen3-next on vLLM with mmap vs. 1 minute 30 seconds without it.
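For reference, this is how mmap is toggled off in llama.cpp (model path is a placeholder; both flags exist in current llama.cpp):

```shell
# llama-bench disables mmap with -mmp 0:
build/bin/llama-bench -m model.gguf -mmp 0
# llama-server / llama-cli use --no-mmap:
build/bin/llama-server -m model.gguf --no-mmap
```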
Decided to try MiniMax M2 on both. The only quant I could fit with any kind of usable context is Q3_K_XL from Unsloth: it allows running up to 64K context, but consumes pretty much all memory.
The difference in performance is even more noticeable than with gpt-oss-120b, although I have a feeling that it could improve over time as ROCm support matures, especially the performance degradation at longer contexts.
build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q3_K_XL_MiniMax-M2-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
DGX Spark (stock OS):
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 | 892.63 ± 1.17 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 | 29.72 ± 0.04 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d4096 | 814.83 ± 1.29 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d4096 | 25.81 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d8192 | 750.01 ± 2.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d8192 | 21.98 ± 0.06 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d16384 | 639.73 ± 0.73 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d16384 | 17.69 ± 0.03 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d32768 | 436.44 ± 12.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d32768 | 12.54 ± 0.11 |
build: c4abcb245 (7053)
Strix Halo:
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 | 286.33 ± 0.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 | 30.22 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d4096 | 229.23 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d4096 | 23.52 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d8192 | 190.70 ± 0.26 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d8192 | 19.18 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d16384 | 128.27 ± 0.52 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d16384 | 13.31 ± 0.02 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d32768 | 58.44 ± 0.33 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d32768 | 7.72 ± 0.01 |
build: 45c6ef730 (7058)
Wow!
My post has to be 5 characters so …
Wow!!
I’d say the difference in PP performance makes the Spark experience much more enjoyable, especially on long contexts. It’s also dead silent. An RTX Pro 6000 would be even better speed-wise, and is probably the better deal overall, but for my workloads the slightly larger VRAM is worth it. And there’s the fact that I can keep it at home and run it 24/7 without any noise.
I still have my Strix Halo, but that one is my personal one. I think I’ll install Proxmox, so it could join my cluster and be more useful during idle times.
Maybe a dumb question, but I’m running Omarchy (which Nirav seems to run to some extent, as he has commits in Omarchy), and I can’t get LM Studio to find the GPU on the Framework Desktop. Meanwhile, if I use Fedora + LM Studio, it works without any issue. I started a thread in this forum, but it’s been a week with 0 hits. So anyway, any chance you’ve used the FW Desktop with Arch?
With these reports, and the fantastic job you’ve done in this thread, Eugr … Thank You! … The incessant perma-delays Framework is putting its US mainboard customers through have me one frustrated urge away from clicking the Buy button on the product page for the ASUS Ascent GX10 and cancelling my order for the FW mainboard.
Experimenting with Dual Sparks connected at 200 Gbps.
Running llama.cpp with the RPC backend. Not optimal, as it uses the regular TCP/IP stack instead of ultra-low-latency InfiniBand, but still. I wonder how this would change if llama.cpp supported NCCL: I’m able to get 1-2 microsecond (!) latency measured by ib_send_lat vs. ~0.9 ms reported by ping. That’s a ~900x difference!
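The ~900x figure follows directly from the two quoted numbers:

```python
# Arithmetic behind the comparison: ~1 us (ib_send_lat over RDMA) vs.
# ~0.9 ms (ping over the TCP/IP stack), both as quoted above.
rdma_latency_s = 1e-6    # ~1 microsecond
tcp_latency_s = 0.9e-3   # ~0.9 milliseconds
ratio = tcp_latency_s / rdma_latency_s
print(f"TCP/IP latency is ~{ratio:.0f}x higher")  # ~900x
```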
Anyway, Qwen3-235B at Q4_K_XL quant fits on dual sparks with full context and room to spare (after all, I was able to fit Q3_K_XL on a single Spark, but with 64K context, I believe). I could go to Q6_K_XL.
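A back-of-the-envelope fit check (hedged: assumes 2x128 GB of unified memory across the two Sparks, and uses the quant size quoted in this thread):

```python
# Rough fit check for dual Sparks; 128 GB per unit is an assumption.
model_gib = 124.91             # Qwen3-235B UD-Q4_K_XL size from the table
total_gib = 2 * 128e9 / 2**30  # 256 GB expressed in GiB
headroom = total_gib - model_gib
print(f"total: {total_gib:.1f} GiB, headroom: {headroom:.1f} GiB")
```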
build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-VL-235B-A22B-Instruct-GGUF_UD-Q4_K_XL_Qwen3-VL-235B-A22B-Instruct-UD-Q4_K_XL-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 | 545.20 ± 1.19 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 | 13.39 ± 0.08 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d4096 | 496.61 ± 0.44 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d4096 | 12.04 ± 0.05 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d8192 | 448.86 ± 1.04 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d8192 | 11.64 ± 0.11 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d16384 | 373.40 ± 0.44 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d16384 | 10.31 ± 0.06 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d32768 | 280.64 ± 0.57 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d32768 | 8.58 ± 0.04 |
build: 21d31e081 (7122)
Minimax M2 Q4_K_XL on dual Sparks:
build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q4_K_XL_MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 | 906.42 ± 1.27 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 | 25.32 ± 0.28 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d4096 | 822.09 ± 4.14 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d4096 | 21.47 ± 0.16 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d8192 | 736.49 ± 6.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d8192 | 19.03 ± 0.12 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d16384 | 615.61 ± 5.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d16384 | 15.49 ± 0.22 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d32768 | 460.02 ± 5.47 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d32768 | 11.14 ± 0.07 |
build: 21d31e081 (7122)