Heh, turned out vLLM compiles and runs just fine without the NVIDIA-provided container. I just needed to set an environment variable specifying the arch.
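For anyone trying to reproduce this: the post doesn't name the variable, but the usual knob for pinning the arch when compiling PyTorch CUDA extensions such as vLLM is TORCH_CUDA_ARCH_LIST, so a sketch could look like the following (both the variable name and the 12.1 value for the Spark's GB10 are my assumptions, not from the post):
# assumption: verify the compute capability first with
# nvidia-smi --query-gpu=compute_cap --format=csv
export TORCH_CUDA_ARCH_LIST="12.1"
pip install -e .  # from a vllm source checkout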
In other news, there is some new activity in the amd-dev branch of the vLLM project, so hopefully some improvements are coming in the 0.11.1 release. But the amdsmi Python package is still crashing, so there is that.
What's strange is that the one from Fedora 42 works.
$ amd-smi version
AMDSMI Tool: 24.7.1+unknown | AMDSMI Library version: 24.7.1.0 | ROCm version: N/A
No, not that one: the amdsmi Python module, and only on cleanup. The amd-smi command-line tool works.
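If anyone wants to check whether their build hits the same cleanup crash, a minimal sketch (assuming the standard amdsmi entry points) is:
# init then shut down immediately; the crash described above happens on cleanup
python3 -c "from amdsmi import amdsmi_init, amdsmi_shut_down; amdsmi_init(); amdsmi_shut_down()"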
Look, there is something wrong with amd-smi for ROCm 7+:
- Fedora 42 / ROCm 6.3:
$ amd-smi list
GPU: 0
BDF: 0000:c1:00.0
UUID: 00ff1586-0000-1000-8000-000000000000
KFD_ID: 29672
NODE_ID: 1
PARTITION_ID: 0
- Fedora 43 / ROCm 6.4:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
BDF: 0000:c1:00.0
UUID: 00ff1586-0000-1000-8000-000000000000
KFD_ID: 29672
NODE_ID: 1
PARTITION_ID: 0
- Fedora 44 / ROCm 7.0:
$ amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
BDF: N/A
UUID: N/A
KFD_ID: 29672
NODE_ID: 1
PARTITION_ID: 0
Note: all done in a toolbox running on Silverblue 42 …
Strix Halo on Framework MB:
- FA: on
- mmap: off
- GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
- ngl: 999
- n_ubatch=4096
- backend: rocm
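For reference, those settings correspond roughly to a llama-bench invocation like the one below; the model path and the exact -p/-pg sweep are my reconstruction from the table, not from the original post:
# sketch only: reconstructed from the settings listed above
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-bench \
  -m Mistral-Small-2506.gguf -fa 1 -mmp 0 -ngl 999 -ub 4096 \
  -p 1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096 \
  -n 16 -pg 512,64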
| model | size | params | test | t/s |
|---|---|---|---|---|
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2 | 9.19 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3 | 11.23 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4 | 12.86 ± 0.01 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp8 | 25.37 ± 0.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp12 | 37.53 ± 0.06 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp16 | 49.17 ± 0.08 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp24 | 70.87 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp32 | 89.94 ± 0.45 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp48 | 122.01 ± 0.61 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp64 | 145.84 ± 0.60 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp96 | 207.52 ± 0.55 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp128 | 269.40 ± 0.95 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp192 | 229.28 ± 0.15 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp256 | 291.95 ± 0.70 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp384 | 358.48 ± 0.89 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512 | 418.56 ± 0.65 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp768 | 401.40 ± 1.40 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1024 | 438.28 ± 1.35 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp1536 | 439.35 ± 0.80 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp2048 | 438.40 ± 1.04 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp3072 | 432.32 ± 0.48 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp4096 | 423.00 ± 0.47 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | tg16 | 4.69 ± 0.00 |
| Mistral-Small-2506 | 43.91 GiB | 23.57 B | pp512+tg64 | 38.69 ± 0.01 |
The user inside the toolbox is not a member of the required groups.
:~$ ll /dev/kfd
crw-rw-rw-. 1 root render 235, 0 Oct 25 11:21 /dev/kfd
:~$ ll /dev/dri/renderD128
crw-rw-rw-. 1 root render 226, 128 Oct 25 11:21 /dev/dri/renderD128
It is not needed on Fedora; the devices are rw for all users, so the check is “wrong”. I never needed it on this OS. (Maybe it is needed on the Server / CoreOS releases?)
And ROCm works fine without it.
⬢ [zzzzzz@toolbx ~]$ getfacl /dev/dri/card1
getfacl: Removing leading '/' from absolute path names
# file: dev/dri/card1
# owner: nobody
# group: nobody
user::rw-
user:4294967295:rw-
group::rw-
mask::rw-
other::---
And there is a user ACL entry too, on cardN.
I tried to add the groups:
sudo usermod -a -G video,render $LOGNAME
But it did not work; it did not add the user to the groups.
Edit: I found how to add the user to these groups on the host, but not how to get them inside the toolbox. I have to look into what the “good” way to do that is (and then whether it is really needed…)
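One possible explanation (my assumption, not from the thread): usermod only affects new login sessions, so neither the host session nor a running toolbox will see the new groups until you log out and back in (and possibly recreate the toolbox). To check where things stand:
# did the groups actually get added on the host?
getent group render video
# what does the current session really have?
id -nG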
BTW, ran some ComfyUI tests:
- Default Flux.1 dev workflow:
- On Spark: 34 seconds
- On Strix Halo: 98 seconds
- On 4090: 12 seconds
I need to find the time to see how to install and use/benchmark ComfyUI.
The instructions included in the Spark documentation work for Strix Halo too (but you need to install PyTorch from TheRock): Try NVIDIA NIM APIs
Thanks.
For the rest, you need to replace step 3 with:
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/
I wanted to try it on the iGPU of the FW16, but only ROCm has been built, not PyTorch (looks like a bug):
# may work later for the 780M
python -m pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/
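Once one of those wheels installs, a quick sanity check that the ROCm build actually sees the GPU (just a sketch, assuming a standard PyTorch ROCm wheel):
# prints the HIP version, whether a device is visible, and its name
python -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"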
Finally, I installed Fedora 43 Server on the DGX Spark.
I had to compile the Linux kernel from NVIDIA's repository to add a proper NIC driver and enable some GPU optimizations. Model loading now takes the same time as on Strix Halo (as opposed to 5× slower), but generation performance stays the same as with the stock DGX OS. So, a big win.
However, it turned out there is something wrong with mmap on this platform. With mmap, model loading slows down significantly: 1 minute 30 seconds to load gpt-oss-120b vs. 22 seconds without mmap, or 8 minutes 44 seconds for qwen3-next on vLLM with mmap vs. 1 minute 30 seconds without it.
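For llama.cpp the workaround is simply to keep mmap off; assuming the standard flags, that looks like:
# llama-bench: -mmp 0 disables mmap (as in the benchmark command below)
# llama-server / llama-cli: pass --no-mmap (model path is a placeholder)
llama-server -m gpt-oss-120b.gguf --no-mmap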
Decided to try MiniMax M2 on both. The only quant I could fit with any kind of usable context is Q3_K_XL from Unsloth: it allows running up to a 64K context, but consumes pretty much all the memory.
The difference in performance is even more noticeable than with gpt-oss-120b, although I have a feeling it could improve over time as ROCm support matures, especially the performance degradation at longer depths.
build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_MiniMax-M2-GGUF_UD-Q3_K_XL_MiniMax-M2-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
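(For reference: -fa 1 enables flash attention, -d sets the context depths to test at, -p the prompt length, -n the number of generated tokens, -ub the micro-batch size, and -mmp 0 disables mmap.)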
DGX Spark (stock OS):
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 | 892.63 ± 1.17 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 | 29.72 ± 0.04 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d4096 | 814.83 ± 1.29 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d4096 | 25.81 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d8192 | 750.01 ± 2.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d8192 | 21.98 ± 0.06 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d16384 | 639.73 ± 0.73 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d16384 | 17.69 ± 0.03 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | pp2048 @ d32768 | 436.44 ± 12.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | CUDA | tg32 @ d32768 | 12.54 ± 0.11 |
build: c4abcb245 (7053)
Strix Halo:
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 | 286.33 ± 0.49 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 | 30.22 ± 0.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d4096 | 229.23 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d4096 | 23.52 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d8192 | 190.70 ± 0.26 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d8192 | 19.18 ± 0.01 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d16384 | 128.27 ± 0.52 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d16384 | 13.31 ± 0.02 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | pp2048 @ d32768 | 58.44 ± 0.33 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.48 GiB | 228.69 B | ROCm | tg32 @ d32768 | 7.72 ± 0.01 |
build: 45c6ef730 (7058)
Wow!
My post has to be 5 characters so …
Wow!!
I’d say the difference in PP performance makes the Spark experience much more enjoyable, especially with long contexts. It’s also dead silent. An RTX Pro 6000 would be even better speed-wise, and is probably the better deal overall, but for my workloads the slightly larger memory is worth it. And I can keep it at home and run it 24/7 without making any noise.
I still have my Strix Halo, but that one is my personal machine. I think I’ll install Proxmox so it can join my cluster and be more useful during idle times.
Maybe a dumb question, but I’m running Omarchy (which Nirav seems to run to some extent, as he has commits in Omarchy), and I can’t get LM Studio to find the GPU on the Framework Desktop. Meanwhile, with Fedora + LM Studio it works without any issue. I started a thread in this forum, but it’s been a week with 0 hits. So anyway: any chance you’ve used the FW Desktop with Arch?