DGX Spark vs. Strix Halo - Initial Impressions

Just wanted to share my initial impressions after using both a Strix Halo machine (GMKTek Evo X2, 128GB) and the NVidia DGX Spark as an AI developer.

I posted some of my experience with both platforms in the Batch 13 thread, but decided to make a separate post.

Hardware

DGX Spark is probably the most minimalist mini-PC I’ve ever used.

It has absolutely no LEDs, not even in the LAN port, and the power switch is just a button, so unless you ping it over the network or hook up a display, good luck guessing whether this thing is on.
All ports are on the back: there is no DisplayPort, only a single HDMI port, a USB-C port (power only), 3x USB-C 3.2 Gen 2 ports, a 10G Ethernet port, and 2x QSFP ports.

The air intake is in the front and exhaust is in the back. It is quiet for the most part, but the fan is quite audible when it’s on (but quieter than my GMKTek).

It has a single 4TB PCIe 5.0 x4 M.2 2242 SSD (SAMSUNG MZALC4T0HBL1-00B07), which I couldn’t find for sale anywhere in the 2242 form factor, only the 2280 version, but DGX Spark only takes 2242 drives. I wish they had gone with the standard 2280 - a weird decision, given that it’s a mini-PC, not a laptop or tablet. Who cares if the motherboard is an inch longer!

SSD performance seems good, giving me 4240.64 MB/sec vs. 3118.53 MB/sec on my GMKTek (as measured by hdparm).
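(For reference, these are sequential read numbers from something like the following; the full invocation appears later in the thread:)

sudo hdparm -Ttv --direct /dev/nvme0n1    # O_DIRECT cached and disk read timings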

The SSD is user-replaceable, but there is only one slot, accessible from the bottom of the device: you need to take the magnetic plate off, and there are access screws underneath.

The unit is made of metal and gets quite hot under high load, but not unbearably hot like some reviews mentioned. It cools down quickly, though (metal!).

The CPU is a 20-core ARM part with 10 performance and 10 efficiency cores. I didn’t benchmark it, but other reviews show CPU performance similar to Strix Halo.

Initial Setup

DGX Spark comes with DGX OS pre-installed (more on this later). You can set it up interactively using a keyboard/mouse/display, or in headless mode via a WiFi hotspot that it creates.

I tried to set it up by connecting my trusty Logitech keyboard/trackpad combo that I use to set up pretty much all my server boxes, but once it booted up, it displayed a “Connect the keyboard” message and didn’t let me proceed any further. The trackpad portion worked, and the volume keys on the keyboard worked too! I rebooted and was able to enter the BIOS (by pressing Esc) just fine, and the keyboard was fully functional there!

BTW, it has an AMI BIOS, but it doesn’t expose anything interesting other than networking and boot options.

Booting into DGX OS resulted in the same problem. After some googling, I figured out that it shipped with a borked kernel that broke Logitech Unifying setups, so I decided to proceed in headless mode.

I connected to the WiFi hotspot from my Mac (the hotspot SSID/password are printed on a sticker on top of the quick start guide) and was able to continue the setup there, which was pretty smooth, other than the Mac spamming me with a “connect to internet” popup every minute or so. The device then proceeded to update the firmware and OS packages, which took about 30 minutes but eventually finished, and after that my Logitech keyboard worked just fine.

Linux Experience

DGX Spark runs DGX OS 7.2.3, which is based on Ubuntu 24.04.3 LTS but uses NVidia’s custom kernel, an older one than mainline Ubuntu LTS ships.
So instead of 6.14.x you get 6.11.0-1016-nvidia.

It comes with the CUDA 13.0 development kit and NVidia drivers (580.95.05) pre-installed.
It also has NVidia’s container toolkit with Docker, and GPU passthrough works well.
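A quick sanity check of the GPU passthrough (a minimal sketch - the image choice is arbitrary, since the container toolkit injects the driver bits and nvidia-smi into the container):

docker run --rm --gpus all ubuntu nvidia-smi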

Other than that, it’s a standard Ubuntu Desktop installation, with GNOME and everything.

SSHd is enabled by default, so after headless install you can connect to it immediately without any extra configuration.

RDP remote desktop doesn’t work currently - it connects, but the display output is broken.

I tried to boot from a Fedora 43 Beta Live USB, and it worked, sort of. First, you need to disable Secure Boot in the BIOS. Then it boots only in “basic graphics mode”, because the built-in nvidia drivers don’t recognize the chipset. It also throws other errors complaining about the chipset, processor cores, etc.

I think I’ll try to install it to an external SSD and see if NVidia’s standard drivers recognize the chip. There is hope:

==============

PLATFORM INFO:

IOMMU: Pass-through or enabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 13000
Platform: NVIDIA_DGX_Spark, Arch: aarch64(Linux 6.11.0-1016-nvidia)
Platform verification succeeded

As for Strix Halo, it’s an x86 PC, so you can run any distro you want. I chose Fedora 43 Beta, currently running with kernel 6.17.3-300.fc43.x86_64.
Smooth sailing, up-to-date packages.

Llama.cpp Experience

DGX Spark

You need to build it from source, as there is no CUDA ARM build, but compiling llama.cpp was very straightforward: the CUDA toolkit is already installed, so you just need to install the development tools, and it compiles just like on any other system with an NVidia GPU. Just follow the instructions, no surprises.
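For reference, the whole thing boils down to the standard CUDA build (a sketch; the exact dev-tools package set may vary):

sudo apt install -y build-essential cmake
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j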

However, when I ran the benchmarks, I ran into two issues.

  1. The model loading was VERY slow. It took 1 minute 40 seconds to load gpt-oss-120b. For comparison, it takes 22 seconds to load on Strix Halo (both from cold, memory cache flushed).
  2. I wasn’t getting the same results as ggerganov in this thread. While PP was pretty impressive for such a small system, TG merely matched or was even slightly worse than my Strix Halo setup with ROCm.

For instance, here are my Strix Halo numbers, compiled with ROCm 7.10.0a20251017, llama.cpp build 03792ad9 (6816), HIP only, no rocWMMA:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 999.59 ± 4.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 47.49 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 824.37 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 44.23 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 703.42 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 42.52 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 514.89 ± 3.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 39.71 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 348.59 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 35.39 ± 0.01 |

The same command on Spark gave me this:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1816.00 ± 11.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 44.74 ± 0.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1763.75 ± 6.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 42.69 ± 0.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1695.29 ± 11.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 40.91 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1512.65 ± 6.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 38.61 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1250.55 ± 5.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 34.66 ± 0.02 |

I tried enabling the Unified Memory switch (GGML_CUDA_ENABLE_UNIFIED_MEMORY=1) - it improved model loading, but resulted in even worse performance.

I reached out to ggerganov, and he suggested disabling mmap. I thought I had tried that, but apparently not.
Well, that fixed it. Model loading improved too - it now takes 56 seconds from cold and 23 seconds when the model is still in the cache.
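The only change to the benchmark command was turning mmap off:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0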

Updated numbers:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1939.32 ± 4.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 56.33 ± 0.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1832.04 ± 5.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 52.63 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1738.07 ± 5.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 48.60 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1525.71 ± 12.34 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 45.01 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1242.35 ± 5.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 39.10 ± 0.09 |

As you can see, much better performance both in PP and TG.

As for Strix Halo, mmap/no-mmap doesn’t make any difference there.

Strix Halo

On Strix Halo, the llama.cpp experience is… well, a bit turbulent.

You can download a pre-built version for Vulkan, and it works, but the performance is a mixed bag. TG is pretty good, but PP is not great.

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 --mmap 0 -ngl 999 -ub 1024

NOTE: Vulkan likes a batch size of 1024 the most, unlike ROCm, which prefers 2048.

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 | 526.54 ± 4.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 | 52.64 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d4096 | 438.85 ± 0.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d4096 | 48.21 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d8192 | 356.28 ± 4.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d8192 | 45.90 ± 0.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d16384 | 210.17 ± 2.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d16384 | 42.64 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d32768 | 138.79 ± 9.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d32768 | 36.18 ± 0.02 |

I tried the toolboxes from kyuz0, and some of them were better, but I still felt that I could squeeze more juice out of it. All of them suffered from significant performance degradation as the context filled up.

Then I tried to compile my own using the latest ROCm build from TheRock (as of that date).

I also built rocWMMA, as recommended by kyuz0 (more on that later).

Llama.cpp compiled without major issues - I had to configure the paths properly, but other than that, it just worked.
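Roughly, the build looked like this (a sketch, assuming TheRock’s ROCm is installed under /opt/rocm; gfx1151 is the Strix Halo iGPU, and GGML_HIP_ROCWMMA_FATTN is the rocWMMA option I ended up turning off later):

HIPCXX=/opt/rocm/llvm/bin/clang++ cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j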
The PP increased dramatically, but TG decreased.

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 | 1030.71 ± 2.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 | 47.84 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d4096 | 802.36 ± 6.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d4096 | 39.09 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d8192 | 615.27 ± 2.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d8192 | 33.34 ± 0.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d16384 | 409.25 ± 0.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d16384 | 25.86 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d32768 | 228.04 ± 0.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d32768 | 18.07 ± 0.03 |

But the biggest issue is significant performance degradation with long context, much more than you’d expect.

Then I stumbled upon the Lemonade SDK and their pre-built llama.cpp. I ran that one and got much better results across the board. TG was still below Vulkan, but PP was decent, and degradation wasn’t as bad:

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 999.20 ± 3.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 47.53 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 826.63 ± 9.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 44.24 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 702.66 ± 2.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 42.56 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 505.85 ± 1.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 39.82 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 343.06 ± 2.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 35.50 ± 0.02 |

So I looked at their compilation options and noticed that they build without rocWMMA. I did the same and got similar performance too!

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1000.93 ± 1.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 47.46 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 827.34 ± 1.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 44.20 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 701.68 ± 2.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 42.39 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 503.49 ± 0.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 39.61 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 344.36 ± 0.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 35.32 ± 0.01 |

So far that’s the best I could get from Strix Halo. It’s very usable for text generation tasks.

Also, I wanted to touch on multi-modal performance. That’s where Spark shines. I don’t have any specific benchmarks yet, but image processing is much faster on Spark than on Strix Halo, especially in vLLM.

vLLM Experience

Haven’t had a chance to do extensive testing here, but wanted to share some early thoughts.

DGX Spark

First, I tried to just build vLLM from source as usual. The build itself succeeded, but it then failed with the following error: ptxas fatal : Value ‘sm_121a’ is not defined for option ‘gpu-name’

I decided not to spend too much time on this for now, and just launched the vLLM container that NVidia provides through their Docker registry.
It is built for DGX Spark, so it supports it out of the box.
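Roughly (a sketch - the registry path and tag below are placeholders for whatever NVidia’s catalog currently lists):

docker run --rm -it --gpus all -p 8000:8000 nvcr.io/nvidia/vllm:<tag> \
    vllm serve openai/gpt-oss-120b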

However, it has version 0.10.1, so I wasn’t able to run Qwen3-VL there.

Now, they put the source code inside the container, but it wasn’t a git repository - it probably contains some NVidia-specific patches. I’ll need to see if those could be merged into the main vLLM code.

So I just checked out the vLLM main branch and proceeded to build it against the existing PyTorch as usual. This time I was able to run it and launch Qwen3-VL models just fine.
Both dense and MoE models work. I tried FP4 and AWQ quants - everything works, no need to disable CUDA graphs.
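The “build against existing PyTorch” path is essentially what the vLLM docs describe; roughly (a sketch):

git clone https://github.com/vllm-project/vllm
cd vllm
python use_existing_torch.py              # drop pinned torch requirements, reuse the pre-installed torch
pip install -r requirements/build.txt
pip install -e . --no-build-isolation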

The performance is decent - I still need to run some benchmarks, but image processing is very fast.

Strix Halo

Unlike llama.cpp, which just works, the vLLM experience on Strix Halo is much more limited.

My goal was to run Qwen3-VL models that are not supported by llama.cpp yet, so I needed to build 0.11.0 or later. There are some existing containers/toolboxes for earlier versions, but I couldn’t use them.

So I installed the ROCm PyTorch libraries from TheRock, applied some patches from the kyuz0 toolboxes to avoid an amdsmi package crash, built ROCm FlashAttention, and then just followed vLLM’s standard installation instructions with existing PyTorch.

I was able to run Qwen3-VL dense models at decent (for dense models) speeds, although initialization takes quite some time unless you reduce --max-num-seqs to 1 and set tensor parallelism to 1.
Image processing is very slow though, much slower than llama.cpp for the same image, but token generation is about what you’d expect from it.
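For reference, the launch ends up looking something like this (a sketch; the model ID and context length are just examples):

vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --max-num-seqs 1 \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --max-model-len 32768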

Again, model loading is faster than on Spark for some reason (I’d expect it to be the other way around, given Spark’s faster SSD and slightly faster memory).

I’m going to rebuild vLLM and re-test/benchmark later.

Some observations:

  • FP8 models don’t work - they hang on the following warning: WARNING 10-22 12:55:04 [fp8_utils.py:785] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/eugr/vllm/vllm/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=2560,device_name=Radeon_8060S_Graphics,dtype=fp8_w8a8,block_shape=[128,128].json
  • You need to use --enforce-eager, as CUDA graphs crash vLLM. Sometimes it works, but mostly it crashes.
  • Even with --enforce-eager, there are occasional HIP-related crashes here and there.
  • AWQ models work, both 4-bit and 8-bit, but only dense ones. AWQ MoE quants require the Marlin kernel, which is not available for ROCm.

Conclusion / TL;DR

Summary of my initial impressions:

  • DGX Spark is an interesting beast for sure.
    • Limited extensibility - no USB4, only one M.2 slot, and it’s 2242.
    • But it has a 200 Gbps network interface.
  • It’s the first generation of such devices, so there are some annoying bugs and incompatibilities.
  • Inference-wise, token generation is nearly identical to Strix Halo in both llama.cpp and vLLM, but prompt processing is 2-5x faster than on Strix Halo.
    • Strix Halo performance in prompt processing degrades much faster with context.
    • Image processing takes much longer on Strix Halo, especially with vLLM.
    • Model loading into unified RAM is slower on DGX Spark for some reason, both in llama.cpp and vLLM.
  • Even though vLLM includes gfx1151 in its supported configurations, it still requires some hacks to compile.
    • And even then, the experience is suboptimal: initialization is slow, it crashes, FP8 doesn’t work, and AWQ for MoE doesn’t work.
  • If you are an AI developer who uses transformers/PyTorch, or you need vLLM, you are better off with DGX Spark (or just a normal GPU build).
  • If you want a power-efficient inference server that can run gpt-oss and similar MOE at decent speeds, and don’t need to process images often, Strix Halo is the way to go.
  • If you want a general purpose machine, Strix Halo wins too.

Curious to have some benchmarks with this CPU.

With the Framework, I have:

./llama-bench \
     -ctk bf16 -ctv bf16 \   (or -ctk q8_0 -ctv q8_0)
     -ub 4096 -fa "0,1" \
     -p "1,2,4,8,16,32,64,128,256,512,1024,2048,4096" -n 32 \
     -m ./Mistral-Small-2506-BF16.gguf  ( or openai_gpt-oss-120b-MXFP4.gguf, Mistral-Small-2506-Q6_K.gguf)
  • for llama.cpp
- backend: CPU
- threads: 16
- n_ubatch: 4096
- type_vk: bf16
| model | size | params | fa | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp1 | 27.89 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp2 | 32.43 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp4 | 48.68 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp8 | 64.52 ± 0.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp16 | 77.86 ± 1.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp32 | 89.48 ± 0.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp64 | 100.27 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp128 | 103.49 ± 0.82 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp256 | 107.07 ± 0.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp512 | 108.83 ± 0.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp1024 | 105.36 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp2048 | 96.97 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp4096 | 93.15 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | tg32 | 27.91 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp1 | 28.68 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp2 | 39.44 ± 0.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp4 | 54.11 ± 1.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp8 | 74.42 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp16 | 88.83 ± 3.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp32 | 95.95 ± 2.37 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp64 | 102.33 ± 0.98 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp128 | 107.60 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp256 | 110.22 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp512 | 107.88 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp1024 | 104.05 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp2048 | 92.59 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp4096 | 71.20 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | tg32 | 28.62 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp1 | 2.53 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp2 | 4.76 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp4 | 9.43 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp8 | 18.33 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp16 | 34.80 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp32 | 60.04 ± 0.04 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp64 | 87.12 ± 0.99 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp128 | 90.08 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp256 | 89.81 ± 0.05 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp512 | 96.45 ± 0.11 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp1024 | 93.82 ± 0.04 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp2048 | 89.46 ± 0.17 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp4096 | 83.31 ± 1.16 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | tg32 | 2.53 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp1 | 2.54 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp2 | 4.83 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp4 | 9.58 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp8 | 18.70 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp16 | 35.99 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp32 | 63.10 ± 0.02 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp64 | 89.64 ± 0.32 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp128 | 89.56 ± 0.08 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp256 | 87.10 ± 0.78 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp512 | 72.53 ± 1.16 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp1024 | 58.33 ± 0.33 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp2048 | 43.48 ± 0.32 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp4096 | 28.45 ± 0.25 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | tg32 | 2.55 ± 0.00 |

  • for ik_llama.cpp, it is much faster with FA and with quantized models:
- backend: CPU
- threads: 16
- n_ubatch: 4096

| model | size | params | kv | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp1 | 28.52 ± 0.02 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp2 | 41.77 ± 1.47 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp4 | 63.09 ± 4.71 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp8 | 92.49 ± 4.12 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp16 | 116.18 ± 4.06 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp32 | 149.21 ± 4.35 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp64 | 216.87 ± 5.22 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp128 | 278.34 ± 6.63 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp256 | 320.98 ± 1.92 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp512 | 332.53 ± 2.02 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp1024 | 306.91 ± 1.25 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp2048 | 228.94 ± 3.83 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp4096 | 200.31 ± 2.29 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | tg32 | 28.58 ± 0.03 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp1 | 28.67 ± 0.01 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp2 | 41.73 ± 1.88 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp4 | 64.59 ± 1.74 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp8 | 90.43 ± 6.76 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp16 | 116.63 ± 4.61 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp32 | 149.20 ± 3.57 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp64 | 220.87 ± 3.26 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp128 | 281.63 ± 10.09 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp256 | 335.13 ± 3.83 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp512 | 387.05 ± 4.75 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp1024 | 419.31 ± 4.89 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp2048 | 432.76 ± 2.65 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp4096 | 415.24 ± 1.01 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | tg32 | 28.73 ± 0.02 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp1 | 28.55 ± 0.04 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp2 | 42.46 ± 1.55 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp4 | 62.05 ± 6.83 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp8 | 92.30 ± 3.85 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp16 | 115.83 ± 3.81 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp32 | 149.40 ± 3.75 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp64 | 218.15 ± 5.20 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp128 | 288.04 ± 7.47 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp256 | 348.56 ± 1.85 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp512 | 390.93 ± 2.66 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp1024 | 432.22 ± 3.77 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp2048 | 449.12 ± 1.44 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp4096 | 432.58 ± 0.94 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | tg32 | 28.69 ± 0.04 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp1 | 2.43 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp2 | 4.83 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp4 | 9.60 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp8 | 16.40 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp16 | 28.18 ± 0.02 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp32 | 48.80 ± 0.09 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp64 | 68.87 ± 1.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp128 | 78.35 ± 0.51 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp256 | 82.86 ± 0.82 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp512 | 81.20 ± 1.98 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp1024 | 79.87 ± 0.60 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp2048 | 76.31 ± 0.56 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp4096 | 71.55 ± 0.59 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | tg32 | 2.42 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp1 | 2.42 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp2 | 4.81 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp4 | 9.57 ± 0.05 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp8 | 16.42 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp16 | 28.22 ± 0.08 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp32 | 48.70 ± 0.10 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp64 | 68.88 ± 1.37 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp128 | 78.62 ± 0.32 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp256 | 83.97 ± 1.29 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp512 | 81.39 ± 1.47 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp1024 | 79.39 ± 1.66 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp2048 | 78.16 ± 0.51 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp4096 | 77.21 ± 1.11 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | tg32 | 2.43 ± 0.00 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp1 | 5.94 ± 0.03 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp2 | 11.51 ± 0.21 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp4 | 21.03 ± 0.27 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp8 | 29.97 ± 4.07 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp16 | 38.42 ± 0.04 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp32 | 43.22 ± 0.11 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp64 | 87.87 ± 0.83 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp128 | 112.88 ± 0.45 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp256 | 131.94 ± 0.53 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp512 | 143.66 ± 0.31 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp1024 | 147.06 ± 0.35 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp2048 | 144.41 ± 1.99 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp4096 | 141.16 ± 1.01 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | tg32 | 5.96 ± 0.00 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp1 | 5.96 ± 0.00 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp2 | 11.60 ± 0.01 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp4 | 21.23 ± 0.02 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp8 | 31.81 ± 0.73 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp16 | 38.51 ± 0.07 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp32 | 43.33 ± 0.09 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp64 | 87.93 ± 0.91 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp128 | 112.92 ± 0.37 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp256 | 131.43 ± 0.29 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp512 | 144.43 ± 0.06 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp1024 | 148.03 ± 1.10 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp2048 | 146.93 ± 1.44 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp4096 | 142.78 ± 2.26 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | tg32 | 5.96 ± 0.00 |

(For the DGX Spark it may be faster with fp16.)


It took forever, but here we go. I made sure that all GPU offload was off, but later also recompiled llama.cpp with only GGML_NATIVE=ON and GGML_CUDA=OFF, and got the same results.

I confirmed that CPU was at 100% and GPU was at 0%.

Not great, but understandable, since there is no optimized CPU backend for this ARM CPU; it just uses the generic one, as opposed to AMD, which can use AVX512, etc.:

-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 
eugr@spark:~/llm/llama.cpp$ build/bin/llama-bench -ctk bf16 -ctv bf16 \
     -ub 4096 -fa "0,1"  \
     -p "1,2,4,8,16,32,64,128,256,512,1024,2048,4096" -n 32  \
     -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf  \
     -ngl 0 -nkvo 1 -nopo 1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 12.41 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 8.75 ± 2.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 17.06 ± 0.55 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 19.07 ± 2.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 21.17 ± 3.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 24.80 ± 2.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 30.61 ± 0.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 38.11 ± 1.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 45.45 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 43.21 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 35.06 ± 0.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 24.97 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 21.57 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 9.75 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 13.01 ± 1.79 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 11.17 ± 5.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 15.84 ± 3.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 20.77 ± 1.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 22.56 ± 4.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 25.66 ± 2.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 30.80 ± 0.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 33.21 ± 0.22 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 48.87 ± 0.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 53.08 ± 0.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 52.24 ± 0.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 47.27 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 39.49 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 9.24 ± 0.81 |

build: 9285325c (6817)


You can try f16 for the KV cache (not bf16: Ryzen has AVX512-BF16, but ARM has FP16 vector instructions), and there is a special path for it the same way there is for x86.
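i.e. the same llama-bench invocation as above, just with the KV cache type switched:

./llama-bench \
     -ctk f16 -ctv f16 \
     -ub 4096 -fa "0,1" \
     -p "1,2,4,8,16,32,64,128,256,512,1024,2048,4096" -n 32 \
     -m ./openai_gpt-oss-120b-MXFP4.gguf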

What is more “interesting” is:

=> so the CPU RAM access is 3x slower than for Ryzen. That may be the cause of the slow model loading.

GitHub - ikawrakow/ik_llama.cpp (llama.cpp fork with additional SOTA quants and improved performance) may have better perf on this quantized model, too.

It shouldn’t be. According to this analysis, the memory controller is on the CPU die:

The Blackwell iGPU die, as expected, is based on NVIDIA’s Blackwell GPU architecture. It doesn’t have dedicated memory of its own. Instead, it connects to the system’s LPDDR5X DRAM through memory controllers located in the MediaTek CPU die. The C2C interconnect provides around 600 GB/s of aggregate bandwidth, more than enough for the iGPU to access the full system memory bandwidth.

So, not sure what’s going on. Maybe kernel issues?


“Aggregate” for NVIDIA => in + out, i.e. 300 GB/s each way, and that is between the CPU part and the GPU part. It doesn’t indicate what the CPU itself can use. (And it is used for cache coherence.)

:crossed_fingers: that I am wrong…

On the Framework with a Samsung 4TB 990 Pro I get:

zzzz@max:~/LLM$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 8192 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing O_DIRECT cached reads:   8360 MB in  2.00 seconds = 4180.61 MB/sec
 Timing O_DIRECT disk reads: 7964 MB in  3.00 seconds = 2654.63 MB/sec

zzzz@max:~/LLM$ sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 8192 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing cached reads:   76796 MB in  2.00 seconds = 38483.65 MB/sec
 Timing buffered disk reads: 18654 MB in  3.00 seconds = 6216.83 MB/sec

Reran mine with your parameters:

DGX Spark (stock 4TB):

eugr@spark:~$ sudo nvme id-ctrl /dev/nvme0n1 | grep -e "mn\s"
mn        : SAMSUNG MZALC4T0HBL1-00B07

eugr@spark:~$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3907018/64/32, sectors = 8001573552, start = 0
 Timing O_DIRECT cached reads:   16652 MB in  2.00 seconds = 8338.15 MB/sec
 Timing O_DIRECT disk reads: 13506 MB in  3.00 seconds = 4501.88 MB/sec
 
eugr@spark:~$ sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3907018/64/32, sectors = 8001573552, start = 0
 Timing cached reads:   45626 MB in  1.99 seconds = 22871.34 MB/sec
 Timing buffered disk reads: 2664 MB in  3.00 seconds = 887.86 MB/sec

GMKTek Evo X2 (stock 2TB):

eugr@ai:~$ sudo nvme id-ctrl /dev/nvme0n1 | grep -e "mn\s"
mn        : Lexar SSD ARES 2TB

eugr@ai:~$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1953514/64/32, sectors = 4000797360, start = 0
 Timing O_DIRECT cached reads:   7324 MB in  2.00 seconds = 3662.68 MB/sec
 Timing O_DIRECT disk reads: 9398 MB in  3.00 seconds = 3131.64 MB/sec

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1953514/64/32, sectors = 4000797360, start = 0
 Timing cached reads:   62474 MB in  1.99 seconds = 31316.17 MB/sec
 Timing buffered disk reads: 10468 MB in  3.00 seconds = 3489.27 MB/sec

Did you tweak your readahead setting?
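(For reference, readahead can be checked and changed per device with blockdev, e.g.:)

sudo blockdev --getra /dev/nvme0n1        # current readahead, in 512-byte sectors
sudo blockdev --setra 8192 /dev/nvme0n1   # e.g. match the 8192 your Framework reports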

Here is what I’m getting on my gaming rig (i9-14900K) with Samsung 990 Pro 2TB:

$ sudo nvme id-ctrl /dev/nvme0n1 | grep -e "mn\s"
mn        : Samsung SSD 990 PRO 2TB                 
$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1907729/64/32, sectors = 3907029168, start = 0
 Timing O_DIRECT cached reads:   6878 MB in  2.00 seconds = 3443.73 MB/sec
 Timing O_DIRECT disk reads: 5314 MB in  3.00 seconds = 1771.27 MB/sec

$ sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1907729/64/32, sectors = 3907029168, start = 0
 Timing cached reads:   49876 MB in  2.00 seconds = 24982.13 MB/sec
 Timing buffered disk reads: 8176 MB in  3.00 seconds = 2725.10 MB/sec

A stock Silverblue 42.

             .',;::::;,'.                 philou@max
         .';:cccccccccccc:;,.             ----------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 42.20251023.0 (Silverblue) x86_64
    .:cccccccccccccccccccccccccc:.        Host: Desktop (AMD Ryzen AI Max 300 Series) (A6)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.4-200.fc42.x86_64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 11 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 1583 (rpm), 20 (flatpak)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.2.37
:cccccccccccccc;MMM.;cccccccccccccccc:    Terminal: /dev/pts/0
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    CPU: AMD RYZEN AI MAX+ 395 (32) @ 5.19 GHz
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    GPU: AMD Radeon 8060S Graphics [Integrated]
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    Memory: 2.31 GiB / 123.60 GiB (2%)
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Swap: 0 B / 8.00 GiB (0%)
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Disk (/): 21.12 MiB / 21.12 MiB (100%) - overlay [Read-only]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Disk (/etc): 1.12 TiB / 3.64 TiB (31%) - btrfs
cccccccc;.:odl:.;cccccccccccccc:,.        Local IP (enp191s0): 192.168.1.105/24
ccccccccccccccccccccccccccccc:'.          Locale: fr_FR.UTF-8
:ccccccccccccccccccccccc:;,..
 ':cccccccccccccccc::;,.                                          
                                                                  

But on my Framework 16 with Fedora 42 (updated from fc41…) with the expansion bay:

             .',;::::;,'.                 zzzzzzz@framework
         .';:cccccccccccc:;,.             ----------------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 42 (Workstation Edition) x86_64
    .:cccccccccccccccccccccccccc:.        Host: Laptop 16 (AMD Ryzen 7040 Series) (A9)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.4-200.fc42.x86_64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 6 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 2471 (rpm), 13 (flatpak)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.2.37
:cccccccccccccc;MMM.;cccccccccccccccc:    Display (BOE0BC9): 2560x1600 @ 165 Hz (as 1464x915) in 16" [Built-in]
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    DE: GNOME 48.5
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    WM: Mutter (Wayland)
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    WM Theme: Adwaita
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Theme: Adwaita [GTK2/3/4]
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Icons: Adwaita [GTK2/3/4]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Font: Adwaita Sans (11pt) [GTK2/3/4]
cccccccc;.:odl:.;cccccccccccccc:,.        Cursor: Adwaita (24px)
ccccccccccccccccccccccccccccc:'.          Terminal: Ptyxis 48.5
:ccccccccccccccccccccccc:;,..             Terminal Font: Adwaita Mono (11pt)
 ':cccccccccccccccc::;,.                  CPU: AMD Ryzen 9 7940HS (16) @ 5.26 GHz
                                          GPU: AMD Radeon 780M Graphics [Integrated]
                                          Memory: 8.57 GiB / 123.62 GiB (7%)
                                          Swap: 0 B / 128.00 GiB (0%)
                                          Disk (/): 865.21 GiB / 1.69 TiB (50%) - btrfs
                                          Disk (/home): 1.71 TiB / 3.64 TiB (47%) - btrfs
                                          Local IP (enp195s0f3u2u1): 192.168.1.38/24
                                          Battery (FRANDBA): 60% [AC Connected]
                                          Locale: fr_FR.UTF-8
mn        : Samsung SSD 990 PRO 2TB                 => on MB M2 port
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1907729/64/32, sectors = 3907029168, start = 0
 Timing O_DIRECT cached reads:   6880 MB in  2.00 seconds = 3445.18 MB/sec
 Timing O_DIRECT disk reads: 10816 MB in  3.00 seconds = 3604.98 MB/sec
 Timing cached reads:   60602 MB in  1.98 seconds = 30601.95 MB/sec
 Timing buffered disk reads: 9622 MB in  3.00 seconds = 3207.25 MB/sec

mn        : Samsung SSD 990 PRO 4TB                 => on expansion bay
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing O_DIRECT cached reads:   9024 MB in  2.00 seconds = 4519.88 MB/sec
 Timing O_DIRECT disk reads: 8042 MB in  3.00 seconds = 2680.30 MB/sec
 Timing cached reads:   62682 MB in  1.98 seconds = 31666.42 MB/sec
 Timing buffered disk reads: 6100 MB in  3.00 seconds = 2033.12 MB/sec

mn        : Samsung SSD 990 PRO 4TB                 => on expansion bay
/dev/nvme2n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing O_DIRECT cached reads:   5952 MB in  2.00 seconds = 2980.18 MB/sec
 Timing O_DIRECT disk reads: 6910 MB in  3.00 seconds = 2302.84 MB/sec
 Timing cached reads:   62934 MB in  1.98 seconds = 31788.59 MB/sec
 Timing buffered disk reads: 4134 MB in  3.00 seconds = 1377.57 MB/sec

(For the desktop I connect via SSH; the FW16 I use directly, with many apps open :wink: )

Installing Fedora 43 Beta on my DGX Spark (on an external SSD for now). Let’s see if I’ll be able to make the CUDA part work there :slight_smile:

Having said that, for most users it just doesn’t make sense, as stock Ubuntu is officially supported and works out of the box. But I’ve got to try, lol.

		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
		LnkSta:	Speed 16GT/s, Width x4
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
		LnkSta:	Speed 16GT/s, Width x2 (downgraded)
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
		LnkSta:	Speed 16GT/s, Width x4

Looks like one of my SSDs in the expansion bay only uses 2 lanes… that explains why it is slower than the other. I still have to understand why…
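(For reference, the link capability/status above comes from something like:)

# negotiated vs. maximum PCIe link width/speed for NVMe controllers (class 0108)
sudo lspci -vv -d ::0108 | grep -E 'LnkCap:|LnkSta:'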

Good luck installing Fedora.

On an “old” Zen 3 (AMD Ryzen 9 5950X):

=> PCIe v3.0
mn        : Samsung SSD 970 PRO 1TB                 
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 976762/64/32, sectors = 2000409264, start = 0
 Timing cached reads:   68860 MB in  1.98 seconds = 34806.92 MB/sec
 Timing buffered disk reads: 7656 MB in  3.00 seconds = 2551.63 MB/sec
 Timing O_DIRECT cached reads:   4084 MB in  2.00 seconds = 2042.77 MB/sec
 Timing O_DIRECT disk reads: 7062 MB in  3.00 seconds = 2353.74 MB/sec

mn        : Samsung SSD 980 1TB                     
 readonly      =  0 (off)
 readahead     = 512 (on)
 geometry      = 953869/64/32, sectors = 1953525168, start = 0
 Timing cached reads:   69306 MB in  1.98 seconds = 35034.47 MB/sec
 Timing buffered disk reads: 2284 MB in  3.00 seconds = 760.97 MB/sec
 Timing O_DIRECT cached reads:   5278 MB in  2.00 seconds = 2641.33 MB/sec
 Timing O_DIRECT disk reads: 3430 MB in  3.00 seconds = 1143.13 MB/sec


philou@serveur:~$ fastfetch
             .',;::::;,'.                 philou@serveur
         .';:cccccccccccc:;,.             --------------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 42 (Workstation Edition) x86_64
    .:cccccccccccccccccccccccccc:.        Kernel: Linux 6.16.10-200.fc42.x86_64
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Uptime: 7 mins
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Packages: 2457 (rpm)
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Shell: bash 5.2.37
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Terminal: /dev/pts/0
:cccccccccccccc;MMM.;cccccccccccccccc:    CPU: AMD Ryzen 9 5950X (32) @ 5.09 GHz
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    GPU: AMD Radeon RX 6900 XT [Discrete]
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    Memory: 3.87 GiB / 125.69 GiB (3%)
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    Swap: 0 B / 63.51 GiB (0%)
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Disk (/): 388.71 GiB / 952.28 GiB (41%) - btrfs
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Disk (/stock/data): 1.19 TiB / 10.92 TiB (11%) - btrfs
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Disk (/tmp): 5.62 GiB / 868.00 GiB (1%) - f2fs
cccccccc;.:odl:.;cccccccccccccc:,.        Local IP (enp5s0): 192.168.1.104/24
ccccccccccccccccccccccccccccc:'.          Locale: fr_FR.UTF-8
:ccccccccccccccccccccccc:;,..
 ':cccccccccccccccc::;,.                                          
                                                             

Two different readahead values…???

Well, well, well, have a look at that:

eugr@spark:~$ fastfetch
             .',;::::;,'.                 eugr@spark
         .';:cccccccccccc:;,.             ----------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 43 (KDE Plasma Desktop Edition) aarch64
    .:cccccccccccccccccccccccccc:.        Host: NVIDIA_DGX_Spark (A.7)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.1-300.fc43.aarch64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 22 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 2421 (rpm)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.3.0
:cccccccccccccc;MMM.;cccccccccccccccc:    Display (Unknown-1): 800x600 @ 60 Hz in 10"
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    DE: KDE Plasma 6.4.5
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    WM: KWin (Wayland)
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    WM Theme: Breeze
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Theme: Breeze (Light) [Qt], Breeze [GTK2/3]
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Icons: Breeze [Qt], breeze [GTK2/3/4]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Font: Noto Sans (10pt) [Qt], Noto Sans (10pt) [GTK2/3/4]
cccccccc;.:odl:.;cccccccccccccc:,.        Cursor: Breeze (24px)
ccccccccccccccccccccccccccccc:'.          Terminal: /dev/pts/4
:ccccccccccccccccccccccc:;,..             CPU: Cortex-A725*5 + Cortex-X925*5 + Cortex-A725*5 + Cortex-X925*5 (20) @ 3.90 GHz
 ':cccccccccccccccc::;,.                  GPU: NVIDIA Device 2E12 (VGA compatible)
                                          Memory: 4.37 GiB / 119.69 GiB (4%)
                                          Swap: 0 B / 8.00 GiB (0%)
                                          Disk (/): 20.17 GiB / 538.30 GiB (4%) - btrfs
                                          Local IP (enP7s7): 192.168.24.104/24
                                          Locale: en_US.UTF-8

                                                                  
eugr@spark:~$ nvidia-smi
Thu Oct 23 13:07:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    Off |   0000000F:01:00.0 Off |                  N/A |
| N/A   38C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Getting somewhere!

eugr@spark:~/llama.cpp$ build/bin/llama-cli --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GB10 (122558 MiB, 117541 MiB free)

lstopo lstopo.png => for the Framework Desktop!

(It did not report the GPU…)

./build_ref/rocm/bin/llama-cli --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
  ROCm0: Radeon 8060S Graphics (63282 MiB, 63280 MiB free)

Did you configure GTT, or is the default “VRAM” being reported? (In my case, the default GTT…)

No, I didn’t configure anything, it just works. But I’m getting slower performance than on stock DGX OS:

eugr@spark:~/llama.cpp$ build/bin/llama-bench -m /run/media/eugr/root/home/eugr/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1822.91 ± 6.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 41.45 ± 0.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1710.41 ± 4.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 37.55 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1575.44 ± 15.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 36.15 ± 0.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1373.54 ± 3.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 34.05 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1072.37 ± 9.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 30.53 ± 0.03 |

build: 0bf47a1db (6829)

This is Spark: