DGX Spark vs. Strix Halo - Initial Impressions

Just wanted to share my initial impressions after using both a Strix Halo machine (GMKTek Evo X2, 128GB) and the NVidia DGX Spark as an AI developer.

I posted some of my experience with both platforms in the Batch 13 thread, but decided to make a separate post.

Hardware

DGX Spark is probably the most minimalist mini-PC I’ve ever used.

It has absolutely no LEDs, not even in the LAN port, and the power switch is just a button, so unless you ping it over the network or hook up a display, good luck guessing whether this thing is on.
All ports are on the back: there is no DisplayPort, only a single HDMI port, a USB-C port (power only), 3x USB-C 3.2 Gen 2 ports, a 10G Ethernet port, and 2x QSFP ports.

The air intake is in the front and exhaust is in the back. It is quiet for the most part, but the fan is quite audible when it’s on (but quieter than my GMKTek).

It has a single 4TB PCIe 5.0 x4 M.2 2242 SSD (SAMSUNG MZALC4T0HBL1-00B07), which I couldn’t find for sale anywhere in the 2242 form factor, only the 2280 version, but DGX Spark only takes 2242 drives. I wish they had gone with the standard 2280 - a weird decision, given that it’s a mini-PC, not a laptop or tablet. Who cares if the motherboard is an inch longer!

SSD performance seems good, giving me 4240.64 MB/sec vs. 3118.53 MB/sec on my GMKTek (as measured by hdparm).
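(For reference, these are sequential read numbers from something like the following; the full invocation appears later in the thread:)

sudo hdparm -Ttv --direct /dev/nvme0n1    # O_DIRECT cached and disk read timings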

The SSD is user-replaceable, but there is only one slot, accessible from the bottom of the device: you need to take the magnetic plate off, and there are access screws underneath.

The unit is made of metal and gets quite hot under high load, but not unbearably hot like some reviews mentioned. It cools down quickly, though (metal!).

The CPU is a 20-core ARM part with 10 performance and 10 efficiency cores. I didn’t benchmark it, but other reviews show CPU performance similar to Strix Halo.

Initial Setup

DGX Spark comes with DGX OS pre-installed (more on this later). You can set it up interactively using a keyboard/mouse/display, or in headless mode via a WiFi hotspot that it creates.

I tried to set it up by connecting my trusty Logitech keyboard/trackpad combo that I use to set up pretty much all my server boxes, but once it booted up, it displayed a “Connect the keyboard” message and didn’t let me proceed any further. The trackpad portion worked, and the volume keys on the keyboard worked too! I rebooted and was able to enter the BIOS (by pressing Esc) just fine, and the keyboard was fully functional there!

BTW, it has an AMI BIOS, but it doesn’t expose anything interesting other than networking and boot options.

Booting into DGX OS resulted in the same problem. After some googling, I figured out that it shipped with a borked kernel that broke Logitech Unifying setups, so I decided to proceed in headless mode.

I connected to the WiFi hotspot from my Mac (the hotspot SSID/password are printed on a sticker on top of the quick start guide) and was able to continue the setup there, which was pretty smooth, other than the Mac spamming me with a “connect to internet” popup every minute or so. The device then proceeded to update the firmware and OS packages, which took about 30 minutes but eventually finished, and after that my Logitech keyboard worked just fine.

Linux Experience

DGX Spark runs DGX OS 7.2.3, which is based on Ubuntu 24.04.3 LTS but uses NVidia’s custom kernel, an older one than mainline Ubuntu LTS ships.
So instead of 6.14.x you get 6.11.0-1016-nvidia.

It comes with the CUDA 13.0 development kit and NVidia drivers (580.95.05) pre-installed.
It also has NVidia’s container toolkit with Docker, and GPU passthrough works well.
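A quick sanity check of the GPU passthrough (a minimal sketch - the image choice is arbitrary, since the container toolkit injects the driver bits and nvidia-smi into the container):

docker run --rm --gpus all ubuntu nvidia-smi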

Other than that, it’s a standard Ubuntu Desktop installation, with GNOME and everything.

SSHd is enabled by default, so after headless install you can connect to it immediately without any extra configuration.

RDP remote desktop doesn’t work currently - it connects, but the display output is broken.

I tried to boot from a Fedora 43 Beta Live USB, and it worked, sort of. First, you need to disable Secure Boot in the BIOS. Then it boots only in “basic graphics mode”, because the built-in nvidia drivers don’t recognize the chipset. It also throws other errors complaining about the chipset, processor cores, etc.

I think I’ll try to install it to an external SSD and see if NVidia’s standard drivers recognize the chip. There is hope:

==============

PLATFORM INFO:

IOMMU: Pass-through or enabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 13000
Platform: NVIDIA_DGX_Spark, Arch: aarch64(Linux 6.11.0-1016-nvidia)
Platform verification succeeded

As for Strix Halo, it’s an x86 PC, so you can run any distro you want. I chose Fedora 43 Beta, currently running with kernel 6.17.3-300.fc43.x86_64.
Smooth sailing, up-to-date packages.

Llama.cpp Experience

DGX Spark

You need to build it from source, as there is no CUDA ARM build, but compiling llama.cpp was very straightforward: the CUDA toolkit is already installed, so you just need to install the development tools, and it compiles just like on any other system with an NVidia GPU. Just follow the instructions, no surprises.
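For reference, the whole thing boils down to the standard CUDA build (a sketch; the exact dev-tools package set may vary):

sudo apt install -y build-essential cmake
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j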

However, when I ran the benchmarks, I ran into two issues.

  1. The model loading was VERY slow. It took 1 minute 40 seconds to load gpt-oss-120b. For comparison, it takes 22 seconds to load on Strix Halo (both from cold, memory cache flushed).
  2. I wasn’t getting the same results as ggerganov in this thread. While PP was pretty impressive for such a small system, TG merely matched or was even slightly worse than my Strix Halo setup with ROCm.

For instance, here are my Strix Halo numbers, compiled with ROCm 7.10.0a20251017, llama.cpp build 03792ad9 (6816), HIP only, no rocWMMA:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 999.59 ± 4.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 47.49 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 824.37 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 44.23 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 703.42 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 42.52 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 514.89 ± 3.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 39.71 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 348.59 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 35.39 ± 0.01 |

The same command on Spark gave me this:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1816.00 ± 11.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 44.74 ± 0.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1763.75 ± 6.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 42.69 ± 0.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1695.29 ± 11.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 40.91 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1512.65 ± 6.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 38.61 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1250.55 ± 5.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 34.66 ± 0.02 |

I tried enabling the Unified Memory switch (GGML_CUDA_ENABLE_UNIFIED_MEMORY=1) - it improved model loading, but resulted in even worse performance.

I reached out to ggerganov, and he suggested disabling mmap. I thought I had tried that, but apparently not.
Well, that fixed it. Model loading improved too - it now takes 56 seconds from cold and 23 seconds when the model is still in the cache.
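The only change to the benchmark command was turning mmap off:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0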

Updated numbers:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1939.32 ± 4.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 56.33 ± 0.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1832.04 ± 5.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 52.63 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1738.07 ± 5.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 48.60 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1525.71 ± 12.34 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 45.01 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1242.35 ± 5.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 39.10 ± 0.09 |

As you can see, much better performance both in PP and TG.

As for Strix Halo, mmap/no-mmap doesn’t make any difference there.

Strix Halo

On Strix Halo, the llama.cpp experience is… well, a bit turbulent.

You can download a pre-built version for Vulkan, and it works, but the performance is a mixed bag. TG is pretty good, but PP is not great.

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 --mmap 0 -ngl 999 -ub 1024

NOTE: Vulkan likes a batch size of 1024 the most, unlike ROCm, which prefers 2048.

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 | 526.54 ± 4.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 | 52.64 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d4096 | 438.85 ± 0.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d4096 | 48.21 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d8192 | 356.28 ± 4.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d8192 | 45.90 ± 0.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d16384 | 210.17 ± 2.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d16384 | 42.64 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d32768 | 138.79 ± 9.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d32768 | 36.18 ± 0.02 |

I tried the toolboxes from kyuz0, and some of them were better, but I still felt that I could squeeze more juice out of it. All of them suffered from significant performance degradation as the context filled up.

Then I tried to compile my own using the latest ROCm build from TheRock (as of that date).

I also built rocWMMA, as recommended by kyuz0 (more on that later).

Llama.cpp compiled without major issues - I had to configure the paths properly, but other than that, it just worked.
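Roughly, the build looked like this (a sketch, assuming TheRock’s ROCm is installed under /opt/rocm; gfx1151 is the Strix Halo iGPU, and GGML_HIP_ROCWMMA_FATTN is the rocWMMA option I ended up turning off later):

HIPCXX=/opt/rocm/llvm/bin/clang++ cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j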
The PP increased dramatically, but TG decreased.

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 | 1030.71 ± 2.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 | 47.84 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d4096 | 802.36 ± 6.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d4096 | 39.09 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d8192 | 615.27 ± 2.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d8192 | 33.34 ± 0.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d16384 | 409.25 ± 0.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d16384 | 25.86 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d32768 | 228.04 ± 0.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d32768 | 18.07 ± 0.03 |

But the biggest issue is significant performance degradation with long context, much more than you’d expect.

Then I stumbled upon the Lemonade SDK and their pre-built llama.cpp. I ran that one and got much better results across the board. TG was still below Vulkan, but PP was decent, and degradation wasn’t as bad:

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 999.20 ± 3.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 47.53 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 826.63 ± 9.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 44.24 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 702.66 ± 2.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 42.56 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 505.85 ± 1.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 39.82 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 343.06 ± 2.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 35.50 ± 0.02 |

So I looked at their compilation options and noticed that they build without rocWMMA. I did the same and got similar performance too!

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1000.93 ± 1.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 47.46 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 827.34 ± 1.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 44.20 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 701.68 ± 2.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 42.39 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 503.49 ± 0.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 39.61 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 344.36 ± 0.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 35.32 ± 0.01 |

So far that’s the best I could get from Strix Halo. It’s very usable for text generation tasks.

Also, I wanted to touch on multi-modal performance. That’s where Spark shines. I don’t have any specific benchmarks yet, but image processing is much faster on Spark than on Strix Halo, especially in vLLM.

vLLM Experience

Haven’t had a chance to do extensive testing here, but wanted to share some early thoughts.

DGX Spark

First, I tried to just build vLLM from source as usual. The build itself succeeded, but it then failed with the following error: ptxas fatal : Value ‘sm_121a’ is not defined for option ‘gpu-name’

I decided not to spend too much time on this for now, and just launched the vLLM container that NVidia provides through their Docker registry.
It is built for DGX Spark, so it supports it out of the box.
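Roughly (a sketch - the registry path and tag below are placeholders for whatever NVidia’s catalog currently lists):

docker run --rm -it --gpus all -p 8000:8000 nvcr.io/nvidia/vllm:<tag> \
    vllm serve openai/gpt-oss-120b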

However, it has version 0.10.1, so I wasn’t able to run Qwen3-VL there.

Now, they put the source code inside the container, but it wasn’t a git repository - it probably contains some NVidia-specific patches. I’ll need to see if those could be merged into the main vLLM code.

So I just checked out the vLLM main branch and proceeded to build it against the existing PyTorch as usual. This time I was able to run it and launch Qwen3-VL models just fine.
Both dense and MoE models work. I tried FP4 and AWQ quants - everything works, no need to disable CUDA graphs.
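The “build against existing PyTorch” path is essentially what the vLLM docs describe; roughly (a sketch):

git clone https://github.com/vllm-project/vllm
cd vllm
python use_existing_torch.py              # drop pinned torch requirements, reuse the pre-installed torch
pip install -r requirements/build.txt
pip install -e . --no-build-isolation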

The performance is decent - I still need to run some benchmarks, but image processing is very fast.

Strix Halo

Unlike llama.cpp, which just works, the vLLM experience on Strix Halo is much more limited.

My goal was to run Qwen3-VL models that are not supported by llama.cpp yet, so I needed to build 0.11.0 or later. There are some existing containers/toolboxes for earlier versions, but I couldn’t use them.

So I installed the ROCm PyTorch libraries from TheRock, applied some patches from the kyuz0 toolboxes to avoid an amdsmi package crash, built ROCm FlashAttention, and then just followed vLLM’s standard installation instructions with existing PyTorch.

I was able to run Qwen3-VL dense models at decent (for dense models) speeds, although initialization takes quite some time unless you reduce --max-num-seqs to 1 and set tensor parallelism to 1.
Image processing is very slow though, much slower than llama.cpp for the same image, but token generation is about what you’d expect from it.
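For reference, the launch ends up looking something like this (a sketch; the model ID and context length are just examples):

vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --max-num-seqs 1 \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --max-model-len 32768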

Again, model loading is faster than on Spark for some reason (I’d expect it to be the other way around, given Spark’s faster SSD and slightly faster memory).

I’m going to rebuild vLLM and re-test/benchmark later.

Some observations:

  • FP8 models don’t work - they hang on the following warning: WARNING 10-22 12:55:04 [fp8_utils.py:785] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/eugr/vllm/vllm/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=2560,device_name=Radeon_8060S_Graphics,dtype=fp8_w8a8,block_shape=[128,128].json
  • You need to use --enforce-eager, as CUDA graphs crash vLLM. Sometimes it works, but mostly it crashes.
  • Even with --enforce-eager, there are occasional HIP-related crashes here and there.
  • AWQ models work, both 4-bit and 8-bit, but only dense ones. AWQ MoE quants require the Marlin kernel, which is not available for ROCm.

Conclusion / TL;DR

Summary of my initial impressions:

  • DGX Spark is an interesting beast for sure.
    • Limited extensibility - no USB4, only one M.2 slot, and it’s 2242.
    • But it has a 200 Gbps network interface.
  • It’s the first generation of such devices, so there are some annoying bugs and incompatibilities.
  • Inference-wise, token generation is nearly identical to Strix Halo in both llama.cpp and vLLM, but prompt processing is 2-5x faster than on Strix Halo.
    • Strix Halo performance in prompt processing degrades much faster with context.
    • Image processing takes much longer on Strix Halo, especially with vLLM.
    • Model loading into unified RAM is slower on DGX Spark for some reason, both in llama.cpp and vLLM.
  • Even though vLLM includes gfx1151 in its supported configurations, it still requires some hacks to compile.
    • And even then, the experience is suboptimal: initialization is slow, it crashes, FP8 doesn’t work, and AWQ for MoE doesn’t work.
  • If you are an AI developer who uses transformers/PyTorch, or you need vLLM, you are better off with DGX Spark (or just a normal GPU build).
  • If you want a power-efficient inference server that can run gpt-oss and similar MOE at decent speeds, and don’t need to process images often, Strix Halo is the way to go.
  • If you want a general purpose machine, Strix Halo wins too.

Curious to have some benchmarks with this CPU.

With the Framework, I have:

./llama-bench \
     -ctk bf16 -ctv bf16 \   (or -ctk q8_0 -ctv q8_0)
     -ub 4096 -fa "0,1" \
     -p "1,2,4,8,16,32,64,128,256,512,1024,2048,4096" -n 32 \
     -m ./Mistral-Small-2506-BF16.gguf  ( or openai_gpt-oss-120b-MXFP4.gguf, Mistral-Small-2506-Q6_K.gguf)
  • for llama.cpp
- backend: CPU
- threads: 16
- n_ubatch: 4096
- type_vk: bf16
| model | size | params | fa | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp1 | 27.89 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp2 | 32.43 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp4 | 48.68 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp8 | 64.52 ± 0.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp16 | 77.86 ± 1.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp32 | 89.48 ± 0.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp64 | 100.27 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp128 | 103.49 ± 0.82 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp256 | 107.07 ± 0.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp512 | 108.83 ± 0.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp1024 | 105.36 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp2048 | 96.97 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | pp4096 | 93.15 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 0 | tg32 | 27.91 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp1 | 28.68 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp2 | 39.44 ± 0.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp4 | 54.11 ± 1.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp8 | 74.42 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp16 | 88.83 ± 3.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp32 | 95.95 ± 2.37 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp64 | 102.33 ± 0.98 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp128 | 107.60 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp256 | 110.22 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp512 | 107.88 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp1024 | 104.05 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp2048 | 92.59 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | pp4096 | 71.20 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | tg32 | 28.62 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp1 | 2.53 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp2 | 4.76 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp4 | 9.43 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp8 | 18.33 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp16 | 34.80 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp32 | 60.04 ± 0.04 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp64 | 87.12 ± 0.99 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp128 | 90.08 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp256 | 89.81 ± 0.05 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp512 | 96.45 ± 0.11 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp1024 | 93.82 ± 0.04 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp2048 | 89.46 ± 0.17 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | pp4096 | 83.31 ± 1.16 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 0 | tg32 | 2.53 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp1 | 2.54 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp2 | 4.83 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp4 | 9.58 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp8 | 18.70 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp16 | 35.99 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp32 | 63.10 ± 0.02 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp64 | 89.64 ± 0.32 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp128 | 89.56 ± 0.08 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp256 | 87.10 ± 0.78 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp512 | 72.53 ± 1.16 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp1024 | 58.33 ± 0.33 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp2048 | 43.48 ± 0.32 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | pp4096 | 28.45 ± 0.25 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | 1 | tg32 | 2.55 ± 0.00 |

  • for ik_llama.cpp, it is much faster with FA and with quantized models:
- backend: CPU
- threads: 16
- n_ubatch: 4096

| model | size | params | kv | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp1 | 28.52 ± 0.02 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp2 | 41.77 ± 1.47 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp4 | 63.09 ± 4.71 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp8 | 92.49 ± 4.12 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp16 | 116.18 ± 4.06 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp32 | 149.21 ± 4.35 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp64 | 216.87 ± 5.22 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp128 | 278.34 ± 6.63 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp256 | 320.98 ± 1.92 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp512 | 332.53 ± 2.02 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp1024 | 306.91 ± 1.25 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp2048 | 228.94 ± 3.83 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | pp4096 | 200.31 ± 2.29 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 0 | tg32 | 28.58 ± 0.03 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp1 | 28.67 ± 0.01 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp2 | 41.73 ± 1.88 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp4 | 64.59 ± 1.74 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp8 | 90.43 ± 6.76 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp16 | 116.63 ± 4.61 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp32 | 149.20 ± 3.57 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp64 | 220.87 ± 3.26 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp128 | 281.63 ± 10.09 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp256 | 335.13 ± 3.83 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp512 | 387.05 ± 4.75 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp1024 | 419.31 ± 4.89 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp2048 | 432.76 ± 2.65 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | pp4096 | 415.24 ± 1.01 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | bf16 | 1 | tg32 | 28.73 ± 0.02 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp1 | 28.55 ± 0.04 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp2 | 42.46 ± 1.55 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp4 | 62.05 ± 6.83 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp8 | 92.30 ± 3.85 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp16 | 115.83 ± 3.81 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp32 | 149.40 ± 3.75 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp64 | 218.15 ± 5.20 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp128 | 288.04 ± 7.47 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp256 | 348.56 ± 1.85 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp512 | 390.93 ± 2.66 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp1024 | 432.22 ± 3.77 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp2048 | 449.12 ± 1.44 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | pp4096 | 432.58 ± 0.94 |
| gpt-oss 120B MXFP4 | 59.02 GiB | 116.83 B | q8_0 | 1 | tg32 | 28.69 ± 0.04 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp1 | 2.43 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp2 | 4.83 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp4 | 9.60 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp8 | 16.40 ± 0.01 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp16 | 28.18 ± 0.02 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp32 | 48.80 ± 0.09 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp64 | 68.87 ± 1.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp128 | 78.35 ± 0.51 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp256 | 82.86 ± 0.82 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp512 | 81.20 ± 1.98 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp1024 | 79.87 ± 0.60 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp2048 | 76.31 ± 0.56 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | pp4096 | 71.55 ± 0.59 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 0 | tg32 | 2.42 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp1 | 2.42 ± 0.00 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp2 | 4.81 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp4 | 9.57 ± 0.05 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp8 | 16.42 ± 0.03 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp16 | 28.22 ± 0.08 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp32 | 48.70 ± 0.10 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp64 | 68.88 ± 1.37 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp128 | 78.62 ± 0.32 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp256 | 83.97 ± 1.29 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp512 | 81.39 ± 1.47 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp1024 | 79.39 ± 1.66 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp2048 | 78.16 ± 0.51 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | pp4096 | 77.21 ± 1.11 |
| Mistral-Small BF16 | 43.91 GiB | 23.57 B | bf16 | 1 | tg32 | 2.43 ± 0.00 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp1 | 5.94 ± 0.03 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp2 | 11.51 ± 0.21 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp4 | 21.03 ± 0.27 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp8 | 29.97 ± 4.07 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp16 | 38.42 ± 0.04 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp32 | 43.22 ± 0.11 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp64 | 87.87 ± 0.83 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp128 | 112.88 ± 0.45 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp256 | 131.94 ± 0.53 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp512 | 143.66 ± 0.31 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp1024 | 147.06 ± 0.35 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp2048 | 144.41 ± 1.99 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | pp4096 | 141.16 ± 1.01 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | bf16 | 1 | tg32 | 5.96 ± 0.00 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp1 | 5.96 ± 0.00 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp2 | 11.60 ± 0.01 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp4 | 21.23 ± 0.02 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp8 | 31.81 ± 0.73 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp16 | 38.51 ± 0.07 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp32 | 43.33 ± 0.09 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp64 | 87.93 ± 0.91 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp128 | 112.92 ± 0.37 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp256 | 131.43 ± 0.29 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp512 | 144.43 ± 0.06 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp1024 | 148.03 ± 1.10 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp2048 | 146.93 ± 1.44 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | pp4096 | 142.78 ± 2.26 |
| Mistral-Small Q6_K | 18.31 GiB | 23.57 B | q8_0 | 1 | tg32 | 5.96 ± 0.00 |

(For the DGX Spark it may be faster with fp16.)


It took forever, but here we go. I made sure that all GPU offload was off, but later also recompiled llama.cpp with only GGML_NATIVE=ON and GGML_CUDA=OFF, and got the same results.

I confirmed that CPU was at 100% and GPU was at 0%.

Not great, but understandable, since there is no optimized CPU backend for this ARM CPU; it just uses the generic one, as opposed to AMD, which can use AVX512, etc.:

-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 
eugr@spark:~/llm/llama.cpp$ build/bin/llama-bench -ctk bf16 -ctv bf16 \
     -ub 4096 -fa "0,1"  \
     -p "1,2,4,8,16,32,64,128,256,512,1024,2048,4096" -n 32  \
     -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf  \
     -ngl 0 -nkvo 1 -nopo 1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 12.41 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 8.75 ± 2.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 17.06 ± 0.55 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 19.07 ± 2.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 21.17 ± 3.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 24.80 ± 2.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 30.61 ± 0.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 38.11 ± 1.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 45.45 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 43.21 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 35.06 ± 0.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 24.97 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 21.57 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 9.75 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 13.01 ± 1.79 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 11.17 ± 5.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 15.84 ± 3.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 20.77 ± 1.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 22.56 ± 4.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 25.66 ± 2.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 30.80 ± 0.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 33.21 ± 0.22 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 48.87 ± 0.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 53.08 ± 0.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 52.24 ± 0.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 47.27 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 39.49 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 9.24 ± 0.81 |

build: 9285325c (6817)


You can try f16 for the KV cache (not bf16: Ryzen has AVX512-BF16, but ARM has FP16 vector instructions), and there is a special path for it the same way there is for x86.
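i.e. the same llama-bench invocation as above, just with the KV cache type switched:

./llama-bench \
     -ctk f16 -ctv f16 \
     -ub 4096 -fa "0,1" \
     -p "1,2,4,8,16,32,64,128,256,512,1024,2048,4096" -n 32 \
     -m ./openai_gpt-oss-120b-MXFP4.gguf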

What is more “interesting” is:

=> so the CPU RAM access is 3x slower than for Ryzen. That may be the cause of the slow model loading.

GitHub - ikawrakow/ik_llama.cpp (llama.cpp fork with additional SOTA quants and improved performance) may have better perf on this quantized model, too.

It shouldn’t be. According to this analysis, the memory controller is on the CPU die:

The Blackwell iGPU die, as expected, is based on NVIDIA’s Blackwell GPU architecture. It doesn’t have dedicated memory of its own. Instead, it connects to the system’s LPDDR5X DRAM through memory controllers located in the MediaTek CPU die. The C2C interconnect provides around 600 GB/s of aggregate bandwidth, more than enough for the iGPU to access the full system memory bandwidth.

So, not sure what’s going on. Maybe kernel issues?


“Aggregate” for NVIDIA => in + out, i.e. 300 GB/s each way, and that is between the CPU part and the GPU part. It doesn’t indicate what the CPU itself can use. (And it is used for cache coherence.)

:crossed_fingers: that I am wrong…

On the Framework with a Samsung 4TB 990 Pro I get:

zzzz@max:~/LLM$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 8192 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing O_DIRECT cached reads:   8360 MB in  2.00 seconds = 4180.61 MB/sec
 Timing O_DIRECT disk reads: 7964 MB in  3.00 seconds = 2654.63 MB/sec

zzzz@max:~/LLM$ sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 8192 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing cached reads:   76796 MB in  2.00 seconds = 38483.65 MB/sec
 Timing buffered disk reads: 18654 MB in  3.00 seconds = 6216.83 MB/sec

Reran mine with your parameters:

DGX Spark (stock 4TB):

eugr@spark:~$ sudo nvme id-ctrl /dev/nvme0n1 | grep -e "mn\s"
mn        : SAMSUNG MZALC4T0HBL1-00B07

eugr@spark:~$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3907018/64/32, sectors = 8001573552, start = 0
 Timing O_DIRECT cached reads:   16652 MB in  2.00 seconds = 8338.15 MB/sec
 Timing O_DIRECT disk reads: 13506 MB in  3.00 seconds = 4501.88 MB/sec
 
eugr@spark:~$ sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3907018/64/32, sectors = 8001573552, start = 0
 Timing cached reads:   45626 MB in  1.99 seconds = 22871.34 MB/sec
 Timing buffered disk reads: 2664 MB in  3.00 seconds = 887.86 MB/sec

GMKTek Evo X2 (stock 2TB):

eugr@ai:~$ sudo nvme id-ctrl /dev/nvme0n1 | grep -e "mn\s"
mn        : Lexar SSD ARES 2TB

eugr@ai:~$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1953514/64/32, sectors = 4000797360, start = 0
 Timing O_DIRECT cached reads:   7324 MB in  2.00 seconds = 3662.68 MB/sec
 Timing O_DIRECT disk reads: 9398 MB in  3.00 seconds = 3131.64 MB/sec

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1953514/64/32, sectors = 4000797360, start = 0
 Timing cached reads:   62474 MB in  1.99 seconds = 31316.17 MB/sec
 Timing buffered disk reads: 10468 MB in  3.00 seconds = 3489.27 MB/sec

Did you tweak your readahead setting?
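(For reference, readahead can be checked and changed per device with blockdev, e.g.:)

sudo blockdev --getra /dev/nvme0n1        # current readahead, in 512-byte sectors
sudo blockdev --setra 8192 /dev/nvme0n1   # e.g. match the 8192 your Framework reports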

Here is what I’m getting on my gaming rig (i9-14900K) with Samsung 990 Pro 2TB:

$ sudo nvme id-ctrl /dev/nvme0n1 | grep -e "mn\s"
mn        : Samsung SSD 990 PRO 2TB                 
$ sudo hdparm -Ttv --direct /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1907729/64/32, sectors = 3907029168, start = 0
 Timing O_DIRECT cached reads:   6878 MB in  2.00 seconds = 3443.73 MB/sec
 Timing O_DIRECT disk reads: 5314 MB in  3.00 seconds = 1771.27 MB/sec

$ sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1907729/64/32, sectors = 3907029168, start = 0
 Timing cached reads:   49876 MB in  2.00 seconds = 24982.13 MB/sec
 Timing buffered disk reads: 8176 MB in  3.00 seconds = 2725.10 MB/sec

A stock Silverblue 42.

             .',;::::;,'.                 philou@max
         .';:cccccccccccc:;,.             ----------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 42.20251023.0 (Silverblue) x86_64
    .:cccccccccccccccccccccccccc:.        Host: Desktop (AMD Ryzen AI Max 300 Series) (A6)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.4-200.fc42.x86_64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 11 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 1583 (rpm), 20 (flatpak)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.2.37
:cccccccccccccc;MMM.;cccccccccccccccc:    Terminal: /dev/pts/0
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    CPU: AMD RYZEN AI MAX+ 395 (32) @ 5.19 GHz
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    GPU: AMD Radeon 8060S Graphics [Integrated]
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    Memory: 2.31 GiB / 123.60 GiB (2%)
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Swap: 0 B / 8.00 GiB (0%)
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Disk (/): 21.12 MiB / 21.12 MiB (100%) - overlay [Read-only]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Disk (/etc): 1.12 TiB / 3.64 TiB (31%) - btrfs
cccccccc;.:odl:.;cccccccccccccc:,.        Local IP (enp191s0): 192.168.1.105/24
ccccccccccccccccccccccccccccc:'.          Locale: fr_FR.UTF-8
:ccccccccccccccccccccccc:;,..
 ':cccccccccccccccc::;,.                                          
                                                                  

But on my Framework 16 with Fedora 42 (updated from fc41…) with the expansion bay:

             .',;::::;,'.                 zzzzzzz@framework
         .';:cccccccccccc:;,.             ----------------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 42 (Workstation Edition) x86_64
    .:cccccccccccccccccccccccccc:.        Host: Laptop 16 (AMD Ryzen 7040 Series) (A9)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.4-200.fc42.x86_64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 6 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 2471 (rpm), 13 (flatpak)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.2.37
:cccccccccccccc;MMM.;cccccccccccccccc:    Display (BOE0BC9): 2560x1600 @ 165 Hz (as 1464x915) in 16" [Built-in]
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    DE: GNOME 48.5
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    WM: Mutter (Wayland)
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    WM Theme: Adwaita
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Theme: Adwaita [GTK2/3/4]
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Icons: Adwaita [GTK2/3/4]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Font: Adwaita Sans (11pt) [GTK2/3/4]
cccccccc;.:odl:.;cccccccccccccc:,.        Cursor: Adwaita (24px)
ccccccccccccccccccccccccccccc:'.          Terminal: Ptyxis 48.5
:ccccccccccccccccccccccc:;,..             Terminal Font: Adwaita Mono (11pt)
 ':cccccccccccccccc::;,.                  CPU: AMD Ryzen 9 7940HS (16) @ 5.26 GHz
                                          GPU: AMD Radeon 780M Graphics [Integrated]
                                          Memory: 8.57 GiB / 123.62 GiB (7%)
                                          Swap: 0 B / 128.00 GiB (0%)
                                          Disk (/): 865.21 GiB / 1.69 TiB (50%) - btrfs
                                          Disk (/home): 1.71 TiB / 3.64 TiB (47%) - btrfs
                                          Local IP (enp195s0f3u2u1): 192.168.1.38/24
                                          Battery (FRANDBA): 60% [AC Connected]
                                          Locale: fr_FR.UTF-8
mn        : Samsung SSD 990 PRO 2TB                 => on MB M2 port
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1907729/64/32, sectors = 3907029168, start = 0
 Timing O_DIRECT cached reads:   6880 MB in  2.00 seconds = 3445.18 MB/sec
 Timing O_DIRECT disk reads: 10816 MB in  3.00 seconds = 3604.98 MB/sec
 Timing cached reads:   60602 MB in  1.98 seconds = 30601.95 MB/sec
 Timing buffered disk reads: 9622 MB in  3.00 seconds = 3207.25 MB/sec

mn        : Samsung SSD 990 PRO 4TB                 => on expansion bay
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing O_DIRECT cached reads:   9024 MB in  2.00 seconds = 4519.88 MB/sec
 Timing O_DIRECT disk reads: 8042 MB in  3.00 seconds = 2680.30 MB/sec
 Timing cached reads:   62682 MB in  1.98 seconds = 31666.42 MB/sec
 Timing buffered disk reads: 6100 MB in  3.00 seconds = 2033.12 MB/sec

mn        : Samsung SSD 990 PRO 4TB                 => on expansion bay
/dev/nvme2n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3815447/64/32, sectors = 7814037168, start = 0
 Timing O_DIRECT cached reads:   5952 MB in  2.00 seconds = 2980.18 MB/sec
 Timing O_DIRECT disk reads: 6910 MB in  3.00 seconds = 2302.84 MB/sec
 Timing cached reads:   62934 MB in  1.98 seconds = 31788.59 MB/sec
 Timing buffered disk reads: 4134 MB in  3.00 seconds = 1377.57 MB/sec

(For the desktop I connect via SSH; the FW16 I use directly, with many apps open :wink: )

Installing Fedora 43 Beta on my DGX Spark (on an external SSD for now). Let’s see if I’ll be able to make the CUDA part work there :slight_smile:

Having said that, for most users it just doesn’t make sense, as stock Ubuntu is officially supported and works out of the box. But I’ve got to try, lol.

		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
		LnkSta:	Speed 16GT/s, Width x4
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
		LnkSta:	Speed 16GT/s, Width x2 (downgraded)
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
		LnkSta:	Speed 16GT/s, Width x4

Looks like one of my SSDs in the expansion bay only uses 2 lanes… that explains why it is slower than the other. I still have to understand why…
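(For reference, the link capability/status above comes from something like:)

# negotiated vs. maximum PCIe link width/speed for NVMe controllers (class 0108)
sudo lspci -vv -d ::0108 | grep -E 'LnkCap:|LnkSta:'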

Good luck installing Fedora.

On an “old” Zen 3 (AMD Ryzen 9 5950X):

=> PCIe v3.0
mn        : Samsung SSD 970 PRO 1TB                 
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 976762/64/32, sectors = 2000409264, start = 0
 Timing cached reads:   68860 MB in  1.98 seconds = 34806.92 MB/sec
 Timing buffered disk reads: 7656 MB in  3.00 seconds = 2551.63 MB/sec
 Timing O_DIRECT cached reads:   4084 MB in  2.00 seconds = 2042.77 MB/sec
 Timing O_DIRECT disk reads: 7062 MB in  3.00 seconds = 2353.74 MB/sec

mn        : Samsung SSD 980 1TB                     
 readonly      =  0 (off)
 readahead     = 512 (on)
 geometry      = 953869/64/32, sectors = 1953525168, start = 0
 Timing cached reads:   69306 MB in  1.98 seconds = 35034.47 MB/sec
 Timing buffered disk reads: 2284 MB in  3.00 seconds = 760.97 MB/sec
 Timing O_DIRECT cached reads:   5278 MB in  2.00 seconds = 2641.33 MB/sec
 Timing O_DIRECT disk reads: 3430 MB in  3.00 seconds = 1143.13 MB/sec


philou@serveur:~$ fastfetch
             .',;::::;,'.                 philou@serveur
         .';:cccccccccccc:;,.             --------------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 42 (Workstation Edition) x86_64
    .:cccccccccccccccccccccccccc:.        Kernel: Linux 6.16.10-200.fc42.x86_64
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Uptime: 7 mins
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Packages: 2457 (rpm)
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Shell: bash 5.2.37
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Terminal: /dev/pts/0
:cccccccccccccc;MMM.;cccccccccccccccc:    CPU: AMD Ryzen 9 5950X (32) @ 5.09 GHz
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    GPU: AMD Radeon RX 6900 XT [Discrete]
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    Memory: 3.87 GiB / 125.69 GiB (3%)
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    Swap: 0 B / 63.51 GiB (0%)
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Disk (/): 388.71 GiB / 952.28 GiB (41%) - btrfs
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Disk (/stock/data): 1.19 TiB / 10.92 TiB (11%) - btrfs
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Disk (/tmp): 5.62 GiB / 868.00 GiB (1%) - f2fs
cccccccc;.:odl:.;cccccccccccccc:,.        Local IP (enp5s0): 192.168.1.104/24
ccccccccccccccccccccccccccccc:'.          Locale: fr_FR.UTF-8
:ccccccccccccccccccccccc:;,..
 ':cccccccccccccccc::;,.                                          
                                                             

Two different readahead values…???

Well, well, well, have a look at that:

eugr@spark:~$ fastfetch
             .',;::::;,'.                 eugr@spark
         .';:cccccccccccc:;,.             ----------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 43 (KDE Plasma Desktop Edition) aarch64
    .:cccccccccccccccccccccccccc:.        Host: NVIDIA_DGX_Spark (A.7)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.1-300.fc43.aarch64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 22 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 2421 (rpm)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.3.0
:cccccccccccccc;MMM.;cccccccccccccccc:    Display (Unknown-1): 800x600 @ 60 Hz in 10"
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    DE: KDE Plasma 6.4.5
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    WM: KWin (Wayland)
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    WM Theme: Breeze
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     Theme: Breeze (Light) [Qt], Breeze [GTK2/3]
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Icons: Breeze [Qt], breeze [GTK2/3/4]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Font: Noto Sans (10pt) [Qt], Noto Sans (10pt) [GTK2/3/4]
cccccccc;.:odl:.;cccccccccccccc:,.        Cursor: Breeze (24px)
ccccccccccccccccccccccccccccc:'.          Terminal: /dev/pts/4
:ccccccccccccccccccccccc:;,..             CPU: Cortex-A725*5 + Cortex-X925*5 + Cortex-A725*5 + Cortex-X925*5 (20) @ 3.90 GHz
 ':cccccccccccccccc::;,.                  GPU: NVIDIA Device 2E12 (VGA compatible)
                                          Memory: 4.37 GiB / 119.69 GiB (4%)
                                          Swap: 0 B / 8.00 GiB (0%)
                                          Disk (/): 20.17 GiB / 538.30 GiB (4%) - btrfs
                                          Local IP (enP7s7): 192.168.24.104/24
                                          Locale: en_US.UTF-8

                                                                  
eugr@spark:~$ nvidia-smi
Thu Oct 23 13:07:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    Off |   0000000F:01:00.0 Off |                  N/A |
| N/A   38C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Getting somewhere!

eugr@spark:~/llama.cpp$ build/bin/llama-cli --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GB10 (122558 MiB, 117541 MiB free)

lstopo lstopo.png => for the Framework Desktop!

(It did not report the GPU…)

./build_ref/rocm/bin/llama-cli --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
  ROCm0: Radeon 8060S Graphics (63282 MiB, 63280 MiB free)

Did you configure GTT, or is the default “VRAM” being reported? (In my case, the default GTT…)

No, I didn’t configure anything, it just works. But I’m getting slower performance than on stock DGX OS:

eugr@spark:~/llama.cpp$ build/bin/llama-bench -m /run/media/eugr/root/home/eugr/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1822.91 ± 6.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 41.45 ± 0.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1710.41 ± 4.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 37.55 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1575.44 ± 15.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 36.15 ± 0.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1373.54 ± 3.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 34.05 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1072.37 ± 9.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 30.53 ± 0.03 |

build: 0bf47a1db (6829)

This is Spark: