Building a Two-Node AMD Strix Halo Cluster for LLMs with llama.cpp RPC (MiniMax-M2 & GLM 4.6)

I finally did it! I built a llama.cpp two-node cluster with my Framework Desktop and HP G1a Mini: 256GB of unified memory to host large models like the new MiniMax-M2 and GLM 4.6!

In this video I go through the setup (incredibly easy) and the performance evaluation.
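For anyone who wants the short version before watching, the basic shape of the setup is roughly this. Paths, IP addresses and the model filename are placeholders, flag spellings can vary a bit between llama.cpp versions, and the GPU backend flag depends on whether you build for Vulkan or ROCm:

```bash
# Build llama.cpp with the RPC backend on both machines
# (add -DGGML_VULKAN=ON or -DGGML_HIP=ON to match your GPU backend)
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# On the secondary node (the HP G1a here), expose its memory/GPU over the network
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the primary node (the Framework Desktop), point llama-server at the remote node
./build/bin/llama-server \
  -m ./models/MiniMax-M2-Q4_K_M.gguf \
  --rpc 192.168.1.50:50052 \
  -ngl 99
```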

9 Likes

If only we could take the time to add tensor parallelism to llama.cpp for multi-device configs…
To minimize network traffic, it would be necessary to merge two matmuls, which means modifying the graph in a backend, and I’m not sure what the right way to do that is.
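To make the “merge two matmuls” idea concrete, here is the standard column-then-row split; this is my assumption about what is meant, not something llama.cpp does today:

$$
Y = XW,\qquad W = \begin{bmatrix} W_1 & W_2 \end{bmatrix}
\;\Rightarrow\; Y = \begin{bmatrix} XW_1 & XW_2 \end{bmatrix}
$$

$$
Z = YV,\qquad V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\;\Rightarrow\; Z = XW_1V_1 + XW_2V_2
$$

Each device keeps its half of W and V and its slice of Y locally, so the only traffic between the two matmuls is one all-reduce over the partial sums of Z, instead of shipping the full intermediate Y across the network.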

For now I’m experimenting with CPU and GPU with a BF16 matmul backend…

1 Like

Have you tried using Thunderbolt networking in place of the RJ45 port?

Will try - I didn’t do it for the video because I erroneously thought the HP mini didn’t have it. I’ve now checked and it does, so I’ll try it, see what the difference is, and update with the performance results.
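For reference, the Linux side of Thunderbolt point-to-point networking is only a couple of commands. The interface name and addresses below are assumptions, so check `ip link` on your own kernel first:

```bash
# Load the Thunderbolt networking driver on both machines
sudo modprobe thunderbolt-net

# With the cable connected, a thunderbolt0 interface should appear
ip link

# Give each end a static address on a private subnet
# On the Framework Desktop:
sudo ip addr add 10.0.0.1/24 dev thunderbolt0
sudo ip link set thunderbolt0 up
# On the HP G1a:
sudo ip addr add 10.0.0.2/24 dev thunderbolt0
sudo ip link set thunderbolt0 up
```

Then rpc-server and the --rpc flag just get pointed at the 10.0.0.x addresses instead of the Ethernet ones.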

1 Like

For now, with llama.cpp’s layer split, it may only make a difference in loading time. But it’s nice to know what we get.

I have to ask about the noise from the HP. Is it as loud as the reviews suggest? Or have they been a bit harsh on the fans at full spin?

The HP can get very noisy, so it’s tucked away in a closet. The Framework is so quiet that it sits in my living room as a piece of decor!

1 Like

When deliveries were uncertain I nearly bought one. Then read the reviews about the noise. Thanks for confirming … I feel better for having waited for the FW

Did you ever get this working?

I just ordered my second Strix Halo PC and a pair of USB4/TB3 40Gbps cables. Looking forward to experimenting with this setup with large models like GLM 4.7 q4.

Let me know how it goes. I know that some work went into llama.cpp RPC recently. I wasn’t able to make it perform much better since latency is more of an issue, but I’d love to see some performance improvements.

Just ordered a 2nd 395 128GB Desktop, and am also going to go the Thunderbolt networking route (in part because I need the Ethernet port to connect it to my network).

I’m not expecting a significant difference with TB off the bat, but I do hold out hope that the llama.cpp team are working on the RPC parallelism.

I’m in the process of building a 2-node FW desktop setup using a 2-in-1 rack mount chassis. It’s going to be interesting to see if I can get the cooling sorted, but it’s destined for a rack in the garage, so server fans with high CFM and static pressure values are fully on the table.

The current idea is to use the x4 slot with a riser to a 2x SFP28 NIC (not a fan of the idea of point-to-point TB networking), but that stage depends on whether I can get the two nodes stable in the case I’m targeting before adding in a NIC with its accompanying cable routing issues, heat generation, reduced airflow, etc.

Given the findings from kyuz0 and others that latency is the killer for clustering these units, I wish the Mellanox InfiniBand NICs were more available / cheap enough to get some to test with. :frowning:

1 Like

I haven’t seen real work on that, but we need it… I’m just not sure what to do.
I don’t think RPC is good for tensor parallelism; we may need to use MPI or another library that is made for that.

For now I’m playing with backends… but I don’t know yet whether we can get good speed with backend work alone…

I have two 395 128GB machines running llama.cpp via RPC. I started with Ethernet and then switched to Thunderbolt networking. I honestly didn’t notice any difference between the two. There might be some slight improvement in the initial model loading time (I’m running GLM 4.6 most of the time), but as far as actually using the model goes, it doesn’t look like RPC uses more than 20 MB/s of bandwidth, so Thunderbolt is definitely overkill for it.
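If you want to sanity-check that the link is latency-bound rather than bandwidth-bound, something like this works; the IP and interface name are placeholders, and iperf3 is assumed to be installed on both nodes:

```bash
# Round-trip latency on the link llama.cpp RPC is using
ping -c 20 10.0.0.2

# Raw link bandwidth (run the server on one node, the client on the other)
iperf3 -s
iperf3 -c 10.0.0.2 -t 10

# Bytes actually moved during generation: sample the interface counters twice
cat /sys/class/net/thunderbolt0/statistics/rx_bytes
sleep 10
cat /sys/class/net/thunderbolt0/statistics/rx_bytes
```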

1 Like

Indeed, this is exactly what I found, as the issue is latency. But there are alternatives that I will hopefully be testing next month which might provide lower latency and fully leverage two Strix Halos.

1 Like

Would be interested to hear how that goes, as I’m currently looking to get a third one of these machines, and having something better than RPC to connect all of them would be awesome.

We really need to add tensor parallelism to llama.cpp…
There are some ideas in Feature Request: Tensor Parallelism support · Issue #9086 · ggml-org/llama.cpp · GitHub

But in my opinion, for best performance this needs ggml core changes and frontend changes.

I have played with an fp8 backend, bf16, an AOCL backend (with repacking), an RDNA3 WMMA matmul op and repacking, APU memory… For now I need more time with MPI… RDNA3 dual issue… and maybe the QK types…

But I may need to publish some work and get feedback to end up with something useful…

2 Likes