Building a Two-Node AMD Strix Halo Cluster for LLMs with llama.cpp RPC (MiniMax-M2 & GLM 4.6)

I finally did it! I built a llama.cpp two-node cluster with my Framework Desktop and HP G1a Mini: 256GB of unified memory to host large models like the new MiniMax-M2 and GLM 4.6!

In this video I go through the setup (incredibly easy) and the performance evaluation.
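For anyone who wants the short version before watching, the basic shape of the setup is roughly this. Paths, IP addresses and the model filename are placeholders, flag spellings can vary a bit between llama.cpp versions, and the GPU backend flag depends on whether you build for Vulkan or ROCm:

```bash
# Build llama.cpp with the RPC backend on both machines
# (add -DGGML_VULKAN=ON or -DGGML_HIP=ON to match your GPU backend)
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# On the secondary node (the HP G1a here), expose its memory/GPU over the network
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the primary node (the Framework Desktop), point llama-server at the remote node
./build/bin/llama-server \
  -m ./models/MiniMax-M2-Q4_K_M.gguf \
  --rpc 192.168.1.50:50052 \
  -ngl 99
```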

9 Likes

If only we could take the time to add tensor parallelism to llama.cpp for multi-device configs…
To minimize network traffic, it would be necessary to merge two matmuls, which means modifying the graph in a backend, and I’m not sure what the right way to do that is.
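To make the “merge two matmuls” idea concrete, here is the standard column-then-row split; this is my assumption about what is meant, not something llama.cpp does today:

$$
Y = XW,\qquad W = \begin{bmatrix} W_1 & W_2 \end{bmatrix}
\;\Rightarrow\; Y = \begin{bmatrix} XW_1 & XW_2 \end{bmatrix}
$$

$$
Z = YV,\qquad V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\;\Rightarrow\; Z = XW_1V_1 + XW_2V_2
$$

Each device keeps its half of W and V and its slice of Y locally, so the only traffic between the two matmuls is one all-reduce over the partial sums of Z, instead of shipping the full intermediate Y across the network.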

For now I’m experimenting with CPU and GPU with a BF16 matmul backend…

1 Like

Have you tried using Thunderbolt networking in place of the RJ45 port?

Will try - I didn’t do it for the video because I erroneously thought the HP mini didn’t have it. I’ve now checked and it does, so I’ll try it, see what the difference is, and update with the performance results.
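For reference, the Linux side of Thunderbolt point-to-point networking is only a couple of commands. The interface name and addresses below are assumptions, so check `ip link` on your own kernel first:

```bash
# Load the Thunderbolt networking driver on both machines
sudo modprobe thunderbolt-net

# With the cable connected, a thunderbolt0 interface should appear
ip link

# Give each end a static address on a private subnet
# On the Framework Desktop:
sudo ip addr add 10.0.0.1/24 dev thunderbolt0
sudo ip link set thunderbolt0 up
# On the HP G1a:
sudo ip addr add 10.0.0.2/24 dev thunderbolt0
sudo ip link set thunderbolt0 up
```

Then rpc-server and the --rpc flag just get pointed at the 10.0.0.x addresses instead of the Ethernet ones.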

1 Like

For now, with llama.cpp’s layer split, it may only make a difference in loading time. But it’s nice to know what we get.

I have to ask about the noise from the HP. Is it as loud as the reviews suggest? Or have they been a bit harsh on the fans at full spin?

The HP can get very noisy, so it’s tucked away in a closet. The Framework is so quiet that it sits in my living room as a piece of decor!

1 Like

When deliveries were uncertain I nearly bought one. Then read the reviews about the noise. Thanks for confirming … I feel better for having waited for the FW

Did you ever get this working?

I just ordered my second Strix Halo PC and a pair of USB4/TB3 40Gbps cables. Looking forward to experimenting with this setup with large models like GLM 4.7 q4.

Let me know how it goes. I know that some work went into llama.cpp RPC recently. I wasn’t able to make it perform much better since latency is more of an issue, but I’d love to see some performance improvements.

Just ordered a 2nd 395 128GB Desktop, and am also going to go the Thunderbolt networking route (in part because I need the Ethernet port to connect it to my network).

I’m not expecting a significant difference with TB off the bat, but I do hold out hope that the llama.cpp team are working on the RPC parallelism.

I’m in the process of building a 2-node FW desktop setup using a 2-in-1 rack mount chassis. It’s going to be interesting to see if I can get the cooling sorted, but it’s destined for a rack in the garage, so server fans with high CFM and static pressure values are fully on the table.

The current idea is to use the x4 slot with a riser to a 2x SFP28 NIC (not a fan of the idea of point-to-point TB networking), but that stage depends on whether I can get the two nodes stable in the case I’m targeting before adding in a NIC with its accompanying cable routing issues, heat generation, reduced airflow, etc.

Given the findings from kyuz0 and others that latency is the killer for clustering these units, I wish the Mellanox InfiniBand NICs were more available / cheap enough to get some to test with. :frowning:

1 Like

I haven’t seen real work on that, but we need it… I’m just not sure what to do.
I don’t think RPC is good for tensor parallelism; we may need to use MPI or another library that is made for that.

For now I’m playing with backends… but I don’t know yet whether we can get good speed with backend work alone…

I have two 395 128GB machines running llama.cpp via RPC. I started with Ethernet and then switched to Thunderbolt networking. I honestly didn’t notice any difference between the two. There might be some slight improvement in the initial model loading time (I’m running GLM 4.6 most of the time), but as far as actually using the model goes, it doesn’t look like RPC uses more than 20 MB/s of bandwidth, so Thunderbolt is definitely overkill for it.
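If you want to sanity-check that the link is latency-bound rather than bandwidth-bound, something like this works; the IP and interface name are placeholders, and iperf3 is assumed to be installed on both nodes:

```bash
# Round-trip latency on the link llama.cpp RPC is using
ping -c 20 10.0.0.2

# Raw link bandwidth (run the server on one node, the client on the other)
iperf3 -s
iperf3 -c 10.0.0.2 -t 10

# Bytes actually moved during generation: sample the interface counters twice
cat /sys/class/net/thunderbolt0/statistics/rx_bytes
sleep 10
cat /sys/class/net/thunderbolt0/statistics/rx_bytes
```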

1 Like

Indeed, this is exactly what I found, as the issue is latency. But there are alternatives that I will hopefully be testing next month which might provide lower latency and fully leverage two Strix Halos.

1 Like

Would be interested to hear how that goes, as I’m currently looking to get a third one of these machines, and having something better than RPC to connect all of them would be awesome.

We really need to add tensor parallelism to llama.cpp…
There are some ideas in Feature Request: Tensor Parallelism support · Issue #9086 · ggml-org/llama.cpp · GitHub

But in my opinion, for best performance this needs ggml core changes and frontend changes.

I have played with an fp8 backend, bf16, an AOCL backend (with repacking), an RDNA3 WMMA matmul op and repacking, APU memory… For now I need more time with MPI… RDNA3 dual issue… and maybe the QK types…

But I may need to publish some work and get feedback to end up with something useful…

2 Likes