Building a Two-Node AMD Strix Halo Cluster for LLMs with llama.cpp RPC (MiniMax-M2 & GLM 4.6)

I finally did it! I built a two-node llama.cpp cluster with my Framework Desktop and HP G1a Mini: 256GB of unified memory to host large models like the new MiniMax-M2 and GLM 4.6!

In this video I go through the setup (incredibly easy) and the performance evaluation.
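For anyone who wants to replicate it without watching the whole video, the two-node llama.cpp RPC setup looks roughly like this. This is a minimal sketch, not the exact commands from the video: the Vulkan backend flag, IP address, port, and model path are all placeholders/assumptions, and flags can differ slightly between llama.cpp versions.

```bash
# Worker node (e.g. the HP G1a): build llama.cpp with the RPC backend enabled.
# The Vulkan flag is an assumption -- use whichever GPU backend you built for Strix Halo.
cmake -B build -DGGML_RPC=ON -DGGML_VULKAN=ON
cmake --build build --config Release

# Expose this node's backend over the network (host/port are placeholders).
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# Main node (e.g. the Framework Desktop): point llama.cpp at the worker.
# 192.168.1.20 and the model file are placeholders.
./build/bin/llama-server -m ./models/MiniMax-M2-Q4_K_M.gguf \
  --rpc 192.168.1.20:50052 -ngl 99
```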


If only we could take the time to add tensor parallelism to llama.cpp for multi-device configs…
To minimize network traffic, it would be necessary to merge two matmuls, which means modifying the graph in a backend, and I'm not sure what the right way to do that is.

For now I'm experimenting with CPU and GPU using the BF16 matmul backend…


Have you tried using Thunderbolt networking in place of the RJ45 port?

Will try - I didn’t do it for the video because I erroneously thought the HP mini did not have that. I now checked and it does, so I will try and see what the difference is, and update with the performance results.


For now, with llama.cpp's layer split, it may only make a difference in loading time. But it would be nice to know what we get.

I have to ask about the noise from the HP. Is it as loud as the reviews suggest? Or have they been a bit harsh on the fans at full spin?

The HP can get very noisy, so it's tucked away in a closet. The Framework is so quiet that it sits in my living room as a piece of decor!


When deliveries were uncertain, I nearly bought one. Then I read the reviews about the noise. Thanks for confirming … I feel better for having waited for the FW.