I finally did it! I built a Llama.cpp two-node cluster with my Framework Desktop and HP G1a Mini: 256GB of Unified Memory to host large models like the new MiniMax-M2 and GLM 4.6!
In this video I go through the setup (incredibly easy) and the performance evaluation.
If only we could take the time to add tensor parallelism to llama.cpp for multi-device configs…
To minimize network traffic, it would be necessary to merge two matmuls, which means modifying the graph in a backend, and I'm not sure what the right way to do that is.
For now I'm experimenting with CPU and GPU using a BF16 matmul backend…
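To give a rough sense of why merging matmuls cuts network traffic, here is a hypothetical NumPy sketch (not llama.cpp code) of the classic tensor-parallel MLP split: pairing a row-split first matmul with a column-split second matmul lets each device run the whole MLP on its shard independently, so the only cross-device communication is one final sum instead of gathering the intermediate activation between the two matmuls.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16                      # model dim, hidden dim (toy sizes)
x  = rng.standard_normal(d)
W1 = rng.standard_normal((h, d))  # up projection
W2 = rng.standard_normal((d, h))  # down projection
relu = lambda v: np.maximum(v, 0)

# Reference: single-device forward pass of a simple MLP
y_ref = W2 @ relu(W1 @ x)

# Two "devices": W1 split along its output rows, W2 along its input columns
W1a, W1b = W1[:h // 2], W1[h // 2:]
W2a, W2b = W2[:, :h // 2], W2[:, h // 2:]

# Each device computes its partial result with no communication in between;
# the single sum below is the only step that would cross the network
y = (W2a @ relu(W1a @ x)) + (W2b @ relu(W1b @ x))

assert np.allclose(y, y_ref)  # matches the unsplit computation
```

With the naive split, each matmul would need its own synchronization point; fusing the pair into one graph node is what removes the intermediate exchange.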
Will try - I didn’t do it for the video because I erroneously thought the HP mini didn’t have that. I’ve now checked and it does, so I’ll try it, see what the difference is, and update with the performance results.
When deliveries were uncertain I nearly bought one. Then I read the reviews about the noise. Thanks for confirming… I feel better about having waited for the FW.