Low-Latency Strix Halo Cluster with RDMA (RoCE/Intel E810) and vLLM on Framework Desktop Boards

I built a 2-node Strix Halo/Framework Desktop cluster using RDMA (RoCE over Intel E810 NICs) and vLLM tensor parallelism. It took a fair amount of digging into ROCm and RCCL to get low-latency multi-node inference working on gfx1151. The video walks through the setup, what I learnt, and how you can do something similar.
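
In case it helps anyone following along, here's a minimal sketch of what the vLLM side of a setup like this can look like. The interface/device names and the model are placeholders (substitute whatever `ip link` and `ibv_devices` report on your nodes), and it assumes a Ray cluster already spans both nodes; RCCL honours the same `NCCL_*` environment variables as NCCL.

```python
# Minimal sketch: 2-way tensor parallelism across two RoCE-connected nodes.
# All device/interface names and the model below are placeholder assumptions.
import os

# RCCL reads the usual NCCL_* variables; point it at the E810's RDMA device
# so traffic goes over RoCE instead of falling back to TCP sockets.
os.environ["NCCL_SOCKET_IFNAME"] = "enp1s0f0"  # E810 netdev (assumption)
os.environ["NCCL_IB_HCA"] = "irdma0"           # E810 RDMA device (assumption)
os.environ["NCCL_IB_GID_INDEX"] = "3"          # RoCE v2 GID is commonly index 3
os.environ["NCCL_IB_DISABLE"] = "0"            # keep RDMA enabled

from vllm import LLM, SamplingParams

# With Ray already running on both nodes (`ray start --head` on node 0,
# `ray start --address=<head-ip>:6379` on node 1), tensor_parallel_size=2
# shards each layer across the two gfx1151 GPUs.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",          # placeholder model
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)

out = llm.generate(["Hello from the Strix Halo cluster!"],
                   SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```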

Thank you, Framework, for providing the two boards and one of the cards!


Very nice to see this go live!

WTB: Framework support offering a “cut open my PCIe slot” service. I’d happily round-trip my boards back to FW for this, and pay for it too. It is (theoretically) the only thing holding back my version of your tests, except that mine use Mellanox cards so I can get valid contrasting numbers for RoCEv2 vs InfiniBand on the same hardware and software setup.


I mean, I would just go with an x16-to-x4 adapter! Somebody should make a consumer x4 PCIe network card that supports RDMA with low latency at ~10-20 Gbps; if it were reasonably priced, people would probably buy it! The issue is that we’re buying cards meant for datacenters, at prices that are not very consumer-friendly!

There are a few 2x25 Gbps cards with a PCIe 4.0 connection (the E810-XXVDA2?).

Another possibility would be to find a card that supports PCIe 4.0 NTB (non-transparent bridging), which uses a “direct” PCIe connection… but I can’t find cost/availability information.


@kyuz0 Maybe I wasn’t clear: my issue is specifically with the combination of Mellanox cards and the Framework motherboard.

Regarding the Intel cards: I have multiple pairs of them, both the 1-port E810-CQDA1 and the 2-port E810-CQDA2, and they all perform just as they should. The cards (which are PCIe 4.0 x16) happily negotiate down to PCIe 4.0 x4 in the Framework Desktop, and I see the expected >50 Gbps throughput and sub-5 µs latency. If the InfiniBand experiment fails, I’ll use these.

BUT with the Mellanox cards (which are also PCIe 4.0 x16), they point-blank refuse to negotiate stably down to PCIe 4.0 x4. Instead, they drop all the way down to PCIe 3.0 x4. At that point I see the expected 28 Gbps throughput and sub-1 µs latency (InfiniBand). That very low latency number is what has kept me exploring the problem.
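
For anyone else chasing this, a quick way to confirm what link a card actually negotiated is to read the attributes the kernel exposes in sysfs (the same information `lspci -vv` shows as LnkSta). A minimal sketch, with the PCI address as a placeholder:

```python
# Check the negotiated PCIe link for a NIC. The PCI address below is a
# placeholder; find yours with `lspci | grep -i ethernet`.
from pathlib import Path

dev = Path("/sys/bus/pci/devices/0000:01:00.0")  # placeholder address

for attr in ("current_link_speed", "current_link_width",
             "max_link_speed", "max_link_width"):
    print(f"{attr}: {(dev / attr).read_text().strip()}")

# A Gen4 card that negotiated properly in the x4 slot should report a
# current_link_speed of 16.0 GT/s (Gen4) at width 4; 8.0 GT/s means it
# fell back to Gen3, as the Mellanox cards do here.
```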

Apparently the MCX cards have very strict timing requirements and tax the PCIe slot a lot compared to the Intel cards. Internet pundits point to signal degradation in the path between the PCIe slot and the NIC, which isn’t just “card in slot” here because the x4 slot is close-ended, so an x16 card needs an adapter or riser in the path.

Anyway, typical solutions involve a rigid adapter (as you suggested), short PCIe riser cables, OCuLink, MCIO with redrivers, and of course powered risers. I’m slowly working my way through the solution matrix, but I can say that so far the short PCIe riser cables, OCuLink cards, and powered risers have not solved it.

Again, if I could just plug the damn card into the motherboard, this would have been a 15-minute quest, not a 2-week exercise. It has been instructive, though: I’ve learned about some more esoteric solutions like MCIO redrivers, and the ultrasonic knife is looking intriguing. I don’t know whether the Framework board is capable of negotiating Gen4 with Mellanox ConnectX-5 cards at all, which is weird, since the FWD handles the Intel cards just fine.

Ideally someone with a FWD and some CX5 cards would post “hey, this combination worked” and I’d just copy them. Until then, I keep trying alternatives (all hail the Amazon return policy!).


I thought the ConnectX-5 was only Gen 3.
Don’t you need a ConnectX-6 to get Gen 4?

(NVIDIA Mellanox ConnectX-6 Lx EN adapter card MCX631102AN-ADAT, dual-port SFP28 25GbE, PCIe 4.0 x8, full-height bracket - FS.com Europe)

Or did I miss something?
:crossed_fingers:

ConnectX-5 cards come in many flavors; the “Ex” models are PCIe 4.0. The ones I’m testing are model # MCX556A-EDAT, which NVIDIA’s specs say is both “Ex” (PCIe 4.0 x16) and “VPI” (ports configurable as IB or ETH).

I realized that I’m missing a spot in the test matrix: testing the Intel and MCX5 cards in an entirely unrelated PC (not Strix Halo), one with an actual PCIe 4.0 x16 slot. Maybe my cards are somehow … just not PCIe 4.0, despite the model #? For completeness I have to test this. Ugh.

Hope you have some success!
:crossed_fingers: