What PCIE network card for 40+ GB/s

Hi guys, I’m looking for suggestions on how I can cluster multiple boards for 40+ Gbps network speed. I’ve heard of InfiniBand and older network cards, but I see they are PCIe 3.0 x16. Are they compatible?

What do you think is a good solution? I would appreciate some models/links

Thanks

Something like this: https://www.fs.com/fr/products/147578.html
(there are some others…)
It is a dual 25Gb SFP28 card that has a PCIe 4.0 interface…

but:

  • it needs aggregation across the 2 ports to get more than 40+ Gbps
  • we need to find a way to connect it to the PCIe x4 slot (I don’t know if the available power is sufficient; it is not for most GPU cards)
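As a sanity check on the lane math (my own back-of-the-envelope numbers, not from a datasheet): PCIe 4.0 runs 16 GT/s per lane with 128b/130b encoding, so a x4 link tops out around 63 Gbps, which should in theory be enough for a dual 25G card at full aggregate rate:

```shell
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding
lanes=4
gt_per_lane=16
usable=$(( lanes * gt_per_lane * 128 / 130 ))   # usable Gbps after encoding overhead
echo "PCIe 4.0 x${lanes}: ~${usable} Gbps usable"   # dual 25G needs 50 Gbps aggregate
```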

If you are looking for a 40 Gbps NIC you will need to look at options from Mellanox (now Nvidia), Xilinx, or Intel. I am partial to the old Mellanox ConnectX brand as I used them in old high-performance compute clusters. I can’t imagine how expensive they would be these days. Not sure if there are less expensive brands.

I keep looking around for specifics but nobody seems to have tried something that works. Generalities do not help. So, does anyone have a working solution for at least 20+ Gbps?

Let’s look for a concrete solution. I know the generalities myself; I’ve looked at all the YouTube videos and forums.

Looking forward to someone that knows :).

Doing more research, I found it confusing at least:

  1. Can we connect the USB-C ports with 40-gig-compatible USB-C cables? –> no network card, just the normal USB-C ports, and configure Linux to make it work?

  2. Buy a PCIe 4.0 card like this one for both motherboards in the cluster? USB4 PCIe Gen4 Card|Motherboards|ASUS Global
    (1 for each motherboard in the cluster, connect through USB-C and then get 20+ Gbps? –> is this necessary if we already have the USB-C ports?)

  3. An M.2-to-Thunderbolt adapter for both motherboards –> then connect them together?

I’m amazed no one has shared a cluster yet (I ordered 2 motherboards and am waiting for them, so I am preparing to cluster them for 20+ Gbps speeds, but no one clearly shows or explains how to do it).

Come on guys :smiley: . Let’s come up with something.

Yes, you can. You need a Thunderbolt 4/5 cable, but Linux has support for Thunderbolt networking.
You don’t need a separate adapter. The USB4 ports on the back also let you use this networking stack. In theory you shouldn’t need to do anything besides connecting the two systems with a Thunderbolt cable (at least for me it was plug and play).
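If you want to confirm Thunderbolt networking is actually up before going further, this is roughly what I’d check (module and interface names are the usual ones on recent kernels, but may differ on your distro):

```shell
# The driver is usually auto-loaded when the cable is plugged in
sudo modprobe thunderbolt-net

# A thunderbolt0 interface should appear once both ends are connected
ip link show | grep -i thunderbolt

# If boltctl is installed, it lists the connected peer and its authorization state
boltctl list
```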

I did a lot of research on this. I think the main restrictions (not to say they can’t be overcome) are

  1. Cable quality – you need an ACTIVE cable
  2. Network instability – you need to be REALLY careful, since TB can cut the power and doesn’t work well coming out of suspend
  3. It’s kind of oldish (it’s a feature of TB3)

Having said that, once I have another machine that can do it, I think it might be fun. I would recommend using IPv6. There was a person who did it with Proxmox VE and found that IPv6 was the most stable.
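Before any bonding or static addressing, a quick first connectivity test over the Thunderbolt link can use IPv6 link-local addresses, which need zero configuration (the interface name thunderbolt0 is assumed here):

```shell
# Ping the all-nodes multicast group on the interface; the peer should answer
ping -6 -c 3 ff02::1%thunderbolt0

# Show our own link-local address (the peer can ping it back with the %<iface> syntax)
ip -6 addr show dev thunderbolt0 scope link
```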

Thanks. Finally something actionable.

Got 2 boards with 600 W Platinum PSUs.

Do you have a recommendation for a specific active thunderbolt cable?

Network wise, I got a 1gbps internet connection.

Looking forward as I want to do a proper cluster.

I am rather lucky since I work at a place that has TONS of old CalDigit docks. I use the 3-foot cable from CalDigit. I suspect any old TB3 cables from Apple (not sure about this) might work too. DO NOT use a charging cable or generic ones. The important thing is to see the Thunderbolt “bolt” logo and have something reasonable (Intel certified). Please note that I haven’t actually done this yet! Here is a product Google recommended (based on a query I did). I suspect a search on eBay for old active TB4 (not sure about TB3!) cables might be worth a try if you are strapped for cash (which I suspect you probably aren’t since you have a Framework!). The logo is the most important thing; it means it’s certified (keep it short too, long = instability). People forget that 40 Gbps is a PCIe interface in a cable!


I use this for connecting my Minisforum PCs as well as Framework Desktop:

And yes, @Thomas_Munn is right, I didn’t mention that it should be an active cable in my original response, my bad. I kind of forgot that charging-only cables exist, to be honest

Though, from my experience so far, using llama.cpp + RPC + Thunderbolt networking didn’t really give me much of an improvement over a gigabit Ethernet connection.

For now I have not seen Thunderbolt speed on the AI Max above 10 Gb/s… I don’t know if we can really get 40 Gb/s for data (or is that only available for display?)

Also, llama.cpp does not need a high-speed network for RPC. If you have 2 PCs, the layers are split across the two machines, resulting in minimal data transfer. There’s no tensor-parallel processing, only a serial connection of the layers. The biggest transfer, if I understand correctly, is sending the graph to be calculated.

To truly leverage both machines, the calculations need to be parallelized, not just distributed across the layers.
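For reference, the layer-split RPC setup being discussed looks roughly like this with llama.cpp’s rpc-server and the --rpc flag (binary names and flags depend on your build; the addresses and model path are placeholders):

```shell
# On machine B (the worker): expose its compute backend over TCP
./rpc-server --host 0.0.0.0 --port 50052

# On machine A: run inference, offloading layers to the remote backend too
./llama-cli -m model.gguf --rpc 192.168.1.2:50052 -ngl 99 -p "Hello"
```

Since mostly layer activations cross the wire at each step, link bandwidth matters much less here than latency, which would match the earlier observation that Thunderbolt barely beat gigabit Ethernet.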

Now for a high-speed network, the best I can find is to use a dual 25 Gb/s network (i.e. 50 Gb/s with aggregation).

The question is whether it can work.
Dual 25 Gb/s NICs exist from Intel (E810) or NVIDIA. They use a PCIe 4.0 x8 link, but it looks like they can run at full speed on PCIe 3.0 x8 or PCIe 4.0 x4… To use one we would need a PCIe x4 to PCIe x8 riser cable… and test whether it works. Some tests with 16x GPU setups suggest it did not work because of too little power from the motherboard (if I understand correctly)… Now, a standard PCIe x4 slot has a 25 W spec, and the Intel E810 looks like it draws ~20 W… so it may work…
Next we need to figure out how to configure the card for aggregation/RDMA… For now I am not sure what we need to do…
What we need is someone to test it… and a Linux network expert to configure it… if possible…

And in the end, for llama.cpp, we would need to add tensor parallelism with MPI to use RDMA’s low-latency functions… :wink:

(Intel E810-XXVAM2 Ethernet network card, dual-port 25G SFP28, PCIe 4.0 x8, comparable to the Intel E810-XXVDA2, low profile and full height – FS.com Europe + 0.5 m (2 ft) Intel-compatible 25G SFP28 passive Twinax direct-attach copper cable – FS.com Europe + GLOTRENDS 100 mm PCIe 4.0 x4 riser cable for M.2, WiFi, FireWire, USB, sound cards, etc.: Amazon.fr: Computers ???)

Thanks! Phew, so now we’re looking at an active Thunderbolt cable. I found this one: Cable Matters [Intel Certified] 40Gbps Thunderbolt 4 Cable 0.3 m with 8K Video and 240 W Charging, which is Intel certified.

Well, money-wise I’m open to buying whatever needs to be bought to get 20+ Gbps xD. Price does not really matter. I’m in it for the technology.

@entropy4936 thanks man, I’ll get that one.

@Djip So one strategy that needs to be tested is to get a PCIe x4-to-x8 riser and a card like the E810.

So you basically need 2 of those cards and 2 of those risers?

You already have a 40 Gbps USB4 interface on the back of the thing.

Yeah, that’s what I was trying to say in my original message here, but I guess it didn’t come across properly - I’m not using any additional cards or adapters; I have the cable connecting directly to the USB4 port on the back.

yes, and 2 SFP28 cable connections :wink:

There are 2 of them :wink:
But can you share a Linux config for TCP over USB4 at 40 Gb/s? So far I have not seen TCP benchmark results above 10 Gb/s for this AMD APU.
What bandwidth and latency are you getting? Were you able to aggregate the two ports?

Thanks! I just bought a TB4 USB-C cable from Cable Matters. Will test and report back.

This is AI generated, but I actually have some networking knowledge, so I did guide it. Please let me know how well it works! I would recommend using IPv6 for the addressing since it uses SLAAC to configure itself. AI solution follows.


To get raw speed (40Gbps+) between two Strix Halos, you must set the IOMMU to “Passthrough” mode. This allows the USB4 controller to write directly to RAM without the CPU checking every permission bit.

Run on BOTH Strix Halos:

  1. Open Grub config:

    sudo nano /etc/default/grub
    
    
  2. Find GRUB_CMDLINE_LINUX_DEFAULT and append:

    iommu=pt amd_iommu=on
    
    
  3. Update Grub and Reboot:

    sudo update-grub  # or 'grub-mkconfig -o /boot/grub/grub.cfg' on Arch/Alpine
    sudo reboot
    
    
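After the reboot it’s worth verifying the flags actually took effect before blaming the network (simple checks, nothing exotic):

```shell
# The running kernel command line should now contain both flags
grep -o 'amd_iommu=on' /proc/cmdline
grep -o 'iommu=pt' /proc/cmdline

# The kernel log should report the IOMMU in passthrough mode
sudo dmesg | grep -i iommu
```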

2. The “Forgiving” Kernel Tuning

The Strix Halo is fast, but balance-rr is chaotic. You must tune the TCP stack to accept out-of-order IPv6 packets without panicking.

Run on BOTH computers:

# Allow massive packet reordering (Essential for Bond Mode 0)
sudo sysctl -w net.ipv4.tcp_reordering=127

# Use BBR (Better for high-throughput/variable-latency links)
sudo sysctl -w net.core.default_qdisc=fq
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

# Maximize Memory Buffers (Crucial for >25Gbps)
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_max=134217728
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

(Note: I doubled the buffer sizes from the previous guide because two Strix Halos can actually fill them.)
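One thing the guide above leaves out: sysctl -w settings are lost on reboot. A sketch for making the same tuning persistent (the filename is arbitrary):

```shell
# Write the tuning values to a sysctl drop-in so they survive reboots
sudo tee /etc/sysctl.d/99-tb-cluster.conf >/dev/null <<'EOF'
net.ipv4.tcp_reordering = 127
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
EOF

# Apply all sysctl config files now
sudo sysctl --system
```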


3. The IPv6-Only Bond Configuration

This creates a single aggregated pipe (bond0) that stripes packets across both cables.

Run on BOTH computers:

# 1. Reset links
sudo ip link set thunderbolt0 down
sudo ip link set thunderbolt1 down

# 2. Create Bond (Mode 0 = Round Robin Striping)
sudo ip link add bond0 type bond mode balance-rr miimon 100

# 3. Enable Jumbo Frames (Required for CPU efficiency)
sudo ip link set bond0 mtu 65528

# 4. Enslave physical ports
sudo ip link set thunderbolt0 master bond0
sudo ip link set thunderbolt1 master bond0

# 5. IPv6 only: simply leave bond0 without an IPv4 address
#    (Linux has no disable_ipv4 sysctl; skipping IPv4 configuration is enough)

# 6. Bring everything up
sudo ip link set thunderbolt0 up
sudo ip link set thunderbolt1 up
sudo ip link set bond0 up
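To confirm the bond actually came up with both slaves (again, just standard checks, nothing specific to the Strix Halo):

```shell
# Mode, MII status, and the list of enslaved interfaces
cat /proc/net/bonding/bond0

# The jumbo MTU should have propagated to the slaves
ip link show thunderbolt0 | grep -o 'mtu [0-9]*'
```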


4. Assigning Addresses (IPv6 ULA)

We use fd00:: (Unique Local Address) so it behaves like a static private LAN.

On Strix Halo A:

sudo ip -6 addr add fd00::1/64 dev bond0

On Strix Halo B:

sudo ip -6 addr add fd00::2/64 dev bond0
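With the addresses assigned, a minimal end-to-end check from Strix Halo A (fd00::2 being machine B as set above):

```shell
# Reachability over the bond
ping -6 -c 3 fd00::2

# Confirm the kernel routes this traffic via bond0
ip -6 route get fd00::2
```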

I’m waiting for my 2nd board for now… batch 18…
but if someone can test it, I’d be happy to know what you get with this config.

Do you have a “bench” command to use?
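On the bench question: iperf3 is the usual tool for this kind of link (addresses follow the ULA scheme from the guide above; parallel streams help saturate a 40 Gbps path):

```shell
# On machine B: start the server
iperf3 -s

# On machine A: 4 parallel TCP streams for 10 seconds
iperf3 -c fd00::2 -P 4 -t 10

# Rough latency numbers
ping -6 -c 100 -i 0.01 fd00::2
```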

Waiting for batch 18 too. Will report back when I finish the cluster
