I was mostly curious. Even fairly ‘tech savvy’ people like Jeff Geerling seem to use the 5Gbps network card, when there are at least TWO 40Gbps USB4/Thunderbolt ports on the monster. I know that Linux, Windows, and even Macs support data transfer over a Thunderbolt cable. Any reason why no one has tried it yet? I will try it and report results.
Surprised me too, would have expected at least a small blurb why it didn’t work.
Did they only send one to Wendell (may explain why he didn’t try it)?
I use a 10Gbps fiber USB4/Thunderbolt network adapter and it works well (with a few Linux kernel patches).
So, it at least works at 10Gbps.
I don’t know of any USB4/Thunderbolt interfaces that do above 10Gbps.
We are talking about direct TB/USB4 networking, not a USB4 NIC.
As far as that goes, no one is stopping you from putting a 100Gbit NIC in an eGPU enclosure XD.
Probably because Thunderbolt cables are expensive, especially in any length beyond literal desktop use. So if my servers are across the room, there is just no way I am spending hundreds of dollars to connect them with Thunderbolt when I can get a reliable connection for a lot less money.
The point is for mini clusters where they are literally stacked next to each other. No one is proposing USB4 networking for long range, but to interlink nodes very close to each other at high bandwidth for cheap (short USB4 cables are quite affordable).
My guess is it's mostly due to reliability. It is relatively easy for too much pressure on a USB-C connector to break part of the connection… now add troubleshooting a cluster to the mix, and not being sure moment to moment if the connection is solid. If everything is laid out perfectly, and you have some means to relieve the stress on the connector and wire, sure… but it's still asking for headaches. Interested in seeing how this plays out though.
There could be a lot of reasons:
- USB4 networking is a daisy chain (unless I’ve missed an example of a more traditional “star” topology), therefore data from the first node to the last node is transferred through all of the other nodes. Seems like this would give you inconsistent transfer speeds (and latency) depending on how many nodes were transferring data and how many nodes the data has to pass through
- I’ve not seen any reports that anyone has been able to accomplish anywhere near 40gbps (I also don’t pay that close attention to this). I’ve seen some examples of 20gbps, but most seem to be in the low teens if using multiple nodes.
- I’d be curious to know, once models are loaded into the cluster, how much data is passed from one node to another during the actual execution of a task. Yes, the 5gbps would be a bottleneck getting a large model loaded into all the nodes, but once loaded (something you would only do upon reboot of the cluster), the data connection may not be as important?
- I’d be much more inclined to take advantage of the PCIe 4.0 x4 slot. That is theoretically 64 GT/s of signalling (roughly 8 GB/s, or ~63Gbps usable), meaning a 25Gbps or even a “neutered” 100Gbps connection between nodes is possible (rough arithmetic at the bottom of this post).
- Lastly, all of this is really more academic anyway; at what point are you better off just buying a Mac Studio? For just shy of $10K you can get the Mac Studio with 512GB of shared RAM and roughly 3x the memory bandwidth of the AMD 395+, which gets rid of all the networking bandwidth and multi-node complexity, and by the time you buy 4 Framework Desktop motherboards, storage, and networking gear, you’re already nearing $8K for a less capable cluster.
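On the PCIe point, the rough numbers work out like this (standard PCIe 4.0 figures, nothing measured on this hardware):

```python
# Back-of-the-envelope usable bandwidth of a PCIe 4.0 x4 slot.
GT_PER_LANE = 16        # PCIe 4.0 signalling rate per lane, GT/s
LANES = 4
ENCODING = 128 / 130    # 128b/130b line-encoding overhead

usable_gbit = GT_PER_LANE * LANES * ENCODING   # ~63 Gbit/s of payload
usable_gbyte = usable_gbit / 8                 # ~7.9 GB/s

print(f"~{usable_gbit:.0f} Gbit/s (~{usable_gbyte:.1f} GB/s)")
# A 25GbE NIC fits comfortably; a 100GbE NIC would be capped around 63 Gbit/s.
```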
In a 3 node cluster you could have a direct connection from every node to every other node.
20Gbit>>>5Gbit right?
If you are loading models too big to fit into a single device's memory, you have to transmit a lot of internal state between the nodes for each step, which is quite a lot of data and afaik the main limiting factor for such clustered setups.
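For a rough sense of scale, here is a hand-wavy estimate; the hidden size, layer count, and token rate below are made-up illustrative numbers, not measurements from any of these boxes:

```python
# Hand-wavy estimate of per-token traffic between two nodes sharing one model.
# Every number here is an assumption for illustration only.
hidden_size = 8192   # assumed activation width
n_layers = 80        # assumed transformer layer count
bytes_per_val = 2    # fp16/bf16 activations
tok_per_s = 20       # assumed generation rate

# Pipeline split: activations cross the node boundary once per token.
pipeline_bytes = hidden_size * bytes_per_val                 # ~16 KB per token
# Tensor-parallel split: activations are exchanged at roughly every layer.
tensor_bytes = hidden_size * bytes_per_val * n_layers        # ~1.3 MB per token

for name, b in [("pipeline split", pipeline_bytes), ("tensor-parallel split", tensor_bytes)]:
    print(f"{name}: ~{b * tok_per_s * 8 / 1e6:.1f} Mbit/s at {tok_per_s} tok/s")

# Prompt processing moves this amount per *prompt* token all at once, and
# tensor-parallel adds a synchronisation round trip per layer, so link
# latency matters as well as raw bandwidth.
```

So how much the link hurts depends a lot on how the model is split across the nodes.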
Probably the better option, especially to scale out past 3 nodes, but also significantly more expensive. Hell, even the cheapest setup I can currently imagine ($20 chinesium X540s) is almost double what two 0.5m USB4 cables would cost. And the USB4 route is also way less invasive.
That is a point, but as far as I can tell most people doing these clusters are doing them as an academic exercise, so Thomas and I are wondering why this was not tried. It is quite possible USB4 networking is just borked again rn or something, but that usually tends to get mentioned by Jeff and Wendell and co.
Jeff has a writeup on the website:
For networking, I expected more out of the Thunderbolt / USB4 ports, but could only get 10 Gbps. The built-in NICs are 5 Gbps and I had no problems reaching that speed over Ethernet. I’m hopeful drivers or Linux tweaks would be able to bump Thunderbolt node-to-node connectivity at least over 15 or 20 Gbps.
Someone linked a Thunderbolt networking config GitHub Gist there, and the comments under the Gist also mention low throughput, so it seems like this is a Linux/driver issue and not plug and play at all.
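I haven't looked at that exact Gist, but the usual recipe on Linux is just the thunderbolt-net module plus a static address on each end. A minimal sketch below (the addresses, MTU, and interface name are assumptions that vary per setup; it needs root, and thunderbolt0 only appears once the cable is actually connected):

```python
# Minimal sketch of bringing up Thunderbolt/USB4 networking on one Linux node.
# IP addresses and MTU here are assumptions; adjust for your own setup.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["modprobe", "thunderbolt-net"])                               # cable networking driver
run(["ip", "link", "set", "thunderbolt0", "up"])
run(["ip", "addr", "add", "10.99.0.1/24", "dev", "thunderbolt0"])  # use .2/24 on the peer
run(["ip", "link", "set", "thunderbolt0", "mtu", "9000"])          # a larger MTU often helps, if the driver accepts it

# Then test: iperf3 -s here, and iperf3 -c 10.99.0.1 from the other node.
```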
I find it interesting because the guy in the video was getting 20 gigabits. I wonder if it was just his crappy cables?
It’s just anecdata, but I do have a Proxmox cluster of Framework Laptop 13 mainboards with 20 gigabit networking over Thunderbolt. It performs spectacularly, until one of the devices suddenly drops from the bus and both need to be powered off for thirty minutes while they forget about each other.
(They’re 11th generation mainboards, so they predate Thunderbolt certification. This may be one of the reasons why.)
The guy from the video and @DHowett in the last post both have Intel, while the Framework Desktop is AMD; could also be that, we don’t know, yeah.
Simple answer: TB cables are expensive and heavy. So for any length of cable that costs hundreds of dollars, there is a good chance it will randomly pull out, because USB-C physical connections are about as firm as my 60 yr old gut is these days!
Seriously, I wish ‘they’ would come up with a better idea for the TB connectors, something like the DP video cable connector clips, where they are small but effective.
OWC has these ClingOn clips for exactly that purpose. Not a standard though, so only for their gear. But a very good idea.
Again, no one asked about long distance; the question was about clustering, where cheap passive 0.5m cables will do just fine and can be routed with minimal accidental-unplugging risk.
I doubt you’ll find two 25Gbit (or hell, even 10Gbit) NICs + DACs and/or cable for the price of a 0.5m USB4 cable.
Two-meter USB4 cables cost as little as 16 bucks from reputable brands. Thunderbolt cables are not expensive anymore
TBH, I wonder why we aren’t asking about fibre?
I connected two Strix Halo boxes° with cheap USB4 Thunderbolt 3 cables and I’m getting slightly over 9GBit/s in iperf3 on the thunderbolt0 and thunderbolt1 interface with Fedora 43 and kernel 6.18.3 (also simultaneously).
° not Framework desktop