I tried this on Fedora Server 43 but hit a couple small issues:
1. Trivial syntax issue with the miimon parameter, just needed to lose the =
2. Error adding the thunderbolt links to the bond as configured
For #2, apparently a balance-rr bond requires the same MAC address on both links, but the thunderbolt-net driver doesn’t allow changing the MAC address. AI suggested changing to an active-backup bond mode, which would halve the bandwidth, or bonding them at the TCP level with MPTCP (not sure if this is a workable option for internode comms).
How did you sidestep these issues? What performance were you able to measure?
I’m using a cheap cable myself (UGOURD USB4 0.3m long, with E-Marker chip according to the description).
The Ryzen 395 has two USB4 v1 (40Gbit/s) ports that are also capable of Thunderbolt 3 (also 40Gbit/s).
To me it seems that the reason we’re only getting 9Gbit/s or so via thunderbolt-net on Linux is the kernel layer and the ethernet-via-thunderbolt stack, not the cables.
I’m getting slightly over 9Gbit/s per port in iperf3 on Linux.
# Enable MPTCP
net.mptcp.enabled=1
# Use the in-kernel path manager (pm_type=0); subflow limits are set separately with "ip mptcp limits"
net.mptcp.pm_type=0
net.mptcp.allow_join_initial_addr_port=1
Apply with sudo sysctl --system.
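Before moving on it can help to confirm the setting actually took effect. The following is a minimal sketch, assuming the sysctl is exposed at /proc/sys/net/mptcp/enabled (the usual mapping for net.mptcp.enabled); the function name and the proc_root parameter are made up for illustration.

```python
from pathlib import Path


def mptcp_enabled(proc_root: Path = Path("/proc/sys")) -> bool:
    """Return True if net.mptcp.enabled reads as 1 under proc_root.

    proc_root is a hypothetical parameter so the check can be pointed
    at a fake directory tree when testing.
    """
    flag = proc_root / "net" / "mptcp" / "enabled"
    try:
        return flag.read_text().strip() == "1"
    except OSError:
        # File missing: kernel built without MPTCP, or not Linux.
        return False


if __name__ == "__main__":
    print("MPTCP enabled:", mptcp_enabled())
```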
3. Configure the Endpoints (The “Magic” Step)
By default, Linux might only use the interface the traffic started on. You must explicitly tell the kernel to advertise the second interface as an available path.
On BOTH machines (run via script on boot):
Bash
# Tell the MPTCP path manager about the second link
# (Adjust interface names/IPs based on which machine you are on)
sudo ip mptcp endpoint add fd00:2::1 dev thunderbolt1 signal
Note: The signal flag tells the other side “Hey, I also have this IP address available for data.”
How to Test (Performance Measurement)
Standard tools like scp do NOT support MPTCP yet. You must use tools that are MPTCP-aware or force them.
1. Install mptcpd:
Bash
sudo dnf install mptcpd
2. Run iperf3 with MPTCP: MPTCP requires the application to request IPPROTO_MPTCP instead of IPPROTO_TCP. mptcpize forces this for legacy apps.
Server (Halo A):
Bash
mptcpize run iperf3 -s
Client (Halo B):
Bash
mptcpize run iperf3 -c fd00:1::1
Expected Result: You should see the bandwidth sum up (~40Gbps+). If you check ip mptcp monitor during the test, you will see it join the subflows.
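To make the IPPROTO_MPTCP point concrete, here is a minimal sketch of what an MPTCP-aware application does differently from a legacy one. It is an illustration only, not what iperf3 or mptcpize do internally; the helper name open_stream_socket is hypothetical, and the numeric fallback 262 is the Linux value of IPPROTO_MPTCP for Python builds that do not export the constant.

```python
import socket

# socket.IPPROTO_MPTCP exists on Linux from Python 3.10; 262 is the Linux value.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket(family: int = socket.AF_INET):
    """Return (sock, used_mptcp).

    Requests an MPTCP socket first and falls back to plain TCP when the
    kernel refuses (MPTCP disabled, old kernel, or non-Linux system).
    """
    try:
        sock = socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP)
        return sock, True
    except OSError:
        sock = socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        return sock, False


if __name__ == "__main__":
    s, used_mptcp = open_stream_socket()
    print("MPTCP socket" if used_mptcp else "fell back to plain TCP")
    s.close()
```

Getting an MPTCP socket only means the connection may use extra subflows; whether it does depends on the endpoints and limits configured on both machines.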
@Thomas_Munn just to be clear - you aren’t saying that you actually did any of this and got 40Gbps, you are saying that you asked Gemini/whatever and it suggested this? Or did you try the original suggestion, hit the snag, swapped to MPTCP, then got it working?
I’m asking because I actually tried it, hit the snags I reported earlier, and had Gemini spin its wheels for a long time suggesting all sorts of things, with ever-increasing “OMG I can’t believe this didn’t work, now this will truly fix it” hallucinations, until it eventually landed on “ok, I give up on balance-rr because of thunderbolt-net, go use MPTCP”. That was the point at which I asked how you got around the issue.
I’d love to see the results if it worked. If not, it seems we are using the same AIs, since they suggest pretty much identical things, even down to the MTU size (65528 vs 9000).
BTW, I’m sticking w. ipv4 for now because while ipv6 is a teeny bit faster in theory, all the tools seem better at dealing w. ipv4.
Bottom line? with 2 links (cable1 and cable2), running bi-di tests, I can see ~10Gbps TX and RX on both cable1 and cable2, on both FWD1 and FWD2. But 20Gbps aggregate bandwidth on a USB4 port is a far cry from the claimed 40Gbps.
Other things I have verified / areas I still need to test some more:
Both physical links are reporting as 2x20Gbps i.e. 40Gbps so I think the cables are fine. The cables are <1ft each.
All interfaces on both machines are in the trusted zone so no firewall overhead
Kernel params are iommu=pt and usb4_dma_protection=off. Fun fact was that I learned amd_iommu=on is not even a valid parameter, so articles saying turn that on are hallucinating
I tried all of the sysctl net.core.* settings you cited (my AI session proposed similar ones) but none of them had any appreciable effect on the measured 10Gbps soft cap
My AI session was worrying about power management on the USB4 controllers causing them to maybe flap, causing a downgrade to USB3.2 speeds (10Gbps)
Lots of the suggestions from the AIs were “go change XYZ in the BIOS” which on the FWD is a complete non-starter
I’m still going to test all the random avenues that are being proposed, including MPTCP, but I’m not super hopeful. I really want someone to post actual iperf3 results that show something meaningfully over 10Gbps in one direction on the FWD via a USB4 port. That way at least I know I’m not spinning my wheels.
Hi,
When measuring network performance, one cannot fill the pipe with a single tcp stream. So, you need to run a test that is not only multithreaded, but also creates multiple tcp links.
So, maybe an iperf command will do that, or you can run multiple iperf instances on different ports, or use some other test tool that can create multiple tcp links.
I did note that I hadn’t tried this yet! I don’t have 2 thunderbolt devices to test with yet! the desktop is upstairs and the framework is downstairs……Hence the “AI” disclaimer…….
Understood! I saw the note that the writeup was AI generated, but I assumed the steps had been tested. I’m pretty new to all this so when I discovered that balance-rr was a non-starter I wanted to determine if the issue was my implementation or AI hallucination. My own experience with e.g. Gemini is that it’s very helpful for learning but it does go enthusiastically off the rails so for my part I’m striving to validate & document my steps.
Ah yes, it does seem like an improvement over the FWD’s 5GbE RJ45, although in the Strix Halo Discord someone pointed out that the thunderbolt-net driver is locked to one core and built around a shared Tx/Rx queue across both USB4 ports. If true, very not ideal.
It’s unclear whether faster networking would actually benefit real scenarios, and if so how much faster networking / how much benefit. I’m trying to be very deliberate and to automate my settings and tests so that I can replicate them on different network setups. IMHO our community would derive benefit from more concrete data & repeatable tests.
Hi @James3, I took your suggestion and modified things to run a variable # of the tests in parallel. Same results, but that does support there being a hard limit somewhere in the software or hardware stack.
My tests are doing bidi (tx+rx) on both USB4 ports (thunderbolt0 and thunderbolt1). Running 1 instance, we see a little over 10Gbps on each port and in each direction, around 42-45Gbps total across both USB4s. Scaling up the # of instances 1→2→4→8→etc the individual throughput of each test drops proportional to the # of instances, and the total aggregate bandwidth stays at around the same level of 42-45Gbps.
pdrayton@fwd1:~$ ./run_multiple.sh 1
Launching 1 test instances…
Aggregated total for 1 instances (2 total links): 45.73 Gbps (Tx+Rx)
pdrayton@fwd1:~$ ./run_multiple.sh 2
Launching 2 test instances…
Aggregated total for 2 instances (4 total links): 45.72 Gbps (Tx+Rx)
pdrayton@fwd1:~$ ./run_multiple.sh 4
Launching 4 test instances…
Aggregated total for 4 instances (8 total links): 45.72 Gbps (Tx+Rx)
pdrayton@fwd1:~$ ./run_multiple.sh 8
Launching 8 test instances…
Aggregated total for 8 instances (16 total links): 45.74 Gbps (Tx+Rx)
pdrayton@fwd1:~$ ./run_multiple.sh 16
Launching 16 test instances…
Aggregated total for 16 instances (32 total links): 45.82 Gbps (Tx+Rx)
pdrayton@fwd1:~$ ./run_multiple.sh 32
Launching 32 test instances…
Aggregated total for 32 instances (64 total links): 45.74 Gbps (Tx+Rx)
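The scaling behaviour in those runs can be worked through with a little arithmetic. This sketch uses only the aggregate figures from the output above; the per-link numbers are derived by plain division, assuming the load is spread evenly across links, and the names RUNS and per_link_gbps are made up for illustration.

```python
# Aggregate Tx+Rx Gbps reported by run_multiple.sh, keyed by instance count.
RUNS = {1: 45.73, 2: 45.72, 4: 45.72, 8: 45.74, 16: 45.82, 32: 45.74}
LINKS_PER_INSTANCE = 2  # the script reports 2 links per instance


def per_link_gbps(total_gbps: float, instances: int) -> float:
    """Average Tx+Rx throughput per link, assuming an even split."""
    return total_gbps / (instances * LINKS_PER_INSTANCE)


if __name__ == "__main__":
    for n, total in RUNS.items():
        print(f"{n:>2} instances: {total:.2f} Gbps total, "
              f"{per_link_gbps(total, n):.2f} Gbps per link (Tx+Rx)")
    # Range of the aggregate across all runs.
    spread = max(RUNS.values()) - min(RUNS.values())
    print(f"aggregate varies by only {spread:.2f} Gbps across runs")
```

A flat aggregate with per-link throughput shrinking in proportion to the instance count is what a shared bottleneck looks like, rather than a per-stream limit.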
Reports are that other platforms’ USB4 does not have this issue; it seems to be a Strix Halo failing. I’ve not verified this myself yet, but I will eventually get around to testing it with two Nvidia GB10 units.
They are 1.5 foot, TB5, rated to 80Gbps (120Gbps if asymmetric, but AFAIK that’s more of a Mac thing?). I have seen the recommendation to get active cables, but AFAICT that applies to longer cable runs; I wasn’t even able to find active TB4/5 cables at 1 → 1.5ft lengths.
Digging through /sys/bus/thunderbolt/devices/*-* from both machines, they claim to be negotiated at the full 40Gbps (2x20) on both ends. From FWD1 we see these links to FWD2, and then the similar thing in reverse:
Device: 0-2 (fwd2)
Negotiated: RX 20.0 Gb/s x 2 lanes | TX 20.0 Gb/s x 2 lanes
Total Bandwidth: 40.0 Gbps
Device: 1-2 (fwd2)
Negotiated: RX 20.0 Gb/s x 2 lanes | TX 20.0 Gb/s x 2 lanes
Total Bandwidth: 40.0 Gbps
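For anyone who wants to run the same check, here is a rough sketch of reading the negotiated link speed from sysfs. It is not the script that produced the output above. It assumes each device directory exposes the rx_speed, rx_lanes, tx_speed and tx_lanes attributes described in the kernel's thunderbolt sysfs ABI, with speeds formatted like "20.0 Gb/s"; the function names are hypothetical.

```python
from pathlib import Path


def link_summary(dev: Path):
    """Return (rx_gbps, tx_gbps) for one device directory.

    Assumes speeds look like "20.0 Gb/s" and lane counts are integers.
    """
    def read(name: str) -> str:
        return (dev / name).read_text().strip()

    rx = float(read("rx_speed").split()[0]) * int(read("rx_lanes"))
    tx = float(read("tx_speed").split()[0]) * int(read("tx_lanes"))
    return rx, tx


def scan(base: Path = Path("/sys/bus/thunderbolt/devices")) -> dict:
    """Map device name -> (rx_gbps, tx_gbps), skipping entries without speed attributes."""
    found = {}
    for dev in sorted(base.glob("*-*")):
        try:
            found[dev.name] = link_summary(dev)
        except (OSError, ValueError, IndexError):
            continue  # e.g. the host router or other entries with no link speed
    return found


if __name__ == "__main__":
    for name, (rx, tx) in scan().items():
        print(f"Device {name}: RX {rx:.1f} Gb/s | TX {tx:.1f} Gb/s")
```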
I have more advanced scripts that saturate both links at once and aggregate results across many clients, but the simplest version that also shows the issue is this:
Server started on FWD2: iperf3 -s -B 10.0.0.2 -p 5202 -D
Client started on FWD1: iperf3 -c 10.0.0.2 -p 5202 -P 8 -t 30 --bidir
I’m consistently seeing it cap out at ~11Gbps in each direction.
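To make runs like this easier to compare and repeat, the client can be asked for JSON with iperf3's -J flag and the totals pulled out in a few lines. This is a minimal sketch, not the more advanced scripts mentioned above. It assumes the report has end.sum_sent and end.sum_received entries with a bits_per_second field, and that --bidir runs add an end.sum_received_bidir_reverse entry (treated as optional here since it may vary by iperf3 version); the function name summarize and the file name run.json are made up.

```python
import json


def summarize(report: dict) -> dict:
    """Extract headline throughput in Gbps from a parsed iperf3 -J report."""
    end = report["end"]
    out = {
        "sent_gbps": end["sum_sent"]["bits_per_second"] / 1e9,
        "received_gbps": end["sum_received"]["bits_per_second"] / 1e9,
    }
    # Only present for --bidir runs (assumed key name, hence optional).
    reverse = end.get("sum_received_bidir_reverse")
    if reverse is not None:
        out["reverse_received_gbps"] = reverse["bits_per_second"] / 1e9
    return out


if __name__ == "__main__":
    # Produced with e.g.:
    #   iperf3 -c 10.0.0.2 -p 5202 -P 8 -t 30 --bidir -J > run.json
    with open("run.json") as fh:
        print(summarize(json.load(fh)))
```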
Yes, but the main question here is whether we can get a network faster than 40Gbps…
For now it looks like the only possibility is to get a dual 25Gbps network card working and aggregate the 2 links.
For Thunderbolt we might get a 20Gbps network if we can figure out how to configure an aggregated link…
What’s the end purpose of this network? To fully utilize such bandwidth for inference you would need hardware support, since software alone would not be able to saturate it.
You can look into Infiniband network cards that provide support for RDMA.
OK, I had promised to do this a while back, got sidetracked with the MCX5 cards. Finally got around to measuring USB4 performance on the FWD to a high-speed storage device, and I can confirm the USB4 ports are entirely capable of pushing over 30Gbps sustained.
My tests were done w. a USB4/TB4 dock containing two different PCIe 5.0 M.2 drives. Same two USB4 cables that I’ve been using for my thunderbolt-net tests, same two FWD machines. The Fedora installs on both machines are done using an automated Kickstart file that images them from bare metal, so we can safely say that between the USB networking tests and the USB4 storage IO tests, everything is identical except for software stack (network vs storage) and the actual device on the other end of the cable.
Pretty sure this isn’t a Framework issue, this is a TB software stack issue. It is leaving >60% of potential on the cutting-room floor.
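The ">60%" figure can be reproduced with quick arithmetic. This sketch pairs the ~11Gbps per-direction thunderbolt-net cap reported earlier in the thread with 30Gbps as a floor for the storage result; choosing those two numbers is an assumption made here for illustration, and the function name is hypothetical.

```python
def unused_fraction(measured_gbps: float, reference_gbps: float) -> float:
    """Fraction of the reference throughput that the measured path fails to reach."""
    return 1.0 - measured_gbps / reference_gbps


if __name__ == "__main__":
    # ~11 Gbps via thunderbolt-net vs. at least 30 Gbps to USB4 storage.
    print(f"{unused_fraction(11.0, 30.0):.0%} of the demonstrated port throughput is unused")
```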
Despite this lacklustre result from thunderbolt-net, TB latency is still better than Ethernet, and TB throughput is better than anything up-to-and-including 10Gbps Ethernet. So USB4/TB is worth using in 2-node Strix Halo clusters for anyone not able to use >=25Gbps networking.
I don’t have 2x TB5 docks; I only grabbed the one from Amazon as a test. I do have identical multiples of everything else though, but I realistically have no use for a 2nd dock so I am loath to buy a second.