AMD Strix Point (Windows) > AMD Strix Point (Windows): ~12 Gbit/s
AMD Strix Point (Windows) > Intel 12th gen (RHEL 10 clone) [default MTU 1500]: ~4.5 Gbit/s
AMD Strix Point (Windows) > Intel 12th gen (RHEL 10 clone) [MTU 61960]: ~10 Gbit/s
Intel 12th gen (RHEL 10 clone) [MTU 61960] > AMD Strix Point (Windows): 18.8 Gbit/s
AMD Strix Point (Windows) > Intel Maple Ridge (Windows, legacy TB drivers) [MTU 65330]: 12.2 Gbit/s
Intel Maple Ridge (Windows, legacy TB drivers) [MTU 65330] > AMD Strix Point (Windows): 15.9 Gbit/s
So it seems like
a) I remembered wrong: I had not seen higher perf. between Strix Point machines than what you have seen with Strix Halo.
b) The bottleneck seems to be on the sending side of AMD’s controllers. When receiving, they are completely fine and can keep up with anything, up to the Gen3x1 connection that I am limited to in all my scenarios, as expected.
But it’ll need far deeper dives to understand where exactly that sending bottleneck lies, and whether it can be overcome with better driver configs or whether this is a sad hardware / firmware limit inherent to AMD’s current controllers.
Note: the legacy TB drivers from Intel have only 3 very specific MTU options that match nothing else, not my fault. I have long believed this to have been the cause for bottlenecks, way back when I used the 12th gen Intel platform & Maple Ridge for tests. Anything with driver-connection manager is just way better. And my RHEL 10 clone screws up the default USB4Net config. This worked way better under Fedora…
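As a rough back-of-envelope illustration of why the MTU matters so much in the results above, the sketch below converts a measured throughput into an approximate packet rate. It assumes every packet is a full MTU-sized frame and ignores header overhead, so the numbers are only indicative; the function name `packets_per_second` is made up for this example.

```python
# Rough packet-rate estimate from the throughput figures quoted above.
# Assumption: every packet carries exactly MTU bytes, headers are ignored.

def packets_per_second(gbit_per_s: float, mtu_bytes: int) -> float:
    """Approximate packets per second needed to sustain a given throughput."""
    return gbit_per_s * 1e9 / (mtu_bytes * 8)

# (label, measured Gbit/s, MTU in bytes), taken from the results list
results = [
    ("Strix Point > Intel 12th gen, MTU 1500", 4.5, 1500),
    ("Strix Point > Intel 12th gen, MTU 61960", 10.0, 61960),
    ("Intel 12th gen > Strix Point, MTU 61960", 18.8, 61960),
]

for label, gbps, mtu in results:
    print(f"{label}: ~{packets_per_second(gbps, mtu):,.0f} packets/s")
```

With the default MTU of 1500, the 4.5 Gbit/s result already corresponds to roughly 375,000 packets per second, while the large-MTU runs move more data with far fewer packets. That fits the idea that per-packet software overhead plays a role at small MTUs, though it does not explain the send/receive asymmetry by itself.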
@Ray519 thanks for the super detailed response on this. It’s a little over my head, but there are enough handholds for me to pull myself up with more reading. Also fascinating to see your discovery that AMD sending is the issue, not the receiving.
When I read your post and also @Djip’s link, it seems that PCIe tunneling doesn’t have the same issue that thunderbolt-net does? My tests with storage devices easily show more throughput. Is there some opportunity to keep things in the PCIe tunneling world by going USB4 to a PCIe “dock” with a PCIe NIC in it? Would that take thunderbolt-net out of the picture, since the NIC would then just be a PCIe device?
[edit: sorry, last bit of post got cut off. added it back]
PCIe tunneling is separate. With that, there is a root-port on the CPU just as with any physical port, and the USB4 controller simply ingests that into a USB4 PCIe tunnel. Every PCIe packet sent or received is just passed on basically unchanged.
Cross-Domain works differently. The host controller is used as a PCIe peripheral itself and makes the DMA accesses needed to get the data out of memory and send it to the opposing controller. Which does the same for its host memory.
This is the only high-bandwidth use of the USB4 Host controller directly. Anything else is just about configuring it and the data moves through dedicated ports that are forwarded (USB3, DP, PCIe).
So yes, independent limits. That is on top of most current 40G USB4 controllers limiting themselves to 20G physical connections in Host-2-Host mode.
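To put the measurements from earlier in the thread next to that 20G limit, here is a small sketch that expresses each result as a fraction of the link rate. It assumes the Host-2-Host link runs at a nominal 20 Gbit/s and ignores encoding and protocol overhead, so real achievable throughput is somewhat lower than the nominal figure; the names used are purely illustrative.

```python
# Fraction of a nominal 20 Gbit/s Host-2-Host link used by each measurement.
# Assumption: 20 Gbit/s nominal rate, no encoding or protocol overhead counted.

LINK_GBPS = 20.0

def utilization(measured_gbps: float, link_gbps: float = LINK_GBPS) -> float:
    """Measured throughput as a fraction of the nominal link rate."""
    return measured_gbps / link_gbps

# Values taken from the results posted earlier in the thread
measurements = {
    "AMD sending to AMD": 12.0,
    "AMD sending to Intel 12th gen (MTU 61960)": 10.0,
    "Intel 12th gen sending to AMD (MTU 61960)": 18.8,
    "Maple Ridge sending to AMD (MTU 65330)": 15.9,
}

for label, gbps in measurements.items():
    print(f"{label}: {utilization(gbps):.0%} of nominal link rate")
```

Seen this way, the best AMD receive result sits close to the nominal link rate, while the runs with AMD as sender stay noticeably below it.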
Given that the controller can write to memory (receiving) fast enough, chances are decent that this is the software setup slowing it down, and AMD may just be more particular and require different settings for ideal send performance. Or some workaround to get way more out of it. And just nobody has cared to look at that, especially on the AMD side.
Anybody that says this is for “Thunderbolt 5” is untrustworthy, as they don’t seem to have understood the basics, in that they are talking only about USB4. Linux just kept “thunderbolt” as the historic name of the driver that does USB4.
And like I said, Cross-Domain is essentially RDMA. USB4Net is software to emulate ethernet on top of that RDMA support in USB4. You can throw it out. USB4 hardware won’t care, since it does not look inside the DMA / Cross-Domain packets anyway and just routes them to the other side along a preconfigured, static route.