Help wanted - eGPU bootloop issues

How are you testing the speed of your nvme device?

I ran into that a while ago, and in my case it turned out to be just a testing and reporting issue: single-threaded dd maxed out way below the expected ~3.5 GB/s, so I was chasing my tail because the USB4 root kept reporting as PCIe gen 1.

It turned out that single-threaded dd just can’t saturate the disk, while KDiskMark reached the 3.5 GB/s just fine. The USB4 root always reports the same speed, but that has no actual impact on throughput.
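To illustrate the single-threaded vs. parallel point, here is a minimal sketch that reads one file from several threads at once and times it. The function names and the default of 4 workers are made up for the example, and it assumes a large regular file on the NVMe (not a raw block device). Unless the page cache is dropped first, the numbers will be inflated, so treat it as a rough comparison rather than a benchmark.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length, block_size=1 << 20):
    """Read `length` bytes starting at `offset`; return the number of bytes read."""
    done = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while done < length:
            chunk = os.pread(fd, min(block_size, length - done), offset + done)
            if not chunk:  # reached end of file early
                break
            done += len(chunk)
    finally:
        os.close(fd)
    return done


def parallel_read(path, workers=4):
    """Split the file into contiguous ranges and read them concurrently.

    Returns (total_bytes_read, elapsed_seconds).
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0.0
    step = -(-size // workers)  # ceiling division
    ranges = [(off, min(step, size - off)) for off in range(0, size, step)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(lambda r: read_range(path, *r), ranges))
    return total, time.perf_counter() - start
```

Comparing `parallel_read(path, workers=1)` against `parallel_read(path, workers=4)` on the same file should show whether a single reader is the bottleneck.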

@Mario_Limonciello @James3, sorry for the delay: the Windows installer decided to throw a tantrum… For now I’ve attached lspci outputs on Linux to the kernel bugzilla report. I will continue trying to install Windows today.

@James3, the lower speed is almost certainly NV’s power management. For example, here are two lspci excerpts: the first one is from when the card is idle, the second from when cuda-z is running:

morgwai@morgwai-x4tuxedo:~$ sudo lspci -vvvnns 3:00.0 |grep -i lnksta
		LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
morgwai@morgwai-x4tuxedo:~$ sudo lspci -vvvnns 3:00.0 |grep -i lnksta
		LnkSta:	Speed 16GT/s, Width x4 (downgraded)
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
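To put numbers on the two excerpts above, here is a small sketch that parses a `LnkSta:` line and estimates the raw link bandwidth per direction. The helper names are made up for the example. The encoding efficiencies (8b/10b for 2.5 and 5 GT/s, 128b/130b for 8 GT/s and above) are standard PCIe figures, and the result ignores protocol overhead, so real throughput will be lower.

```python
import re

# Payload efficiency of the line encoding, keyed by link speed in GT/s.
ENCODING = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130, 16.0: 128 / 130, 32.0: 128 / 130}

LNKSTA = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\((\w+)\))?,\s*Width\s+x(\d+)(?:\s*\((\w+)\))?"
)


def parse_lnksta(line):
    """Extract speed, width and the 'downgraded' markers from an lspci LnkSta line."""
    m = LNKSTA.search(line)
    if m is None:
        return None
    speed, speed_note, width, width_note = m.groups()
    return {
        "gts": float(speed),
        "width": int(width),
        "speed_downgraded": speed_note == "downgraded",
        "width_downgraded": width_note == "downgraded",
    }


def raw_bandwidth_gbs(gts, width):
    """Approximate one-direction bandwidth in GB/s, before protocol overhead."""
    return gts * ENCODING[gts] * width / 8
```

By this estimate the idle link (2.5 GT/s x4) gives about 1 GB/s and the loaded link (16 GT/s x4) about 7.9 GB/s.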

@knipp30, regarding attachments on kernel bugzilla: there’s an “Add an attachment” link near the top of the page, right under the list of existing attachments.

Forgot to mention, I received the below message from Minisforum today, I’ll let it speak for itself:

Dear Customer,

Thank you for your patience.

After further discussion regarding your issue, we sincerely apologize that we are unable to conduct accurate debugging on Linux systems due to technical limitations. This problem is likely an isolated compatibility issue specific to Linux. We recommend installing Windows 11 for testing to verify if the random restart still occurs.

Best regards,

[footer truncated to protect the privacy of the representative]

:frowning:

@Mario_Limonciello @James3, I’ve attached lspci output on Win11.

Mine was just as bad from them…

They gave me the following:

Please reconnect all the parts according to this video: https://www.youtube.com/watch?v=ObK8BskOYPQ
Keep the computer powered off, then turn on your power supply first. Then observe if the graphics card fan starts to rotate. If it doesn’t start rotating, press the forced power-on button on the deg1 panel to observe if the graphics card fan starts rotating. If it still doesn’t rotate, check if the oculink cable is damaged. Try replacing it with a new one. If the fan starts rotating, press the power button on the computer to boot it up. After booting up, enter the system. First, connect the HDMI cable to the HDMI port on the computer, then enter the device manager to check if your graphics card is recognized. If it is recognized, first install the graphics driver from the graphics card official website, then connect the HDMI port to the graphics card and check if it works.

This was after telling them that I am using USB4 and have a DEG2, and sending them a video of the failure…

So tl;dr, help from them will likely be useless.

Hey all, the saga continues :slight_smile: :zany_face:

I have 2 eGPU docks:
UT3G - This works 100% and is working “well”
DEG2 - This does not work at all, neither with USB4 nor with OCuLink

I am sure I am doing something wrong with OCuLink, as Framework users in other threads state that OCuLink works with the 50-series cards (with the DEG1 and DEG2). What I’m not sure about is whether that is on Linux.

With the DEG2 and OCuLink, the GPU shows up in the lspci output but is not picked up by the Nvidia driver. I have tried every combination of kernel parameters I could find, and it never works.

I’m open to ideas here, because the improved performance seems like it might be worth at least checking out (and, hell, I bought all the parts).

Anything interesting in the logs? (sudo journalctl -b |grep -iE 'nvidia|nvl|nvr')
If there’s nothing there, then maybe try rescanning the PCIe bus (as root: echo 1 >/sys/bus/pci/rescan)

I already shut down, but here are the main culprits in dmesg:

unexpected WPR2 already up

RmInitAdapter: Cannot initialize GSP firmware RM

RmInitAdapter failed! (0x62:0x40:2028)

@Morgwai

I think your problems may be caused by this device:
63:00.0 Ethernet controller [0200]: Motorcomm Microelectronics. YT6801 Gigabit Ethernet Controller [1f0a:6801] (rev 01)

So, please try physically removing that device if you can, and try again.

Another aspect might be the retimers.
While retimers are not mentioned in the lspci output on Windows, they are mentioned in the Linux lspci output, where they show as disabled on the Nvidia GPU card.
So maybe Linux does not yet support the particular retimer on that particular GPU card.

This is the most common internal error of the NV driver. Judging by the previous line (RmInitAdapter: Cannot initialize GSP firmware RM), this is a conflict between NV’s and Framework’s firmware. Therefore, check that you have the latest Framework firmware, the latest vBIOS for your 5060ti, and the latest NV driver (595.71.x).
Also, have you tried rescanning the PCIe bus?

If none of the above helps, then you should send a bug report to NV’s Linux subforum, but unfortunately, as I myself ranted recently, you will probably just watch it being ignored :frowning:

As a crude workaround, you can try the same trick as for TB mode: use the proprietary flavor of the driver (install the cuda-drivers package instead of nvidia-open) and disable the GSP firmware (options nvidia NVreg_EnableGpuFirmware=0). Not sure if it will help here, however, as this may be due to the vBIOS rather than GSP, but it is worth trying. …And of course, even if it does help, the performance penalty can be up to 50% in some scenarios…
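After setting the module option and reloading the driver, it may be worth checking that the option actually took effect. The sketch below assumes that the NVIDIA driver exposes its current parameters as `Key: value` lines in /proc/driver/nvidia/params and that the key is named `EnableGpuFirmware`. Both the helper names and that assumption should be checked against your driver version.

```python
def parse_nvidia_params(text):
    """Parse 'Key: value' lines (assumed format of /proc/driver/nvidia/params)."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def gsp_firmware_disabled(text):
    """True only if EnableGpuFirmware is present and set to 0."""
    return parse_nvidia_params(text).get("EnableGpuFirmware") == "0"


# Typical use on a machine with the driver loaded (not executed here):
#     with open("/proc/driver/nvidia/params") as f:
#         print(gsp_firmware_disabled(f.read()))
```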

Finally, I’ve just had a look and there haven’t been any Framework+Blackwell builds posted on egpu.io yet (Best External Graphics Card Builds | eGPU.io), neither on Win nor on Linux, and neither on Intel nor on AMD. Try asking the folks who reported success in the other threads here which OS and CPU they have.

It’s a built-in NIC, so I’m not sure if it’s possible to remove it or if it’s soldered. I’ll try opening the laptop later today to check, but I honestly doubt it has anything to do with the TB5 problem, as none of the 5 other laptop models on which the same problem was reported uses this NIC.
Why do you think it may be related?

Do you mean retimers on the GPU card or on the DEG2 adapter? Because, as mentioned in the first entry in the kernel bugzilla report, the same physical card works perfectly fine with this laptop on Linux when connected with any other non-TB5 adapter (USB4 UT4G, TB3 TH3P4G3, OCuLink DEG1 and DEG2).

The lspci -vvv output suggests that the retimers are not being used on either the DEG2 or the eGPU card.
Retimers work best on the receiving side of a link, so the DEG2 retimers are the best ones to switch on, as they are on the receiving side of the USB4/Thunderbolt cable.
I don’t know how to switch them on. It is normally done by the firmware in the device (DEG2).

02:00.0 PCI bridge [0604]: Intel Corporation JHL9480 Thunderbolt 5 80/120G Bridge [Barlow Ridge Hub 80G 2023] [8086:5786] (rev 85) (prog-if 00 [Normal decode])
...
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported

03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1) (prog-if 00 [VGA controller])
...
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported

I’ve just had a look at the UT4G: they are marked as unsupported there as well, but maybe TB5 needs them “more critically” than USB4? @Mario_Limonciello what do you think?

morgwai@morgwai-x4tuxedo:~$ sudo lspci -vvvnns 2:0.0 |grep -iE '^0|retimer'
02:00.0 PCI bridge [0604]: ASMedia Technology Inc. Device [1b21:2461] (prog-if 00 [Normal decode])
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
morgwai@morgwai-x4tuxedo:~$ sudo lspci -vvvnns 0:1.2 |grep -iE '^0|retimer'
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo PCIe USB4 Bridge [1022:150a] (prog-if 00 [Normal decode])
		LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
morgwai@morgwai-x4tuxedo:~$ sudo lspci -vvvnns 3:0.0 |grep -iE '^0|retimer'
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1) (prog-if 00 [VGA controller])
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
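For reading these flags, it may help to separate the two registers. As far as I understand lspci’s decoding, the `Retimer`/`2Retimers` flags in LnkCap2 say whether the port supports retimer presence detection, while the same flags in LnkSta2 say whether a retimer was actually detected on the link. A small sketch (helper names are made up for the example) that extracts both from the grep output above:

```python
import re

# Matches "Retimer+", "Retimer-", "2Retimers+" or "2Retimers-".
FLAG = re.compile(r"\b(2Retimers|Retimer)([+-])")


def retimer_flags(line):
    """Return e.g. {'Retimer': True, '2Retimers': False} for one lspci line."""
    return {name: sign == "+" for name, sign in FLAG.findall(line)}


def retimer_summary(lnkcap2_line, lnksta2_line):
    """Combine the capability line (LnkCap2) with the status line (LnkSta2)."""
    cap = retimer_flags(lnkcap2_line)
    sta = retimer_flags(lnksta2_line)
    return {
        "presence_detect_supported": cap.get("Retimer", False),
        "retimer_detected": sta.get("Retimer", False),
        "two_retimers_detected": sta.get("2Retimers", False),
    }
```

Under that reading, `Retimer+` in LnkCap2 together with `Retimer-` in LnkSta2 would mean "detection is supported, but no retimer was seen on this link", not "a retimer is present but disabled".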

On Windows, can you do a hex dump of the PCIe config space?
i.e.
lspci -xxxx (four x’s is best, as it captures the most). It should capture 4096 bytes from each device. If you see less, try running it as an administrator.

And then do the same on Linux:
sudo lspci -xxxx

We can then look for any differences at the byte level.
It appears that the Windows lspci does not capture some of the PCIe gen 4 bits, so getting a hex dump might capture everything.

@James3,

  1. I’ve attached lspci -xxxx to the kernel.org bug report:
    Linux: Making sure you're not a bot!
    Win11: Making sure you're not a bot!
    As before, I removed the NVMe with the main OS before booting Windows, so there will be one device less in the Windows output.

  2. I’ve tried disabling the Motorcomm NIC, but it seems it was not designed to be disabled by users. I’ve reached out to Tuxedo for help regarding this: reporting TB5 problems involving IBP-14 gen10 AMD to linux-usb (#186) · Issues · TUXEDO Computers / Development / Packages / linux · GitLab

Thanks again for your help and involvement!

The Windows output of lspci -xxxx is different from the Linux output.
This might be a permissions thing.
Notice how the Linux output is about 4096 bytes for each device, but the Windows output is only about 256 bytes for each device.
Did you use run-as-administrator for the Windows one?
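A quick way to check the 4096 vs. 256 byte observation across a whole dump is to count the hex bytes per device. This is a minimal sketch with made-up helper names. It assumes the usual `lspci -x` layout, where each device starts with a `bus:dev.fn` header line and each hex row looks like `offset: b0 b1 ... bf`.

```python
import re

# Device header, with or without a PCI domain prefix (e.g. "0000:").
DEVICE = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")
# One row of the hex dump: an offset followed by space-separated bytes.
HEXROW = re.compile(r"^[0-9a-f]{2,3}:((?: [0-9a-f]{2})+)\s*$")


def config_space_sizes(dump):
    """Map each device address to the number of config-space bytes dumped for it."""
    sizes = {}
    current = None
    for line in dump.splitlines():
        dev = DEVICE.match(line)
        if dev:
            current = dev.group(0)
            sizes[current] = 0
            continue
        row = HEXROW.match(line)
        if row and current is not None:
            sizes[current] += len(row.group(1).split())
    return sizes
```

Devices that report 256 bytes only have the legacy PCI config space in the dump, so the PCIe extended capabilities (offsets 0x100 and above) cannot be compared for them.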

It might just be a permissions thing, but the IOMMU device is also missing on the Windows side:

--- lspci-xxxx-tuxedo-ibp14gen10-deg2-linux-F.txt       2026-05-04 11:41:29.894997030 +0100
+++ lspci-xxxx-tuxedo-ibp14gen10-deg2-windows2-F.txt    2026-05-04 11:48:42.138653136 +0100
@@ -3,16 +3,6 @@
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

-00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo IOMMU
-       Subsystem: AIstone Global Limited Device 5006
-       Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
-       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
-       Interrupt: pin A routed to IRQ 255
-       Capabilities: [40] Secure device <?>
-       Capabilities: [64] MSI: Enable+ Count=1/4 Maskable- 64bit+
-               Address: 00000000fee00000  Data: 002b
-       Capabilities: [74] HyperTransport: MSI Mapping Enable+ Fixed+

I right-clicked on PowerShell and chose “Run as administrator” or something along those lines, and the window had ‘Administrator’ in its title. I’d guess that means yes, but honestly I know nothing about Windows (before this DEG2 thing, the last time I had Windows installed was around 2001).

Thank you for the lspci -xxxx output.
I can then use the lspci -F option on Linux to view both windows and linux output.

Was the lspci -xxxx dump taken while the GPU was in use on Windows,
i.e. while playing a game or something like that?
Some of the PCIe config on the Windows side appears to be disabled where I would not expect it.
In Windows:

01:00.0 PCI bridge: Intel Corporation JHL9480 Thunderbolt 5 80/120G Bridge [Barlow Ridge Hub 80G 2023] (rev 85) (prog-if 00 [Normal decode])
        Subsystem: Device 2222:1111
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 255
        Bus: primary=01, secondary=02, subordinate=02, sec-latency=0
        I/O behind bridge: 0000f000-00000fff [disabled] [32-bit]
        Memory behind bridge: fff00000-000fffff [disabled] [32-bit]
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]

In Linux:

01:00.0 PCI bridge: Intel Corporation JHL9480 Thunderbolt 5 80/120G Bridge [Barlow Ridge Hub 80G 2023] (rev 85) (prog-if 00 [Normal decode])
        Subsystem: Device 2222:1111
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 255
        Bus: primary=01, secondary=02, subordinate=60, sec-latency=0
        I/O behind bridge: 0000a000-0000dfff [size=16K] [32-bit]
        Memory behind bridge: c4000000-db0fffff [size=369M] [32-bit]
        Prefetchable memory behind bridge: 0000007800000000-00000097f1ffffff [size=130848M] [64-bit]
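The `[disabled]` markers in the Windows output follow from the register values themselves: a bridge window whose base is above its limit forwards nothing. A small sketch (the helper name is made up for the example) that decodes the "behind bridge" lines and reproduces the sizes lspci prints:

```python
import re

WINDOW = re.compile(r"^\s*(.*behind bridge):\s*([0-9a-f]+)-([0-9a-f]+)", re.IGNORECASE)


def parse_window(line):
    """Decode one 'behind bridge' line into base, limit, enabled flag and size in bytes."""
    m = WINDOW.match(line)
    if m is None:
        return None
    base = int(m.group(2), 16)
    limit = int(m.group(3), 16)
    enabled = base <= limit  # base above limit means the window is closed
    return {
        "name": m.group(1),
        "base": base,
        "limit": limit,
        "enabled": enabled,
        "size": limit - base + 1 if enabled else 0,
    }
```

For the Linux lines this gives 16 KiB of I/O, 369 MiB of memory and 130848 MiB of prefetchable memory, matching the `size=` annotations. For the Windows lines all three windows come out closed, which fits a dump taken before the GPU behind the bridge was assigned any resources.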

No, I ran lspci right after connecting the DEG2.
I’ve also created a dump taken while running an Unreal Engine demo: Making sure you're not a bot!