Recently picked up a Minisforum DEG2 dock and an nvidia 5060ti. It seems that no matter what, I cannot get the GPU to run a load without a hard lockup/bootloop.
I am on Fedora 44
I have tried with the akmod and the akmod-open drivers - and no matter what, the same issue happens:
Boot to desktop
nvidia-smi shows the GPU
Run anything that puts a load on the gpu
ollama run ‘your fav model’
The system freezes and enters a reboot - sometimes booting 2-3x before coming back.
Currently spinning up Windows 11 on a second drive to see if this is isolated to Fedora 44 - but open to any suggestions/help/success stories.
I’d rather not go the OCuLink route... for now - but this dock is capable of doing so if that is the end-all.
Hey, thanks for updating the bug entry on kernel.org bugzilla (for anyone else reading this thread, I refer to this report). Hopefully someone will finally have a look into this.
Please also let them know there that it runs perfectly fine on Windows, so that they don’t dismiss it as a hardware issue.
@James3 if you look at the bug report, I’ve already tried exactly that and it doesn’t help: it still reboots. There are some AER errors in the logs then, which made Mario think it’s a hardware issue (like a bad-quality cable or something), but the thing is that on Windows it is perfectly stable.
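For anyone who wants to compare logs, here is a minimal Python sketch that counts AER-related lines by severity in kernel log text, for example the output of `journalctl -k -b -1` (kernel messages from the previous boot, if the journal survived the reset). The function name `summarize_aer` is made up for this illustration, and the `severity=...` pattern is an assumption about how the AER messages are formatted; adjust it if your logs look different.

```python
import re
from collections import Counter

# Assumed shape of an AER detail line, e.g.
# "pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)"
SEVERITY_RE = re.compile(r"PCIe Bus Error: severity=([^,]+),")


def summarize_aer(log_text):
    """Return (counts_by_severity, matching_lines) for AER-related lines in log_text."""
    counts = Counter()
    lines = []
    for line in log_text.splitlines():
        # Keep only lines that look AER-related; everything else is ignored.
        if "AER" not in line and "PCIe Bus Error" not in line:
            continue
        lines.append(line)
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts, lines
```

Calling `summarize_aer(text)` on logs from a crashing boot and from a boot with a working adapter makes it easier to see whether the errors are corrected ones only or also uncorrectable ones.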
@Morgwai
I am a bit confused. That comment 6 link is not about a FW16 7040 series laptop.
How does it relate to this thread about a FW16 laptop?
(TUXEDO IBP-14 gen10, HX370 APU)
As @knipp30 pointed out in comment 13, he (and many others on egpu.io with AMD CPUs) faces exactly the same behavior. Therefore it’s reasonable to assume that the root cause is the same and not specific to the laptop model.
Although the symptoms appear to be similar, the root cause is unlikely to be the same.
There are many many different possible reasons for a forced reboot.
If you wish to make progress with diagnosis on your laptop / dock, you would need to start by disabling any Thunderbolt and USB devices, then turning them on / plugging them back in one at a time to see which one causes the reboot. From the dmesg log output, there appear to be many different things plugged in. Try to narrow down the cause if you can.
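To keep track of what is actually attached while plugging things in one at a time, something like the following Python sketch could help. It reads the Thunderbolt/USB4 entries the kernel exposes under `/sys/bus/thunderbolt/devices`. The function name `list_tb_devices` is made up for this illustration, and the attribute names are my reading of the kernel's Thunderbolt sysfs documentation; any attribute that is missing on a given entry is simply skipped.

```python
from pathlib import Path

# Attribute names assumed from the kernel's Thunderbolt sysfs documentation.
# Entries such as domains may not have them, which is handled below.
ATTRS = ("vendor_name", "device_name", "authorized", "generation")


def list_tb_devices(root="/sys/bus/thunderbolt/devices"):
    """Return one dict per entry under root, with whichever ATTRS are readable."""
    devices = []
    base = Path(root)
    if not base.is_dir():
        # No Thunderbolt/USB4 bus exposed (or the driver is not loaded).
        return devices
    for entry in sorted(base.iterdir()):
        info = {"id": entry.name}
        for attr in ATTRS:
            try:
                info[attr] = (entry / attr).read_text().strip()
            except OSError:
                # Attribute not present or not readable for this entry.
                pass
        devices.append(info)
    return devices
```

Printing the result of `list_tb_devices()` after each plug/unplug step gives a record of which configuration was active when a reboot happened.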
I beg to differ regarding this. Anyway, hopefully we will find out soon.
FYI, that’s just the one when I gathered the logs. I’ve tried like dozens of configurations, connecting and disconnecting stuff and moving between ports: the result was always the same: reboot on Linux, working perfectly fine on Windows.
From the descriptions of the problem so far, and the fact that you are seeing AER errors and sync-flood errors, these are hardware-based problems.
I have no idea how to fix those. I also have no idea why windows is apparently working, as I don’t see how windows can somehow fix hardware problems that are present in Linux.
I suggest you raise a support ticket with the DEG2 supplier.
Other threads on here have seen fixes for some AER errors by upgrading SSD firmware. But you don’t give any details of what devices are plugged in, so I have no idea if an SSD is involved or not.
Well, ideally I can fully rule out hardware as the culprit soon - I have a UT3G headed this way as well - but really liked the all-in-one form factor of the DEG2.
Hopefully I can get it working with one of these docks - at least till the OCuLink devkit starts becoming available.
The fact that Windows works perfectly fine clearly shows that these are not hardware problems: it works as long as the software layer properly handles PCIe tunneling over TB5. Perhaps there’s a Linux kernel bug that triggers reporting of bogus errors in an otherwise healthy situation, and then a cascade of these causes a reboot (as indicated by the reset-reason message we are getting on the next boot).
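If you want to collect that reset-reason message across several crashes, a small Python sketch like this can pull it out of saved kernel log text. The helper names are made up for this illustration, and the match is deliberately loose (any line containing "reset reason") because the exact wording the kernel prints is an assumption here.

```python
import re

# Loose, case-insensitive matches on purpose: the exact message wording is assumed.
RESET_REASON_RE = re.compile(r"reset reason", re.IGNORECASE)
SYNC_FLOOD_RE = re.compile(r"sync[- ]flood", re.IGNORECASE)


def find_reset_reason_lines(log_text):
    """Return every line of log_text that mentions a reset reason."""
    return [line for line in log_text.splitlines() if RESET_REASON_RE.search(line)]


def mentions_sync_flood(lines):
    """Return True if any of the given lines mentions a sync flood."""
    return any(SYNC_FLOOD_RE.search(line) for line in lines)
```

Running `find_reset_reason_lines` on the kernel log of the boot right after each crash, and then `mentions_sync_flood` on the result, shows whether every reboot reports the same cause.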
ASM2464PD(X)-based adapters (like UT3G / UT4G, AG02) work perfectly fine in my case, and I bet it will be similar in your case. This suggests that the problem is specifically in how the kernel handles PCIe tunneling over TB5 (there were a few reports on egpu.io describing identical symptoms with EG02 and AG03 TB5 adapters).
@Morgwai
I am just another user like you. I know less than you about this problem because hardly any details are available in the bug reports.
But from previous conversations with AMD, only a mainboard manufacturer can debug “sync-flood” problems because special debug BIOS and test/debug mainboards are needed to track it down. The manufacturer of the DEG2 would also need to be involved. That, combined with hardly anyone having access to the AMD documentation due to NDAs, means very few people, if any, are able to make any progress on this problem.
This is not a DEG2 specific problem: so far it seems to affect every TB5 eGPU adapter: DEG2, EG02, AG03, TH5P4. There have been exactly zero successful AMD + TB5 + Linux reports on egpu.io.
(BTW: Intel-based laptops also fail with TB5 on Linux, but their symptoms don’t include rebooting: just severe connection instabilities resulting in GPUs falling off the PCIe bus)
Fortunately Mario Limonciello from AMD is already involved. If we can get either Tuxedo or Framework engineers involved as well, there is a chance of fixing this.
Just an idea: if USB4/Thunderbolt to the DEG2 works in Windows, my guess is that there is some “quirk” that the Windows Thunderbolt/USB4 driver works around and that is not worked around in Linux.
There are some quirks mentioned in the linux thunderbolt driver (it also does usb4).
Linux kernel source code: linux-kernel/drivers/thunderbolt/quirks.c
It might be worth experimenting to see if any of the quirks in there can be used with the mainboard you have. If you are adventurous, edit the source code there, and try out some of the quirks to see if any help at all.
Also, maybe open the deg2 and list all the chips it uses, so one has some idea of what chips appear to be incompatible currently with your mainboard and Linux.
That’s a great idea! I’ll have a look into this. Thanks for pointing this out.
There is only one TB5 chip currently available on the market: Intel JHL9480. All the mentioned adapters reported on egpu.io (DEG2, EG02, AG03, TH5P4) use this exact chip and all of them fail in exactly the same way on Linux.