I’m having a tough time getting the NVIDIA drivers to recognize my RTX 3060 Ti graphics card in an eGPU enclosure (I’ve tried two - the Razer Core X and an Akitio Node Titan). I’ve also tried Fedora 37, and pretty much every driver version from 470 to 525. I’m installing using the apt command (sudo apt install nvidia-driver-xxx). I’ve tried the open and regular drivers to no avail.
Looking at the kernel ring buffer output from dmesg, I see several failures:
[ 7.735454] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x26:0x56:1253)
[ 7.735549] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 7.736179] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiDevice
[ 7.736604] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000400] Failed to register device
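For anyone comparing logs, here is a minimal sketch of how the NVIDIA-related failures could be pulled out of a dmesg dump. The function name `nvidia_failures` and the list of marker strings are my own assumptions based on the lines quoted above, not an exhaustive list of what the driver can print.

```python
import re

# Markers for NVIDIA driver messages in the kernel log.
# Assumption: chosen from the messages quoted in this thread, not exhaustive.
MARKERS = ("NVRM:", "nvidia-drm", "nvidia_drm")

# Matches "[    7.735454] message text" as printed by dmesg.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s*(?P<msg>.*)$")


def nvidia_failures(dmesg_text):
    """Return (timestamp, message) pairs for NVIDIA lines that mention a failure."""
    hits = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a timestamped kernel log line
        msg = match.group("msg")
        is_nvidia = any(marker in msg for marker in MARKERS)
        if is_nvidia and "fail" in msg.lower():
            hits.append((float(match.group("ts")), msg))
    return hits
```

Save the output of `sudo dmesg` to a file and pass its contents to `nvidia_failures` to get only the lines that matter for a post like this one.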
I also have BIOS 3.06, a Razer Core X, and a 3060 Ti! A lot in common.
However, I am running Artix (a version of Arch Linux without systemd).
I am able to get the eGPU working (with nvidia-open) ONLY IF it is plugged in at boot time. Then I use it both for ML (like you) and for gaming.
I think there is a regression somewhere because a long time ago I could just hotplug it and use it with PyTorch and such. The only thing that has never worked when hotplugging was use by Xorg, but that was critical neither for ML nor for gaming (many games happily use the GPU acceleration without the help of Xorg).
Artix huh? Which specific version of the drivers works for you?
I tried Ubuntu and Fedora only because they’ve been vetted on the Framework but I’ll give Artix a try if it works. Anything but Windows… but in a bitter twist Windows found the GPU instantly, took about 30 mins to install CUDA, Python, PyTorch etc, and seems to be OK with hot plugging. Still, I’ll go back to Linux if I can get it working.
I recommend trying egpu-switcher unless you’re an xorg conf wizard. There are some specific configuration files that need to be created in order to use an external GPU on Linux, and egpu-switcher both adds those files and dynamically switches them back to the Iris Xe if the eGPU isn’t detected on boot.
I installed & configured egpu-switcher. It found the NVIDIA eGPU & set up the config files, but the NVIDIA drivers are still unable to find the eGPU. (same errors at boot, nvidia-smi still can’t detect the eGPU).
Everything I’ve tried except the NVIDIA drivers detects the eGPU, very odd.
sudo apt install nvidia-driver-xxx (tried 525, 515, 470 and 460)
sudo apt install cuda (desperate attempt hoping it would ‘just work’, which it didn’t)
From my original post, there are NVRM log items showing the driver trying to init something, but failing, leading to a cascade of errors.
Yup, Thunderbolt shows the eGPU & Direct Access is on.
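The same Thunderbolt check can be done from a script by reading the kernel's sysfs entries. This is only a sketch: it assumes the standard `/sys/bus/thunderbolt/devices` layout with `vendor_name`, `device_name` and `authorized` attributes, and the function name `thunderbolt_devices` is made up for illustration.

```python
from pathlib import Path


def thunderbolt_devices(root="/sys/bus/thunderbolt/devices"):
    """List Thunderbolt devices with vendor, name and authorization state."""
    base = Path(root)
    if not base.is_dir():
        return []  # no Thunderbolt bus exposed on this machine

    def read(dev, attr):
        path = dev / attr
        return path.read_text().strip() if path.is_file() else ""

    devices = []
    for dev in sorted(base.iterdir()):
        # Entries without a device_name (e.g. domains) are skipped.
        if not (dev / "device_name").is_file():
            continue
        devices.append({
            "id": dev.name,
            "vendor": read(dev, "vendor_name"),
            "name": read(dev, "device_name"),
            # "0" means not authorized; a non-zero value means authorized.
            "authorized": read(dev, "authorized") not in ("", "0"),
        })
    return devices


if __name__ == "__main__":
    for d in thunderbolt_devices():
        print(d["id"], d["vendor"], d["name"], "authorized" if d["authorized"] else "NOT authorized")
```

If the enclosure shows up here as authorized but nvidia-smi still finds nothing, the problem is on the driver side rather than the Thunderbolt side, which matches what is described in this thread.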
Interesting that Wayland should also work. I’ve been using Xorg exclusively for the past while trying to get the eGPU to work, but good to know Wayland should also work, if I can ever get the card working in Ubuntu.
BTW, I’m only using the eGPU for CUDA so don’t care about monitor support etc.
As I also mentioned in my post, in Windows 11 the eGPU works perfectly with my Framework, even supporting hot-unplug & plug, which is amazing. So for now when I need to use CUDA I switch to Windows. I’d still like to get the eGPU working in Linux some day.
@Mapleleaf, it worked! The open drivers and the kernel parameter did the trick. Thanks so much for figuring this out. torch.cuda.is_available() returns True so I think I can switch back to Linux now.
FYI, I’m on Ubuntu 22.04 LTS and the nvidia-driver-525-open returns a working GPU when I run nvidia-smi after setting the kernel parameter you mentioned. The hard part was removing every single trace of nvidia and cuda before I could install the open driver.
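For anyone who wants to repeat that check, a slightly fuller version might look like the sketch below. It assumes PyTorch is installed with CUDA support; the helper name `cuda_report` is just for illustration.

```python
def cuda_report():
    """Return a small dict describing whether PyTorch can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    names = []
    if available:
        # One entry per visible GPU, e.g. the card in the eGPU enclosure.
        names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return {"torch_installed": True, "cuda_available": available, "devices": names}


print(cuda_report())
```

If `cuda_available` is False while nvidia-smi lists the card, the PyTorch build is the likely suspect rather than the driver.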
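Since leftover packages were the hard part, here is a rough sketch of one way to spot them on a Debian/Ubuntu system by scanning the text printed by `dpkg -l`. The function name `leftover_nvidia_packages` is hypothetical, and matching on "nvidia" and "cuda" in the package name is an assumption that may miss some packages or flag ones you want to keep.

```python
def leftover_nvidia_packages(dpkg_output):
    """Return (status, package) pairs whose package name mentions nvidia or cuda.

    dpkg_output is the text printed by `dpkg -l`. A status of "ii" means
    installed; "rc" means removed but with config files left behind.
    """
    found = []
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        status, name = parts[0], parts[1]
        # Package rows start with a short alphabetic status such as "ii" or "rc";
        # the header rows of `dpkg -l` do not.
        if not (status.isalpha() and len(status) in (2, 3)):
            continue
        name = name.split(":")[0]  # drop an architecture suffix like ":amd64"
        if "nvidia" in name.lower() or "cuda" in name.lower():
            found.append((status, name))
    return found
```

Run `dpkg -l > packages.txt`, pass the file contents to the function, and review anything it lists before installing the open driver.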
Can you provide an update of the situation and maybe a shopping list? It looks like your case and mine are quite similar. This is what I want
Linux (anything else is a non-starter)
GPU used only for ML, never as graphics card (I want to continue using the built-in one)
I don’t want to do anything special when I decide to start working on a ML project or stop, so most of the time the GPU would be off (I don’t want the noise, or the electricity usage if I’m not doing ML), but when I’m ready it should be a matter of just turning it on / plugging it in.
I don’t want to change the external monitor HDMI cable, for example. This is why I prefer to continue using the built-in graphics card, which is more than adequate for my day-to-day use.
The eGPU continues to work great for compute with PyTorch on my 12th gen Framework running Ubuntu. Haven’t tried TensorFlow but assume it’d work as well. I’ve never plugged a monitor into the GPU, only used it for compute like you.
Just to be clear, the eGPU hot plugs, but doesn’t hot-unplug (the GUI will disappear if the TB cable is unplugged but won’t come back when plugged back in without a reboot).
Here’s my shopping list:
Ubuntu 22.04 LTS with kernel 6.1.9 (I’ve run earlier kernels successfully - I upgraded from the stock kernel in the hope of fixing some cursor jerkiness, which it did)
NVIDIA open kernel driver 525 (per this thread)
RTX 3060 Ti Founders Edition inside a Razer Core X eGPU enclosure
I’ve also got an RTX 2060 inside an Akitio Node Titan enclosure in another location and it works just as well, though a bit slower for compute because of the chip & less RAM.
I think the open kernel 525 drivers & setting the correct kernel parameters were what made it work. That and making sure EVERY TRACE of old NVIDIA drivers was removed before installing.