[RESOLVED] NVIDIA drivers failing to load eGPU on Ubuntu 22.04.1 BIOS 3.06 beta

I’m having a tough time getting the NVIDIA drivers to recognize my RTX 3060 Ti graphics card in an eGPU enclosure (I’ve tried two: the Razer Core X and an Akitio Node Titan). I’ve also tried Fedora 37, and pretty much every driver version from 470 to 525. I’m installing with apt (sudo apt install nvidia-driver-xxx). I’ve tried both the open and the regular drivers, to no avail.

Looking at the kernel ring buffer output from dmesg, I see several failures:

[ 7.735454] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x26:0x56:1253)
[ 7.735549] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 7.736179] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiDevice
[ 7.736604] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to register device
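For anyone comparing their own logs against the ones above, here is a minimal Python sketch that pulls the NVRM failure lines out of dmesg output. The function name find_nvrm_failures and the regular expression are my own illustration, not part of any NVIDIA tool, and the pattern only assumes the "NVRM: GPU <pci address>: <message>" layout seen in this post.

```python
import re

# Assumed layout, taken from the log lines quoted in this post:
# [ 7.735454] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x26:0x56:1253)
NVRM_FAILURE = re.compile(
    r"NVRM: GPU (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"(?P<message>.*failed.*)"
)


def find_nvrm_failures(dmesg_text):
    """Return (pci_address, message) pairs for NVRM lines that report a failure."""
    failures = []
    for line in dmesg_text.splitlines():
        match = NVRM_FAILURE.search(line)
        if match:
            failures.append((match.group("pci"), match.group("message").strip()))
    return failures


sample = "[ 7.735454] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x26:0x56:1253)\n"
for pci, message in find_nvrm_failures(sample):
    print(pci, "->", message)
```

To run it against a real log, the output of sudo dmesg can be saved to a file and its contents passed to find_nvrm_failures.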

When I run lspci, the eGPU is found:

04:00.0 VGA compatible controller: NVIDIA Corporation GA104 GeForce RTX 3060 Ti
04:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)

The eGPU is also listed as a Thunderbolt device in Settings.

Googling the errors, it looks like there’s some issue between the NVIDIA Linux drivers and the BIOS. I’ve tried every version of the drivers I could find and they all give the same errors.

Anyone have any ideas? If you’ve got the eGPU working in Ubuntu, which drivers did you run & how did you install them?

Fortunately the eGPU seems to work on Windows. I’m mainly using it for CUDA with PyTorch so unfortunately I’m having to live in Windows now, but I’d really like to go back to Ubuntu.

Any help much appreciated!

I also have BIOS 3.06, a Razer Core X, and a 3060 Ti! We have a lot in common.

Although, I am running Artix (a version of Arch Linux without systemd).

I am able to get the eGPU working (with nvidia-open) ONLY IF it is plugged in at boot time. Then I use it both for ML (like you) and for gaming.

I think there is a regression somewhere, because a long time ago I could just hotplug it and use it with PyTorch and such. The only thing that has never worked when hotplugging is use by Xorg, but that is not critical for either ML or gaming (many games happily use GPU acceleration without the help of Xorg).

Artix huh? Which specific version of the drivers work for you?

I tried Ubuntu and Fedora only because they’ve been vetted on the Framework, but I’ll give Artix a try if it works. Anything but Windows… but in a bitter twist, Windows found the GPU instantly, took about 30 minutes to install CUDA, Python, PyTorch etc., and seems to be OK with hot plugging. Still, I’ll go back to Linux if I can get it working.

I’m using the package nvidia-open-dkms 525.60.11-3 for the drivers.

And my kernel is 6.0.12-artix1-1.

I may create a bootable Ubuntu 22.04.1 microSD to try it out and see if I can get the eGPU to work, and then I’ll report back here (although probably in a few days, as I can’t right now).

I recommend trying egpu-switcher unless you’re an xorg.conf wizard. There are some specific configuration files that need to be created in order to use an external GPU on Linux, and egpu-switcher both adds those files and dynamically switches back to the Iris Xe if the eGPU isn’t detected on boot.

Be_Far, thanks for the suggestion.

I installed & configured egpu-switcher. It found the NVIDIA eGPU & set up the config files, but the NVIDIA drivers are still unable to find the eGPU (same errors at boot, and nvidia-smi still can’t detect the eGPU).

Everything I’ve tried except the NVIDIA drivers detects the eGPU, very odd.

Just to clarify, these are with 12th gen Framework laptops?

If nvidia-smi isn’t seeing the card, I would question whether or not the driver is installed? How was the NVIDIA driver installed?

A few things to check:

  • Plug in your eGPU, open ‘Software & Updates’, go to the ‘Additional Drivers’ tab, and select the appropriate NVIDIA driver.

  • On Ubuntu 22.04 on a 12th gen Framework, open Settings, then Thunderbolt, click Unlock, and make sure Direct Access is toggled on.

  • Wayland has better support than Xorg, although egpu-switcher helps with the GPU switching for Xorg.

Yup, 12th gen.

I’ve tried installing the drivers multiple ways…

  • Using the GUI as you described
  • sudo apt install nvidia-driver-xxx (tried 525, 515, 470 and 460)
  • sudo apt install cuda (desperate attempt hoping it would ‘just work’, which it didn’t)

From my original post, there are NVRM log items showing the driver trying to init something, but failing, leading to a cascade of errors.

Yup, Thunderbolt shows the eGPU & Direct Access is on.

Interesting that Wayland should also work. I’ve been using Xorg exclusively for the past while trying to get the eGPU to work, but good to know Wayland should also work, if I can ever get the card working in Ubuntu.

BTW, I’m only using the eGPU for CUDA so don’t care about monitor support etc.

As I also mentioned in my post, in Windows 11 the eGPU works perfectly with my Framework, even supporting hot-unplug & plug, which is amazing. So for now when I need to use CUDA I switch to Windows. I’d still like to get the eGPU working in Linux some day.

@Steven_Kasapi
I ran some tests on Xubuntu 22.10 installed on an SD card, and I can confirm that:

  • Hotplug doesn’t work (it works PCI-wise, but nvidia-smi says “No devices found”)
  • If I start the laptop with the eGPU already ON and plugged-in, it does not even boot (it freezes in the middle of the boot sequence)

But it works well on Artix Linux (when plugged-in at boot).

@Steven_Kasapi
Very good news!!

I got it to work by using the kernel parameter nvidia.NVreg_OpenRmEnableUnsupportedGpus=1.
Source: (K)Ubuntu 22.10 not booting (kernel OOPS) for driver >450 with eGPU - #3 by generix - Linux - NVIDIA Developer Forums
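To confirm that the parameter actually reached the running kernel, a small Python sketch along these lines could be used. It reads /proc/cmdline, which Linux exposes for the current boot; the helper name kernel_param_value is hypothetical and only for illustration.

```python
from pathlib import Path

PARAM = "nvidia.NVreg_OpenRmEnableUnsupportedGpus"


def kernel_param_value(cmdline, name=PARAM):
    """Return the value of the first `name=value` token on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


# /proc/cmdline holds the command line of the currently running kernel (Linux only).
proc_cmdline = Path("/proc/cmdline")
if proc_cmdline.exists():
    value = kernel_param_value(proc_cmdline.read_text())
    if value is None:
        print(f"{PARAM} is not set for this boot")
    else:
        print(f"{PARAM}={value}")
```

If it reports that the parameter is not set, the bootloader configuration probably was not regenerated or the machine has not been rebooted since the change.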

In summary:

  • used the distribution Xubuntu 22.10 (full bootable installation on an SD card)
  • installed the package nvidia-driver-525-open
  • added the kernel parameter nvidia.NVreg_OpenRmEnableUnsupportedGpus=1
  • hotplugging works!! (as well as when already plugged-in at boot time)
  • have not yet tried hot-UNplugging, as it is widely reported to be very messy (in my experience too)
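The list above does not say how the kernel parameter was added. One common route, assuming the system boots through GRUB and keeps its defaults in /etc/default/grub (an assumption, not something confirmed in this thread), is to append it to GRUB_CMDLINE_LINUX_DEFAULT and then run sudo update-grub. The Python sketch below only shows the text transformation; add_kernel_param is a hypothetical helper and it does not write to any file.

```python
import re

PARAM = "nvidia.NVreg_OpenRmEnableUnsupportedGpus=1"

# Assumes the usual double-quoted form: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
CMDLINE_LINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param=PARAM):
    """Return grub_text with `param` appended to GRUB_CMDLINE_LINUX_DEFAULT if missing."""

    def _append(match):
        existing = match.group(2).split()
        if param in existing:
            return match.group(0)  # already present, leave the line untouched
        return match.group(1) + " ".join(existing + [param]) + match.group(3)

    return CMDLINE_LINE.sub(_append, grub_text, count=1)


before = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(before))
```

Keeping the edit as a pure function makes it easy to review the resulting line before changing the real file by hand.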

Now it works completely and nvidia-smi gives me a sensible output!
(when hotplugging, you may have to wait several seconds before it reflects in the output of nvidia-smi)
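Since the card can take a few seconds to show up after hotplugging, a polling loop is one way to wait for it. This is a rough Python sketch, assuming nvidia-smi is on the PATH and that nvidia-smi -L prints one line starting with "GPU " per detected card; the function names, the timeout, and the interval are arbitrary choices of mine.

```python
import shutil
import subprocess
import time


def list_gpus():
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


def wait_for_gpu(timeout_s=30.0, interval_s=2.0, probe=list_gpus):
    """Call `probe` repeatedly until it reports a GPU or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        gpus = probe()
        if gpus or time.monotonic() >= deadline:
            return gpus
        time.sleep(interval_s)
```

Calling wait_for_gpu() right after plugging in the enclosure returns the detected GPU lines, or an empty list if nothing appeared within the timeout.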

Hope it helps!

This looks very promising! Please keep us posted.

@Mapleleaf, it worked! The open drivers and the kernel parameter did the trick. Thanks so much for figuring this out. torch.cuda.is_available() returns True, so I think I can switch back to Linux now.

FYI, I’m on Ubuntu 22.04 LTS, and nvidia-driver-525-open gives me a working GPU in nvidia-smi after setting the kernel parameter you mentioned. The hard part was removing every single trace of nvidia and cuda before I could install the open driver.
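For hunting down leftovers, something like the following Python sketch can list what dpkg still knows about. It only lists packages and removes nothing. The name filter (anything containing "nvidia" or "cuda") is my own assumption about what counts as a leftover, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess

# Assumption: package names containing "nvidia" or "cuda" are the leftovers of interest.
LEFTOVER = re.compile(r"nvidia|cuda", re.IGNORECASE)


def filter_leftovers(package_names):
    """Return the sorted subset of package names that look NVIDIA or CUDA related."""
    return sorted(name for name in package_names if LEFTOVER.search(name))


def known_packages():
    """Return package names from dpkg's database, or [] where dpkg-query is unavailable."""
    if shutil.which("dpkg-query") is None:
        return []
    result = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Package}\n"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.split()


for name in filter_leftovers(known_packages()):
    print(name)
```

Anything it prints is a candidate for purging with apt before installing the open driver, after checking that nothing else depends on it.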

Very glad that it worked for you too!
Marking it solved in the title.

Oh, this is very interesting. I’ll test this parameter when an nvidia driver releases for the 6.1 or 6.2 kernel (as those fix issues booting Linux from a storage expansion card).

Thanks a lot! This trick also worked on Arch Linux with my new RTX 3060 in my Razer Core X enclosure! I was close to sending it back! :grinning:

Seeing much success here, this is great!

Thank you. This also helped on a ROG Zephyrus G15 GA503RM.

Can you provide an update on the situation and maybe a shopping list? It looks like your case and mine are quite similar. This is what I want:

  • Linux (anything else is a non-starter)
  • ML workloads
  • GPU used only for ML, never as graphics card (I want to continue using the built-in one)
  • Hotplug

I don’t want to do anything special when I decide to start working on a ML project or stop, so most of the time the GPU would be off (I don’t want the noise, or the electricity usage if I’m not doing ML), but when I’m ready it should be a matter of just turning it on / plugging it in.

I don’t want to change the external monitor HDMI cable, for example. This is why I prefer to continue using the built-in graphics card, which is more than adequate for my day-to-day use.

The eGPU continues to work great for compute with PyTorch on my 12th gen Framework running Ubuntu. I haven’t tried TensorFlow but assume it’d work as well. I’ve never plugged a monitor into the GPU; like you, I’ve only used it for compute.

Just to be clear, the eGPU hot-plugs, but doesn’t hot-unplug (the GPU will disappear if the TB cable is unplugged but won’t come back when plugged back in without a reboot).

Here’s my shopping list:

  • Ubuntu 22.04 LTS with kernel 6.1.9 (I’ve run earlier kernels successfully - I upgraded from the stock kernel in the hope of fixing some cursor jerkiness, which it did)
  • NVIDIA open kernel driver 525 (per this thread)
  • RTX 3060 Ti Founder’s Edition inside a Razer Core X eGPU enclosure
  • I’ve also got an RTX 2060 inside an Akitio Node Titan enclosure in another location and it works just as well, though a bit slower for compute because of the chip & less RAM.

I think the open kernel 525 drivers & setting the correct kernel parameters were what made it work. That and making sure EVERY TRACE of old NVIDIA drivers was removed before installing.

@Steven_Kasapi
What CUDA install method did you use?

Edit: Nvm, it just works without installing cuda. I guess cuda comes with the open drivers?
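A likely explanation, offered as my understanding rather than something confirmed in this thread: the prebuilt PyTorch packages usually bundle their own CUDA runtime libraries, so only the NVIDIA driver has to be installed system-wide. A short Python sketch to see what PyTorch itself reports (describe_cuda is a hypothetical helper name):

```python
import importlib.util


def describe_cuda():
    """Summarize whether PyTorch is installed and whether it can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} cannot see a CUDA device"
    return (
        f"PyTorch {torch.__version__} (built for CUDA {torch.version.cuda}) "
        f"sees {torch.cuda.get_device_name(0)}"
    )


print(describe_cuda())
```

The CUDA version printed here is the one the PyTorch build was compiled against, which can differ from the version nvidia-smi shows for the driver.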