eGPU issues with AMD FW 16

Here is the discussion thread for it.

https://lore.kernel.org/linux-kernel/20240416114955.GT223006@ziepe.ca/T/#r422b6d80178fe3f91c544b6322bdc5bdfd39521f

Lately my Framework 16 will not consistently detect the eGPU, but when it does it works fine. I have found experimentally that disconnecting and reconnecting the type-c expansion card tends to help it discover the eGPU. If anyone is having issues where sometimes it connects and sometimes it doesn’t, try that. Note that it doesn’t work every time.

OS: Arch
Kernel: 6.9.4-zen-1-zen

I’m using my eGPU on Windows 11, and for me it was just plug and play. Razer Core X Chroma with a 6900 XT as well. https://youtu.be/RcDk_DsIMT8?si=bFUSuZPvE5D8QCqV

I’m still having this recent issue of my eGPU no longer working. It doesn’t appear to be either the GPU or the eGPU enclosure (Razer Core X): I’ve tested with both an RTX 3070 Ti and an RTX 3080 Ti, and neither works on the FW 16, but I don’t have the same issue on my ThinkPad. I was able to determine that lspci can see the device, but the NVIDIA GPU never shows up in /dev/dri as a listed card. It would appear something is going awry when the driver loads. If I run dmesg | grep -i nvidia I get the following output:

[  306.287690] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[  306.288565] nvidia 0000:65:00.0: enabling device (0000 -> 0003)
[  306.288667] nvidia 0000:65:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[  306.288685] NVRM: The NVIDIA GPU 0000:65:00.0
[  306.288832] nvidia 0000:65:00.0: probe with driver nvidia failed with error -1
[  306.288845] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  306.288846] NVRM: None of the NVIDIA devices were initialized.
[  306.288981] nvidia-nvlink: Unregistered Nvlink Core, major device number 235
[  307.160034] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[  307.161178] nvidia 0000:65:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none

I’m on Arch and have tested both the nvidia package and nvidia-dkms.

I don’t yet know if this is an issue with my configuration of Arch, or if it’s something to do with the FW 16/Bios. I’ve found at least one article of someone with a similar issue reporting that a bios update helped them, but I don’t believe that was a framework.

I had some issues with Arch + Razer Core X (AMD GPU).

Try amd_iommu=off in grub cmdline

add nvidia module in mkinitcpio

and this udev rule, 99-removable.rules:
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
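To see whether a rule like this has anything left to do, you can look at the same authorized attribute in sysfs. Below is a rough Python sketch that lists Thunderbolt devices still sitting at authorized = 0. The default path is the standard sysfs location for the thunderbolt bus; the function name and the idea of passing the root as a parameter are my own assumptions, purely for illustration.

```python
from pathlib import Path


def unauthorized_thunderbolt_devices(sysfs_root="/sys/bus/thunderbolt/devices"):
    """Return names of Thunderbolt devices whose 'authorized' attribute is 0."""
    pending = []
    root = Path(sysfs_root)
    if not root.is_dir():
        # No thunderbolt bus exposed (module not loaded, or no controller)
        return pending
    for dev in sorted(root.iterdir()):
        attr = dev / "authorized"
        # Not every entry under this directory exposes 'authorized'; skip those
        if attr.is_file() and attr.read_text().strip() == "0":
            pending.append(dev.name)
    return pending
```

An empty list after plugging in the enclosure would suggest authorization is not the problem and the issue lies further along, for example in PCIe enumeration or the GPU driver.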

Also, wait a few seconds before logging in. I don’t know why, but with recent kernels I have to wait 2-5 seconds before login.

There is a behavior change in the thunderbolt module: it now performs a reset during boot (just like Windows does). You can change that behavior on the kernel command line.

IIRC it’s

thunderbolt.host_reset=0
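Since several parameters are being suggested in this thread, it helps to confirm that what you added actually reached the running kernel. Here is a small Python sketch that parses a kernel command line into a dict; /proc/cmdline is the standard place to read it from, while the function names are hypothetical. It does not handle the kernel treating "-" and "_" as equivalent in parameter names.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        # Split on the first '=' only, so pci=hpbussize=0x33,... keeps its value whole
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

For example, read_cmdline().get("thunderbolt.host_reset") returning "0" would mean the parameter above was picked up by the bootloader config.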

I tried some of your suggestions, as well as a Thunderbolt fix I found on the Arch forums (adding pci=hpbussize=0x33,hpmemsize=256M to the kernel parameters), but what seems to have done the trick for me was simply waiting up to 10 seconds before logging in. That would explain the inconsistency of the issue, and once I was intentionally waiting, getting the eGPU working was highly repeatable. Not sure what this means, but at least I have a workaround for the moment.
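If the wait really is about the driver not being ready yet, polling for the card might be more reliable than counting to ten. This is only a sketch under that assumption, not a confirmed fix: it watches /sys/class/drm for a card whose PCI vendor ID matches (0x10de is NVIDIA, 0x1002 is AMD). The function names, timeout, and interval are arbitrary choices of mine.

```python
import time
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor IDs: 0x10de is NVIDIA, 0x1002 is AMD


def drm_cards_for_vendor(vendor_id, drm_root="/sys/class/drm"):
    """Return DRM card names (card0, card1, ...) whose PCI vendor matches."""
    cards = []
    root = Path(drm_root)
    if not root.is_dir():
        return cards
    for entry in sorted(root.iterdir()):
        # Keep only cardN entries; skip connectors (card1-DP-1) and render nodes
        if not entry.name.startswith("card") or "-" in entry.name:
            continue
        vendor_file = entry / "device" / "vendor"
        if vendor_file.is_file() and vendor_file.read_text().strip().lower() == vendor_id:
            cards.append(entry.name)
    return cards


def wait_for_gpu(vendor_id, timeout=10.0, interval=0.5, drm_root="/sys/class/drm"):
    """Poll until a card from vendor_id appears; return [] if timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        cards = drm_cards_for_vendor(vendor_id, drm_root)
        if cards:
            return cards
        if time.monotonic() >= deadline:
            return []
        time.sleep(interval)
```

Something like wait_for_gpu(NVIDIA_VENDOR_ID) could run before the session starts; an empty result after the timeout would line up with the card missing from /dev/dri.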

As warned on the Arch wiki, it does not appear that unplugging and re-plugging the eGPU works, at least on my Plasma Wayland setup.

Hi friends - I’m experiencing similar issues on my FW13 7040 here. The eGPU appears to be recognized and outputs to an external monitor but I get a laggy desktop and poor performance in-game suggesting the iGPU is trying to do that 4k rendering. Weirdly, I think I’ve had it working buttery smooth a couple of times after hot-plugging the eGPU but I cannot find consistency.

I’ve:

  1. Followed the guide for Ubuntu and AMD.
  2. Added pci=hpbussize=0x33 hpmemsize=256M amd_iommu=off pcie_aspm=off to kernel parameters.
  3. Tried egpu-switcher.
  4. Tried a manual config alternative.
  5. Updated amdgpu firmware from latest release (not the rest of the kernel).
  6. Added a 10s delay to auto-login.

I note that the iGPU is always allocated as 0 (c1@0:0:0) and the eGPU as 1 (7@0:0:0) and both show as using the amdgpu driver. Updating amdgpu led to more/accurate information in LACT but not much else from my observations.

Should I go straight to upgrading the whole kernel to a more recent mainline (6.8+)?

If you set up MangoHud (the FPS counter used by the Steam Deck) and run your game with it, it can tell you which GPU you’re using. Here’s a (bad) screenshot I have on hand from playing Star Citizen the other day. You may or may not be able to read it, but it says the render device is my 3080 Ti.

https://wiki.archlinux.org/title/MangoHud


That is more definitive than listening to the fans (and low FPS) thank you, I will try and report back.

Annoyingly it all works perfectly in Win11 but I’m trying to fully migrate away.

When I’m running a game on steam, I add prime-run mangohud %command% into the game’s launch parameters. This tells it to use my nvidia gpu if it’s available (and to my good fortune will use my iGPU if the eGPU is not connected), enable the mangohud interface I’m a fan of, and then launch the game (%command%). It sounds like you’re using an AMD card in your enclosure though, so your mileage may vary.

On my system, some games actually perform significantly worse on the eGPU than on the iGPU, namely Baldur’s Gate 3. I’m hoping support for explicit sync resolves some of this in the upcoming NVIDIA 555 driver update.

Okay, installed MangoHud and added mangohud %command% to the launch options of two Steam games. Neither would launch; both just failed silently. Tried mangohud=1 %command% and the games launch, but the overlay doesn’t show. :person_shrugging:

I’m pretty convinced it is using the iGPU, not the eGPU, from viewing the GPU utilization, power usage, temperatures and so on.

~$ boltctl
● Intel Tamales Module 2
├─ type: peripheral
├─ name: Tamales Module 2
├─ vendor: Intel
├─ uuid: c1010000-0092-940e-03e6-2dd6b2427008
├─ generation: Thunderbolt 3
├─ status: authorized
│ ├─ domain: 535f3804-700d-b01e-ffff-ffffffffffff
│ ├─ rx speed: 20 Gb/s = 2 lanes * 10 Gb/s
│ ├─ tx speed: 20 Gb/s = 2 lanes * 10 Gb/s
│ └─ authflags: boot
├─ authorized: Tue 25 Jun 2024 16:38:19 UTC
├─ connected: Tue 25 Jun 2024 16:38:19 UTC
└─ stored: Mon 06 May 2024 18:39:41 UTC
   ├─ policy: iommu
   └─ key: no
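The rx/tx lines in that output are worth watching, since Thunderbolt 3 tops out at 40 Gb/s and the link here is reporting half of that. A small Python sketch for pulling those numbers out of boltctl text follows. It assumes the "N Gb/s = L lanes * P Gb/s" layout shown above; the function names are made up, and with several devices in the output the last one wins.

```python
import re

# Matches lines in the layout shown above, e.g.
#   rx speed: 20 Gb/s = 2 lanes * 10 Gb/s
SPEED_LINE = re.compile(
    r"(?P<dir>rx|tx) speed:\s*(?P<total>\d+) Gb/s = "
    r"(?P<lanes>\d+) lanes \* (?P<per_lane>\d+) Gb/s"
)


def link_speeds(boltctl_text):
    """Return {'rx': (total, lanes, per_lane), 'tx': (...)} from boltctl output."""
    speeds = {}
    for match in SPEED_LINE.finditer(boltctl_text):
        # If several devices are listed, later entries overwrite earlier ones
        speeds[match.group("dir")] = (
            int(match.group("total")),
            int(match.group("lanes")),
            int(match.group("per_lane")),
        )
    return speeds


def is_full_speed(speeds, expected_gbps=40):
    """True only if both directions were found and reach expected_gbps."""
    return {"rx", "tx"} <= speeds.keys() and all(
        total >= expected_gbps for total, _, _ in speeds.values()
    )
```

Run against the output above, this reports 20 Gb/s in both directions, so is_full_speed comes back False.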

~$ lspci

07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] (rev c0)

c1:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (rev c4)
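With two amdgpu devices, prime-run is not applicable (it is NVIDIA-specific), but Mesa lets you pick a GPU through the DRI_PRIME environment variable, which accepts a PCI tag of the form pci-0000_07_00_0. Here is a hedged Python sketch that turns lspci lines like the ones above into that tag. It assumes PCI domain 0000, since plain lspci output omits the domain, and the function names are hypothetical.

```python
import re

# Matches plain lspci lines such as:
#   07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. ...
VGA_LINE = re.compile(
    r"^(?P<bus>[0-9a-f]{2}):(?P<dev>[0-9a-f]{2})\.(?P<fn>[0-7]) "
    r"VGA compatible controller: (?P<name>.+)$"
)


def vga_controllers(lspci_text):
    """Return (address, description) for each VGA controller in lspci output."""
    found = []
    for line in lspci_text.splitlines():
        match = VGA_LINE.match(line.strip())
        if match:
            address = f"{match['bus']}:{match['dev']}.{match['fn']}"
            found.append((address, match["name"]))
    return found


def dri_prime_tag(address, domain="0000"):
    """Turn '07:00.0' into 'pci-0000_07_00_0' (domain 0000 is an assumption)."""
    bus, rest = address.split(":")
    dev, fn = rest.split(".")
    return f"pci-{domain}_{bus}_{dev}_{fn}"
```

For the RX 6950 XT at 07:00.0 this gives pci-0000_07_00_0, so a Steam launch option of DRI_PRIME=pci-0000_07_00_0 %command% would be one thing to try. Whether that cures the laggy desktop is a separate question.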

It may be time to try Mainline.

Tried going to 6.8.12 and it wouldn’t boot. Rebooted back into 6.5.0 and uninstalled 6.8.12.

I notice everything is snappy, I get 60fps in Manor Lords?!

The rx/tx speeds now show as:

40 Gb/s = 2 lanes * 20 Gb/s

eGPU is clearly being used. I wish I knew what changed this and knew it wouldn’t be different next boot!

It would appear that waiting 10 seconds before login on its own no longer fixes the NVIDIA driver not loading. It was 100% reproducible when I tested last time, which makes this a bummer. I may just buy an AMD GPU if those have better driver support, especially with the direction NVIDIA seems to be going with its approach to partnerships. AMD seems to be much more aligned with my personal philosophies, in no small part thanks to Mario.