AMD eGPU on Linux

Is anyone else experiencing weird complete freezes on hot unplug?

I found this similar issue: same behavior, but only on hot unplug.

You mean on Linux, I presume? Generally, before unplugging your eGPU you should first make sure no processes are running on it (on Nvidia cards you can use the nvidia-smi command; I’m not sure what AMD’s counterpart is), otherwise it will crash badly. With no processes running it should unplug cleanly. If it doesn’t, please post some logs and more detailed info here and on egpu.io.

Yes, I am on Linux. In fact, I am the author of the last comment on the issue I linked; I updated it with what seems to be a reproducible way to trigger the bug.

I couldn’t find a similar tool for AMD, though (there’s rocm-smi, but it doesn’t seem to have that feature).
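
In the absence of a dedicated tool, one generic approach is to look at which processes hold the GPU's device nodes open by walking /proc/<pid>/fd. The sketch below does that in Python. The device paths in the usage line are hypothetical; you would first have to work out which /dev/dri/card* and /dev/dri/renderD* nodes belong to the eGPU. Note that /dev/kfd is shared by all ROCm-capable GPUs, and that without root you only see your own processes.

```python
import os


def processes_using(device_paths, proc_root="/proc"):
    """Map pid -> sorted list of the given device paths that the process has open."""
    wanted = set(device_paths)
    users = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return {}
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target in wanted:
                users.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in users.items()}


if __name__ == "__main__":
    # Hypothetical node names: check which nodes are the eGPU on your system.
    for pid, paths in processes_using(["/dev/dri/card1", "/dev/dri/renderD129"]).items():
        print(pid, paths)
```

Expect the compositor to show up here for any card it drives, so an empty result is not guaranteed even when nothing "heavy" is running.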

AMD GPUs will go into runtime PM when not in use. You can use the standard runtime PM sysfs interface to check whether a card is currently active or suspended. You can read it like this:

❯ cat /sys/class/drm/card0/device/power/runtime_status
active

Swap that card0 out for whatever card identification it got.
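
If you don't know which cardN the eGPU ended up as, a small script can print the runtime PM state of every card at once. This is only a sketch that reads the same runtime_status attribute shown above; the function name and the "unknown" fallback are my own choices.

```python
from pathlib import Path


def runtime_pm_status(drm_root="/sys/class/drm"):
    """Return {card_name: runtime_status} for every cardN under drm_root."""
    statuses = {}
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        # Skip connector entries such as card0-DP-1; we only want the cards themselves.
        if "-" in card.name:
            continue
        status_file = card / "device" / "power" / "runtime_status"
        try:
            statuses[card.name] = status_file.read_text().strip()
        except OSError:
            statuses[card.name] = "unknown"  # attribute missing or unreadable
    return statuses


if __name__ == "__main__":
    for name, status in runtime_pm_status().items():
        print(f"{name}: {status}")
```

A card reporting "suspended" is idle from the kernel's point of view; "active" means something is still using it or it has not yet been put to sleep.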

For the sake of it, here’s my 50/50 experience.

I am using a Framework 13” with an AMD Ryzen AI 9 HX 370, a TB4 dock, and a rather old Radeon 5700 XT with 8GB VRAM. I’m running Kubuntu 24.04.3 LTS with the mainline 6.18.0-061800-generic kernel.

What works

  • Hot plug on a freshly booted system works flawlessly! Never experienced any issues with that.
  • Ollama with ROCm works too, and I even get a decent speedup when running qwen3:30b. However, I need to run service ollama restart every time for it to discover the newly connected eGPU.
  • Vulkan games do work without issues (Cyberpunk, Veloren)
  • No random disconnects or freezes. I was able to run the eGPU connected for several days in a row without issues.

What kinda works

  • Hot unplug. Upon yanking the cable, the kernel drops to the system console, vomits a few lines about a failed PCI device, then Xorg crashes and restarts to the login screen. Afterwards, I can log back in and continue using the laptop as usual. Standalone sleep works too. I haven’t noticed any system instability.

What doesn’t

  • I failed to make the external display ports work. The monitor just won’t pick up the signal.
  • Attempting to sleep with the eGPU connected results in a freeze upon resume.
  • The most annoying thing: after I unplug the eGPU, I am no longer able to plug it back in and make it work without rebooting the system. The device gets enumerated and is listed in lspci. However, ollama fails to recognize it (ROCm runner crashes). vulkaninfo also does not list it.

P.S.: @Mario_Limonciello Are there any methods to gracefully unplug the PCI device without crashing the Xorg session?

Use Wayland instead

Also, if ROCm isn’t working after a hot plug, I suspect there is a kfd driver bug. This is definitely not a case that’s tested. You should file some bugs with kernel logs and tracebacks.

Freezing when sleeping with the eGPU connected could be fixed by setting the eGPU as the primary GPU for the GNOME Mutter compositor. This can be done manually, or you can use my all-ways-egpu script, which does this for you automatically.

Just to clarify: on a clean run it does work. However, after I hot unplug and then plug it back in, it doesn’t.

I will report the issue.

Thank you, Wayland indeed solved the hotplug issues. Session no longer crashes.

When I yank the TB cable it now recovers gracefully:

[ 2867.015654] thunderbolt 0-0:2.1: retimer disconnected
[ 2867.015681] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
[ 2867.015687] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[ 2867.017232] thunderbolt 0-2: device disconnected
[ 2867.077994] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[ 2867.078015] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[ 2867.182415] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[ 2867.286376] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 2867.286408] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[ 2867.286419] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[ 2867.587415] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[ 2867.787740] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 2867.787782] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[ 2867.787793] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[ 2867.869152] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[ 2868.136518] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[ 2868.141567] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 2868.672502] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 2868.937378] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[ 2868.937960] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 2868.937962] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[ 2868.937966] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[ 2868.938054] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938057] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2868.938143] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938145] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2868.938230] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938231] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2868.938327] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938328] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2869.225087] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[ 2869.225090] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[ 2869.999307] pci_bus 0000:05: busn_res: [bus 05] is released
[ 2870.000718] pci_bus 0000:04: busn_res: [bus 04-05] is released
[ 2870.001181] pci_bus 0000:03: busn_res: [bus 03-5f] is released
[ 2870.001989] pci_bus 0000:02: busn_res: [bus 02-5f] is released

However, on an immediate reconnect it fails to initialize. Apparently the GPU was left in an inconsistent state:

[ 2984.317898] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[ 2984.317914] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[ 2984.318014] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[ 2984.318015] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[ 2984.318094] amdgpu 0000:05:00.0: amdgpu: failed to read discovery info from memory, vram size read: 0
[ 2984.318102] amdgpu 0000:05:00.0: amdgpu: [drm] *ERROR* discovery failed: -2
[ 2984.318105] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[ 2984.318108] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 2984.318123] amdgpu 0000:05:00.0: probe with driver amdgpu failed with error -2
[ 2984.318251] pci 0000:05:00.1: D0 power state depends on 0000:05:00.0
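
When collecting logs for a bug report, it can help to pull just the amdgpu failure lines out of a long dmesg dump. Below is a minimal sketch in Python; the marker strings are simply copied from the logs in this thread and are not an exhaustive list of amdgpu errors, and the function name is made up.

```python
import re

# Matches lines such as:
# [ 2984.318123] amdgpu 0000:05:00.0: probe with driver amdgpu failed with error -2
LINE_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+amdgpu\s+"
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+(.*)$"
)

# Substrings taken from the logs in this thread (assumption: not exhaustive).
MARKERS = (
    "device lost from bus",
    "Fatal error during GPU init",
    "failed with error",
)


def amdgpu_failures(dmesg_text):
    """Return (timestamp, pci_address, message) for amdgpu lines that look like failures."""
    hits = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group(3) for marker in MARKERS):
            hits.append((float(match.group(1)), match.group(2), match.group(3)))
    return hits
```

You could save the output of dmesg to a file and pass its contents to amdgpu_failures() to get a short list of timestamps to quote in the report.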

When I power cycled the eGPU enclosure and reconnected the TB cable, it finally initialized successfully:

[ 3098.412362] thunderbolt 0-2: new device found, vendor=0x215 device=0x2
[ 3098.412373] thunderbolt 0-2: TB4 HOME TB4 eGFX
[ 3099.141023] thunderbolt 0-0:2.1: new retimer found, vendor=0x1da0 device=0x8833
[ 3099.256198] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
[ 3099.256207] pcieport 0000:00:01.1: pciehp: Slot(0): Link Up
[ 3099.381658] pci 0000:01:00.0: [8086:1576] type 01 class 0x060400 PCIe Switch Upstream Port
[ 3099.381718] pci 0000:01:00.0: PCI bridge to [bus 00]
[ 3099.381736] pci 0000:01:00.0:   bridge window [io  0x0000-0x0fff]
[ 3099.381744] pci 0000:01:00.0:   bridge window [mem 0x00000000-0x000fffff]
[ 3099.381763] pci 0000:01:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.381785] pci 0000:01:00.0: enabling Extended Tags
[ 3099.382022] pci 0000:01:00.0: supports D1 D2
[ 3099.382025] pci 0000:01:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 3099.382818] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 8.000 Gb/s with 2.5 GT/s PCIe x4 link)
[ 3099.383117] pci 0000:01:00.0: Adding to iommu group 29
[ 3099.383318] pcieport 0000:00:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[ 3099.384983] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.385135] pci 0000:02:01.0: [8086:1576] type 01 class 0x060400 PCIe Switch Downstream Port
[ 3099.385184] pci 0000:02:01.0: PCI bridge to [bus 00]
[ 3099.385196] pci 0000:02:01.0:   bridge window [io  0x0000-0x0fff]
[ 3099.385201] pci 0000:02:01.0:   bridge window [mem 0x00000000-0x000fffff]
[ 3099.385217] pci 0000:02:01.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.385238] pci 0000:02:01.0: enabling Extended Tags
[ 3099.385384] pci 0000:02:01.0: supports D1 D2
[ 3099.385385] pci 0000:02:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 3099.385591] pci 0000:02:01.0: Adding to iommu group 30
[ 3099.385787] pci 0000:01:00.0: PCI bridge to [bus 02-5f]
[ 3099.385807] pci 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.385948] pci 0000:03:00.0: [1002:1478] type 01 class 0x060400 PCIe Switch Upstream Port
[ 3099.386015] pci 0000:03:00.0: BAR 0 [mem 0x00000000-0x00003fff]
[ 3099.386028] pci 0000:03:00.0: PCI bridge to [bus 00]
[ 3099.386045] pci 0000:03:00.0:   bridge window [io  0x0000-0x0fff]
[ 3099.386053] pci 0000:03:00.0:   bridge window [mem 0x00000000-0x000fffff]
[ 3099.386082] pci 0000:03:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.386385] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[ 3099.386630] pci 0000:03:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 3099.386978] pci 0000:03:00.0: Adding to iommu group 30
[ 3099.388966] pci 0000:02:01.0: PCI bridge to [bus 03-5f]
[ 3099.388989] pci 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.389170] pci 0000:04:00.0: [1002:1479] type 01 class 0x060400 PCIe Switch Downstream Port
[ 3099.389243] pci 0000:04:00.0: PCI bridge to [bus 00]
[ 3099.389260] pci 0000:04:00.0:   bridge window [io  0x0000-0x0fff]
[ 3099.389267] pci 0000:04:00.0:   bridge window [mem 0x00000000-0x000fffff]
[ 3099.389296] pci 0000:04:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.389608] pci 0000:04:00.0: PME# supported from D0 D3hot D3cold
[ 3099.390187] pci 0000:04:00.0: Adding to iommu group 30
[ 3099.390312] pci 0000:03:00.0: PCI bridge to [bus 04-5f]
[ 3099.390348] pci 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.390548] pci 0000:05:00.0: [1002:731f] type 00 class 0x030000 PCIe Legacy Endpoint
[ 3099.390666] pci 0000:05:00.0: BAR 0 [mem 0x00000000-0x0fffffff 64bit pref]
[ 3099.390675] pci 0000:05:00.0: BAR 2 [mem 0x00000000-0x001fffff 64bit pref]
[ 3099.390680] pci 0000:05:00.0: BAR 4 [io  0x0000-0x00ff]
[ 3099.390684] pci 0000:05:00.0: BAR 5 [mem 0x00000000-0x0007ffff]
[ 3099.390689] pci 0000:05:00.0: ROM [mem 0x00000000-0x0001ffff pref]
[ 3099.391108] pci 0000:05:00.0: PME# supported from D1 D2 D3hot D3cold
[ 3099.391458] pci 0000:05:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 3099.391663] pci 0000:05:00.0: Adding to iommu group 30
[ 3099.391687] pci 0000:05:00.0: vgaarb: setting as boot VGA device
[ 3099.391688] pci 0000:05:00.0: vgaarb: bridge control possible
[ 3099.391689] pci 0000:05:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 3099.391791] pci 0000:05:00.1: [1002:ab38] type 00 class 0x040300 PCIe Legacy Endpoint
[ 3099.391899] pci 0000:05:00.1: BAR 0 [mem 0x00000000-0x00003fff]
[ 3099.392144] pci 0000:05:00.1: PME# supported from D1 D2 D3hot D3cold
[ 3099.392437] pci 0000:05:00.1: Adding to iommu group 30
[ 3099.392584] pci 0000:04:00.0: PCI bridge to [bus 05-5f]
[ 3099.392616] pci_bus 0000:05: busn_res: [bus 05-5f] end is updated to 05
[ 3099.392628] pci_bus 0000:04: busn_res: [bus 04-5f] end is updated to 05
[ 3099.392637] pci_bus 0000:03: busn_res: [bus 03-5f] end is updated to 5f
[ 3099.392643] pci_bus 0000:02: busn_res: [bus 02-5f] end is updated to 5f
[ 3099.392663] pci 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392666] pci 0000:01:00.0: bridge window [mem 0x98000000-0xafffffff]: assigned
[ 3099.392667] pci 0000:01:00.0: bridge window [io  0x7000-0xafff]: assigned
[ 3099.392670] pci 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392672] pci 0000:02:01.0: bridge window [mem 0x98000000-0xafffffff]: assigned
[ 3099.392673] pci 0000:02:01.0: bridge window [io  0x7000-0xafff]: assigned
[ 3099.392676] pci 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392677] pci 0000:03:00.0: bridge window [mem 0x98000000-0xafefffff]: assigned
[ 3099.392678] pci 0000:03:00.0: BAR 0 [mem 0xaff00000-0xaff03fff]: assigned
[ 3099.392686] pci 0000:03:00.0: bridge window [io  0x7000-0xafff]: assigned
[ 3099.392688] pci 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392690] pci 0000:04:00.0: bridge window [mem 0x98000000-0xafefffff]: assigned
[ 3099.392691] pci 0000:04:00.0: bridge window [io  0x7000-0xafff]: assigned
[ 3099.392694] pci 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: assigned
[ 3099.392717] pci 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: assigned
[ 3099.392739] pci 0000:05:00.0: BAR 5 [mem 0x98000000-0x9807ffff]: assigned
[ 3099.392747] pci 0000:05:00.0: ROM [mem 0x98080000-0x9809ffff pref]: assigned
[ 3099.392748] pci 0000:05:00.1: BAR 0 [mem 0x980a0000-0x980a3fff]: assigned
[ 3099.392756] pci 0000:05:00.0: BAR 4 [io  0x7000-0x70ff]: assigned
[ 3099.392763] pci 0000:04:00.0: PCI bridge to [bus 05]
[ 3099.392768] pci 0000:04:00.0:   bridge window [io  0x7000-0xafff]
[ 3099.392778] pci 0000:04:00.0:   bridge window [mem 0x98000000-0xafefffff]
[ 3099.392785] pci 0000:04:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392798] pci 0000:03:00.0: PCI bridge to [bus 04-05]
[ 3099.392802] pci 0000:03:00.0:   bridge window [io  0x7000-0xafff]
[ 3099.392812] pci 0000:03:00.0:   bridge window [mem 0x98000000-0xafefffff]
[ 3099.392819] pci 0000:03:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392832] pci 0000:02:01.0: PCI bridge to [bus 03-5f]
[ 3099.392834] pci 0000:02:01.0:   bridge window [io  0x7000-0xafff]
[ 3099.392840] pci 0000:02:01.0:   bridge window [mem 0x98000000-0xafffffff]
[ 3099.392844] pci 0000:02:01.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392851] pci 0000:01:00.0: PCI bridge to [bus 02-5f]
[ 3099.392853] pci 0000:01:00.0:   bridge window [io  0x7000-0xafff]
[ 3099.392859] pci 0000:01:00.0:   bridge window [mem 0x98000000-0xafffffff]
[ 3099.392869] pci 0000:01:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392889] pcieport 0000:00:01.1: PCI bridge to [bus 01-5f]
[ 3099.392891] pcieport 0000:00:01.1:   bridge window [io  0x7000-0xafff]
[ 3099.392893] pcieport 0000:00:01.1:   bridge window [mem 0x98000000-0xafffffff]
[ 3099.392896] pcieport 0000:00:01.1:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.393149] pcieport 0000:01:00.0: enabling device (0000 -> 0003)
[ 3099.393365] pcieport 0000:02:01.0: enabling device (0000 -> 0003)
[ 3099.393531] pcieport 0000:02:01.0: pciehp: Slot #1 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
[ 3099.394106] pcieport 0000:03:00.0: enabling device (0000 -> 0003)
[ 3099.394335] pcieport 0000:04:00.0: enabling device (0000 -> 0003)
[ 3099.394688] pci 0000:05:00.0: disabling ATS
[ 3099.394799] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[ 3099.394816] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[ 3099.394904] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[ 3099.394906] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[ 3101.362812] amdgpu 0000:05:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
[ 3101.362824] amdgpu 0000:05:00.0: amdgpu: detected ip block number 1 <gmc_v10_0_0> (gmc_v10_0)
[ 3101.362827] amdgpu 0000:05:00.0: amdgpu: detected ip block number 2 <ih_v5_0_0> (navi10_ih)
[ 3101.362829] amdgpu 0000:05:00.0: amdgpu: detected ip block number 3 <psp_v11_0_0> (psp)
[ 3101.362832] amdgpu 0000:05:00.0: amdgpu: detected ip block number 4 <smu_v11_0_0> (smu)
[ 3101.362834] amdgpu 0000:05:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[ 3101.362836] amdgpu 0000:05:00.0: amdgpu: detected ip block number 6 <gfx_v10_0_0> (gfx_v10_0)
[ 3101.362838] amdgpu 0000:05:00.0: amdgpu: detected ip block number 7 <sdma_v5_0_0> (sdma_v5_0)
[ 3101.362840] amdgpu 0000:05:00.0: amdgpu: detected ip block number 8 <vcn_v2_0_0> (vcn_v2_0)
[ 3101.362842] amdgpu 0000:05:00.0: amdgpu: detected ip block number 9 <jpeg_v2_0_0> (jpeg_v2_0)
[ 3101.362881] amdgpu 0000:05:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[ 3101.493791] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 3101.493799] amdgpu: ATOM BIOS: xxx-xxx-xxx
[ 3101.496058] amdgpu 0000:05:00.0: vgaarb: deactivate vga console
[ 3101.496063] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 3101.496081] amdgpu 0000:05:00.0: amdgpu: PCIE atomic ops is not supported
[ 3101.496087] amdgpu 0000:05:00.0: amdgpu: GPU posting now...
[ 3101.496185] amdgpu 0000:05:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 3101.496216] amdgpu 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: releasing
[ 3101.496221] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: releasing
[ 3101.496243] pcieport 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496245] pcieport 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496246] pcieport 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496248] pcieport 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496249] pcieport 0000:00:01.1: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496261] pcieport 0000:00:01.1: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496264] pcieport 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496266] pcieport 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496268] pcieport 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496269] pcieport 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496271] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x39ffffffff 64bit pref]: assigned
[ 3101.496287] amdgpu 0000:05:00.0: BAR 2 [mem 0x3a00000000-0x3a001fffff 64bit pref]: assigned
[ 3101.496303] pcieport 0000:00:01.1: PCI bridge to [bus 01-5f]
[ 3101.496305] pcieport 0000:00:01.1:   bridge window [io  0x7000-0xafff]
[ 3101.496308] pcieport 0000:00:01.1:   bridge window [mem 0x98000000-0xafffffff]
[ 3101.496311] pcieport 0000:00:01.1:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496314] pcieport 0000:01:00.0: PCI bridge to [bus 02-5f]
[ 3101.496316] pcieport 0000:01:00.0:   bridge window [io  0x7000-0xafff]
[ 3101.496321] pcieport 0000:01:00.0:   bridge window [mem 0x98000000-0xafffffff]
[ 3101.496325] pcieport 0000:01:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496331] pcieport 0000:02:01.0: PCI bridge to [bus 03-5f]
[ 3101.496333] pcieport 0000:02:01.0:   bridge window [io  0x7000-0xafff]
[ 3101.496338] pcieport 0000:02:01.0:   bridge window [mem 0x98000000-0xafffffff]
[ 3101.496342] pcieport 0000:02:01.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496348] pcieport 0000:03:00.0: PCI bridge to [bus 04-05]
[ 3101.496351] pcieport 0000:03:00.0:   bridge window [io  0x7000-0xafff]
[ 3101.496358] pcieport 0000:03:00.0:   bridge window [mem 0x98000000-0xafefffff]
[ 3101.496363] pcieport 0000:03:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496372] pcieport 0000:04:00.0: PCI bridge to [bus 05]
[ 3101.496375] pcieport 0000:04:00.0:   bridge window [io  0x7000-0xafff]
[ 3101.496382] pcieport 0000:04:00.0:   bridge window [mem 0x98000000-0xafefffff]
[ 3101.496387] pcieport 0000:04:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496401] amdgpu 0000:05:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 3101.496404] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 3101.496425] [drm] Detected VRAM RAM=8176M, BAR=8192M
[ 3101.496426] [drm] RAM width 256bits GDDR6
[ 3101.496567] amdgpu 0000:05:00.0: amdgpu: amdgpu: 8176M of VRAM memory ready
[ 3101.496570] amdgpu 0000:05:00.0: amdgpu: amdgpu: 31787M of GTT memory ready.
[ 3101.496601] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 3101.496776] [drm] PCIE GART of 512M enabled (table at 0x00000081FEE00000).
[ 3101.498032] amdgpu 0000:05:00.0: amdgpu: [VCN instance 0] Found VCN firmware Version ENC: 1.21 DEC: 7 VEP: 0 Revision: 2
[ 3101.553873] amdgpu 0000:05:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 3101.597877] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 3101.603756] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 3101.603758] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 3101.603853] amdgpu 0000:05:00.0: amdgpu: use vbios provided pptable
[ 3101.603856] amdgpu 0000:05:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 3101.640541] amdgpu 0000:05:00.0: amdgpu: SMU is initialized successfully!
[ 3101.640967] amdgpu 0000:05:00.0: amdgpu: [drm] Display Core v3.2.351 initialized on DCN 2.0
[ 3101.640970] amdgpu 0000:05:00.0: amdgpu: [drm] DP-HDMI FRL PCON supported
[ 3101.648040] amdgpu 0000:05:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[ 3101.686201] amdgpu: HMM registered 8176MB device memory
[ 3102.188920] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 3102.189044] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 3102.189120] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 3102.693189] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 3102.693576] amdgpu: Virtual CRAT table created for GPU
[ 3102.694375] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[ 3102.694382] kfd kfd: amdgpu: added device 1002:731f
[ 3102.694416] amdgpu 0000:05:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 40
[ 3102.694423] amdgpu 0000:05:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 3102.694426] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 3102.694428] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 3102.694429] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 3102.694430] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 3102.694431] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 3102.694433] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 3102.694434] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 3102.694435] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 3102.694436] amdgpu 0000:05:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 3102.694437] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 3102.694439] amdgpu 0000:05:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 3102.694440] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 3102.694441] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 3102.694442] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 3102.694443] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 3102.698096] amdgpu 0000:05:00.0: amdgpu: Using BOCO for runtime pm
[ 3102.700204] amdgpu 0000:05:00.0: [drm] Registered 6 planes with drm panic
[ 3102.700208] [drm] Initialized amdgpu 3.64.0 for 0000:05:00.0 on minor 2
[ 3102.703170] amdgpu 0000:05:00.0: [drm] Cannot find any crtc or sizes

However, the thing is… now I am unable to get ollama to use ROCm :sweat_smile:

Dec 18 14:10:28 fw13 ollama[9888]: time=2025-12-18T14:10:28.997+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed"

rocminfo initially showed this:

korvin@fw13:~$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

After I power cycled the eGPU, it started doing this:

korvin@fw13:~$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument
korvin is member of render group
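
To check the /dev/kfd state without running the whole of rocminfo, you can try the same read-write open yourself and look at the error code. This is just a sketch of that check in Python; it does nothing ROCm-specific and the function name is made up.

```python
import errno
import os


def probe_device(path="/dev/kfd"):
    """Try to open path read-write; return "ok" or the symbolic errno name."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Fall back to the numeric value if the errno has no symbolic name.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    print("/dev/kfd:", probe_device())
```

The "Invalid argument" message above would correspond to EINVAL here, whereas a permissions problem would normally show up as EACCES.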

OLLAMA_VULKAN=1 mode works, though. So… partial success?

Can you please check /sys/class/kfd after you unplug to see if the eGPU node is still present? One entry should have been removed.

Indeed, that’s exactly what happens.

After a clean boot I have two entries in /sys/class/kfd/kfd/topology/nodes/:

  • 0 with gpu_id 0
  • 1 with gpu_id 35022, which is probably the eGPU

When I yank the cable and plug it back in, a new node 2 appears, again with gpu_id 35022. Every time I plug the eGPU in, a new node appears.
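
For a bug report it may be handy to show the leftover nodes directly. The sketch below groups the KFD topology nodes by their gpu_id and reports any non-zero gpu_id that appears under more than one node. It assumes the topology directory is /sys/class/kfd/kfd/topology/nodes (pass a different directory if yours differs); the function names are made up.

```python
from collections import defaultdict
from pathlib import Path


def kfd_nodes_by_gpu_id(nodes_dir="/sys/class/kfd/kfd/topology/nodes"):
    """Return {gpu_id: [node numbers]} read from each node's gpu_id file."""
    groups = defaultdict(list)
    for node in Path(nodes_dir).glob("[0-9]*"):
        try:
            node_number = int(node.name)
            gpu_id = int((node / "gpu_id").read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry
        groups[gpu_id].append(node_number)
    return {gpu_id: sorted(nodes) for gpu_id, nodes in groups.items()}


def duplicated_gpu_ids(groups):
    """Keep only non-zero gpu_ids that show up under more than one node."""
    # gpu_id 0 is what the CPU-only node reports above, so it is ignored.
    return {g: n for g, n in groups.items() if g != 0 and len(n) > 1}


if __name__ == "__main__":
    groups = kfd_nodes_by_gpu_id()
    print("nodes by gpu_id:", groups)
    print("duplicated:", duplicated_gpu_ids(groups))
```

With the situation described above, the second line would be expected to list 35022 together with both node numbers.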

I noticed that rocminfo succeeds before the disconnect and then fails like this:

korvin@fw13:/sys/class/kfd$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

Apparently it accesses node 1, which is now defunct.

P.S.: Just in case, here’s the clean rocminfo output:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.15
Runtime Ext Version:     1.7
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      49152(0xc000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5157                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65101096(0x3e15d28) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    65101096(0x3e15d28) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65101096(0x3e15d28) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65101096(0x3e15d28) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1150                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5390(0x150e)                       
  ASIC Revision:           4(0x4)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2900                               
  BDFID:                   49408                              
  Internal Node ID:        1                                  
  Compute Unit:            16                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 31                                 
  SDMA engine uCode::      14                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32550548(0x1f0ae94) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    32550548(0x1f0ae94) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1150         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx11-generic   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

OK, I think what might be going on is that with a “surprise hotplug” we never get a chance to clean up the kfd software nodes.

See if this helps:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4c8b03b6bf5..f40a83be2cac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5263,6 +5263,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
        if (drm_dev_is_unplugged(adev_to_drm(adev)))
                amdgpu_device_unmap_mmio(adev);
 
+       /* surprise hotplug */
+       if (pci_dev_is_disconnected(adev->pdev))
+               amdgpu_amdkfd_device_fini_sw(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)

Quite a quest here.

I am trying to patch and build the amdgpu module for my kernel. However, the kernel requires gcc-15, which is not available in my Kubuntu 24.04, so I had to build the compiler from source.

To save time, I am trying to patch and build only the one module instead of the whole kernel. Not sure if that will work.

In the end I managed to build the kernel module successfully, but make modules_install fails:

korvin@fw13:~/work/kernel/mainline-crack/drivers/gpu/drm/amd/amdgpu$ sudo make -C /lib/modules/$(uname -r)/build M=$PWD modules_install
[sudo] password for korvin: 
make: Entering directory '/usr/src/linux-headers-6.18.0-061800-generic'
make[1]: Entering directory '/home/korvin/work/kernel/mainline-crack/drivers/gpu/drm/amd/amdgpu'
  INSTALL /lib/modules/6.18.0-061800-generic/updates/amdgpu.ko
  SIGN    /lib/modules/6.18.0-061800-generic/updates/amdgpu.ko
At main.c:171:
- SSL error:FFFFFFFF80000002:system library::No such file or directory: ../crypto/bio/bss_file.c:67
- SSL error:10000080:BIO routines::no such file: ../crypto/bio/bss_file.c:75
sign-file: /usr/src/linux-headers-6.18.0-061800-generic/certs/signing_key.pem
  DEPMOD  /lib/modules/6.18.0-061800-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[1]: Leaving directory '/home/korvin/work/kernel/mainline-crack/drivers/gpu/drm/amd/amdgpu'
make: Leaving directory '/usr/src/linux-headers-6.18.0-061800-generic'

Could you give me a hint on what to do next, please?

P.S.: The last time I built a kernel myself was quite a while ago, in the era of kernels 2.4.18 and 2.5.65. A lot has changed since then :sweat_smile:

I wouldn’t build using the Ubuntu packaging. It’s a pain in the butt to get right. Upstream kernel source has Deb packaging you can use.

Clone the mainline or stable tree.

Apply the patch.

Add the kernel config in place. If you use the one from your current kernel you’ll need to turn off some of the cert options.

Then build using make bindeb-pkg -j$(nproc).

You’ll get a deb to install.
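
The steps above could be scripted roughly like this. It is only a sketch under a few assumptions: the v6.18 tag, the checkout directory and the patch path are placeholders, and SYSTEM_TRUSTED_KEYS / SYSTEM_REVOCATION_KEYS are my guess at the "cert options" meant here, since those are the ones Ubuntu-derived configs usually point at files that do not exist in an upstream tree. The helper names run and build_patched_kernel are made up for this example. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the build steps above. Commands are only printed unless DRY_RUN=0.

# Echo a command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# $1: absolute path to the saved patch (placeholder)
# $2: checkout directory (placeholder, defaults to ./linux)
build_patched_kernel() {
    local patch="$1" tree="${2:-linux}"
    local repo=https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

    # Clone the mainline tree at the tag to be patched (assumed: v6.18).
    run git clone --depth 1 --branch v6.18 "$repo" "$tree" || return
    # Apply the patch; git -C changes directory first, hence the absolute path.
    run git -C "$tree" apply "$patch" || return
    # Start from the running kernel's config.
    run cp "/boot/config-$(uname -r)" "$tree/.config" || return
    # Blank the certificate options that reference distro-only files (assumed).
    run "$tree/scripts/config" --file "$tree/.config" \
        --set-str SYSTEM_TRUSTED_KEYS "" \
        --set-str SYSTEM_REVOCATION_KEYS "" || return
    # Fill in defaults for any options new to this kernel version.
    run make -C "$tree" olddefconfig || return
    # Build the .deb packages.
    run make -C "$tree" -j"$(nproc)" bindeb-pkg
}
```

Calling build_patched_kernel /path/to/fix.diff prints the six commands; running it again with DRY_RUN=0 set in the environment executes them.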


Any progress with testing that?

Hi Mario,

I tried doing it that day, but failed to build the kernel due to a missing dwarf.h. Every tutorial I have seen mentions libdwarf-dev, but it was already installed and the issue remained. Then I got busy with my discrete ML research.

Also, in the meantime I used OLLAMA_VULKAN, which appeared to work decently on my setup, except for very slow load times on the iGPU.

Finally, today I found that installing libdw-dev apparently fixes the missing header issue. I will try testing your patch today.

Ok, I’ve just built vanilla 6.18.0 with your patch applied.

After a clean boot with no eGPU attached:

korvin@fw13:~$ uname -r
6.18.0
korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1

Here’s ROCm output:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.15
Runtime Ext Version:     1.7
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      49152(0xc000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5157                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1150                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5390(0x150e)                       
  ASIC Revision:           4(0x4)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2900                               
  BDFID:                   49408                              
  Internal Node ID:        1                                  
  Compute Unit:            16                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 31                                 
  SDMA engine uCode::      14                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    24556348(0x176b33c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    24556348(0x176b33c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1150         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx11-generic   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Relevant ollama log:

янв 04 15:39:29 fw13 ollama[6588]: time=2026-01-04T15:39:29.029+05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
янв 04 15:39:29 fw13 ollama[6588]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: found 1 ROCm devices:
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: initializing rocBLAS on device 0
янв 04 15:39:29 fw13 ollama[6588]: rocBLAS error: Cannot read /usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1150
янв 04 15:39:29 fw13 ollama[6588]:  List of available TensileLibrary Files :
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx908.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1151.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx942.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" devices=[]
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=1.02787478s OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=DEBUG source=runner.go:153 msg="filtering device which didn't fully initialize" id=0 libdir=/usr/local/lib/ollama/rocm pci_id=0000:c1:00.0 library=ROCm
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=TRACE source=runner.go:174 msg="supported GPU library combinations before filtering" supported=map[]
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=TRACE source=runner.go:183 msg="removing unsupported or overlapping GPU combination" libDir=/usr/local/lib/ollama/rocm description="AMD Radeon Graphics" compute=gfx1150 pci_id=0000:c1:00.0
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=2.274042056s
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=INFO source=types.go:60 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="46.8 GiB" available="40.1 GiB"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=INFO source=routes.go:1648 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
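
Reading the log above, the rocBLAS error suggests that the bundled library ships no Tensile file for gfx1150 (the list has gfx1151 but not gfx1150), which would explain why the iGPU is filtered out and ollama ends up on CPU. That is my interpretation of the log, not something confirmed in this thread. A small sketch to check a rocBLAS library directory, assuming the TensileLibrary_lazy_<arch>.dat naming seen in the log; the function names are made up for this example:

```shell
#!/usr/bin/env bash
# Succeeds if the rocBLAS library directory has a Tensile file for the arch.
# $1: rocBLAS library directory, $2: arch name such as gfx1150
has_tensile_arch() {
    local dir="$1" arch="$2"
    [ -e "$dir/TensileLibrary_lazy_${arch}.dat" ]
}

# List every arch for which a TensileLibrary_lazy_<arch>.dat file exists.
list_tensile_archs() {
    local dir="$1" f
    for f in "$dir"/TensileLibrary_lazy_gfx*.dat; do
        [ -e "$f" ] || continue           # the glob matched nothing
        f="${f##*/TensileLibrary_lazy_}"  # strip directory and file prefix
        echo "${f%.dat}"                  # strip the extension
    done | sort
}
```

For the setup in this post, the directory to check would be /usr/local/lib/ollama/rocm/rocblas/library, for example: has_tensile_arch /usr/local/lib/ollama/rocm/rocblas/library gfx1150 || echo "no gfx1150 library".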

Now I attach the eGPU's TB4 cable:

[  663.694730] amdgpu: HMM registered 8176MB device memory
[  664.195196] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[  664.195930] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  664.196007] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[  664.699497] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[  664.699888] amdgpu: Virtual CRAT table created for GPU
[  664.700462] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[  664.700470] kfd kfd: amdgpu: added device 1002:731f
...

korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1  2

korvin@fw13:~$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

After I restarted ollama, the eGPU was recognized and I was able to run the qwen3:8b model on it.
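
As mentioned earlier in the thread, it is generally safer to make sure nothing is still using the eGPU before unplugging it. The following is only a rough sketch that scans /proc for processes holding a given device node open. The function name is hypothetical, and whether open handles on /dev/kfd matter for this particular bug is an assumption.

```shell
#!/usr/bin/env bash
# Print the PIDs of processes that have the given path open.
# Run as root to see processes of other users (e.g. a system-wide ollama service).
pids_with_open() {
    local target="$1" fd pid
    for fd in /proc/[0-9]*/fd/*; do
        # Each /proc/<pid>/fd/<n> is a symlink to the file that is open.
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid="${fd#/proc/}"     # strip the leading /proc/
            echo "${pid%%/*}"      # keep only the PID component
        fi
    done | sort -un
}
```

Something like sudo bash -c '. ./pids.sh; pids_with_open /dev/kfd' (with the function saved in a file, here called pids.sh as an example) would list compute clients; the same can be done for the eGPU's /dev/dri/renderD* node, whose number depends on the system.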

Now I yank the TB4 cable:

[  916.452198] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
[  916.452206] thunderbolt 0-0:2.1: retimer disconnected
[  916.452213] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[  916.453612] thunderbolt 0-2: device disconnected
[  916.514923] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[  916.514945] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[  916.619952] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[  916.743173] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  916.743185] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[  916.743194] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[  916.743279] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  916.743281] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
...
[  917.282084] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[  917.334190] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[  917.605625] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[  917.611390] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[  918.216273] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[  918.517956] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[  918.518720] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  918.518723] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[  918.518730] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[  918.518819] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[  918.518824] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  918.518911] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[  918.518914] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  918.519001] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[  918.519003] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  918.519089] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[  918.519092] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  918.842104] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[  918.842111] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[  919.591291] pci_bus 0000:05: busn_res: [bus 05] is released
[  919.591409] pci_bus 0000:04: busn_res: [bus 04-05] is released
[  919.591519] pci_bus 0000:03: busn_res: [bus 03-5f] is released
[  919.592608] pci_bus 0000:02: busn_res: [bus 02-5f] is released



Node 2 was removed successfully!

korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1
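
To see which device each node number belongs to, rather than just the directory names, the gpu_id and name files inside each node directory can be printed as well. A minimal sketch; the function name is made up, and the optional argument exists only so the function can be pointed at another directory:

```shell
#!/usr/bin/env bash
# Print "<node> <gpu_id> <name>" for every KFD topology node.
# $1: topology directory (defaults to the sysfs path used above)
list_kfd_nodes() {
    local root="${1:-/sys/class/kfd/kfd/topology/nodes}" dir gpu_id name
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue          # no nodes present
        # Fall back to '?' when a file is missing or unreadable.
        gpu_id=$(cat "${dir}gpu_id" 2>/dev/null) || gpu_id='?'
        name=$(cat "${dir}name" 2>/dev/null) || name='?'
        dir="${dir%/}"                     # drop the trailing slash
        echo "${dir##*/} ${gpu_id} ${name}"
    done
}
```

Running list_kfd_nodes before and after pulling the cable shows whether the eGPU's node is really gone.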

However, rocminfo still does not work:

korvin@fw13:~$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument
korvin is member of render group

I reconnected the TB4 cable, and the eGPU was apparently initialized successfully, with no errors in dmesg:

[ 1044.936694] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[ 1044.936717] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[ 1044.937008] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[ 1044.937011] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[ 1046.902212] amdgpu 0000:05:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
[ 1046.902227] amdgpu 0000:05:00.0: amdgpu: detected ip block number 1 <gmc_v10_0_0> (gmc_v10_0)
[ 1046.902231] amdgpu 0000:05:00.0: amdgpu: detected ip block number 2 <ih_v5_0_0> (navi10_ih)
[ 1046.902234] amdgpu 0000:05:00.0: amdgpu: detected ip block number 3 <psp_v11_0_0> (psp)
[ 1046.902237] amdgpu 0000:05:00.0: amdgpu: detected ip block number 4 <smu_v11_0_0> (smu)
[ 1046.902240] amdgpu 0000:05:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[ 1046.902243] amdgpu 0000:05:00.0: amdgpu: detected ip block number 6 <gfx_v10_0_0> (gfx_v10_0)
[ 1046.902246] amdgpu 0000:05:00.0: amdgpu: detected ip block number 7 <sdma_v5_0_0> (sdma_v5_0)
[ 1046.902249] amdgpu 0000:05:00.0: amdgpu: detected ip block number 8 <vcn_v2_0_0> (vcn_v2_0)
[ 1046.902251] amdgpu 0000:05:00.0: amdgpu: detected ip block number 9 <jpeg_v2_0_0> (jpeg_v2_0)
[ 1046.902284] amdgpu 0000:05:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[ 1047.034560] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 1047.034569] amdgpu: ATOM BIOS: xxx-xxx-xxx
[ 1047.038291] amdgpu 0000:05:00.0: vgaarb: deactivate vga console
[ 1047.038296] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 1047.038314] amdgpu 0000:05:00.0: amdgpu: PCIE atomic ops is not supported
[ 1047.038321] amdgpu 0000:05:00.0: amdgpu: GPU posting now...
[ 1047.038439] amdgpu 0000:05:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 1047.038466] amdgpu 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: releasing
[ 1047.038472] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: releasing
...
[ 1047.038539] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x39ffffffff 64bit pref]: assigned
[ 1047.038558] amdgpu 0000:05:00.0: BAR 2 [mem 0x3a00000000-0x3a001fffff 64bit pref]: assigned
...
[ 1047.038703] amdgpu 0000:05:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 1047.038707] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 1047.038731] [drm] Detected VRAM RAM=8176M, BAR=8192M
[ 1047.038733] [drm] RAM width 256bits GDDR6
[ 1047.038912] amdgpu 0000:05:00.0: amdgpu: amdgpu: 8176M of VRAM memory ready
[ 1047.038915] amdgpu 0000:05:00.0: amdgpu: amdgpu: 23980M of GTT memory ready.
[ 1047.038958] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 1047.039204] [drm] PCIE GART of 512M enabled (table at 0x00000081FEE00000).
[ 1047.040824] amdgpu 0000:05:00.0: amdgpu: [VCN instance 0] Found VCN firmware Version ENC: 1.21 DEC: 7 VEP: 0 Revision: 2
[ 1047.097013] amdgpu 0000:05:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 1047.141891] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 1047.147893] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1047.147896] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 1047.147989] amdgpu 0000:05:00.0: amdgpu: use vbios provided pptable
[ 1047.147992] amdgpu 0000:05:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 1047.184943] amdgpu 0000:05:00.0: amdgpu: SMU is initialized successfully!
[ 1047.185661] amdgpu 0000:05:00.0: amdgpu: [drm] Display Core v3.2.351 initialized on DCN 2.0
[ 1047.185665] amdgpu 0000:05:00.0: amdgpu: [drm] DP-HDMI FRL PCON supported
[ 1047.193038] amdgpu 0000:05:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[ 1047.242675] amdgpu: HMM registered 8176MB device memory
[ 1047.746506] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 1047.746892] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 1047.746973] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 1048.250264] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 1048.250668] amdgpu: Virtual CRAT table created for GPU
[ 1048.251169] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[ 1048.251177] kfd kfd: amdgpu: added device 1002:731f
[ 1048.251267] amdgpu 0000:05:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 40
[ 1048.251276] amdgpu 0000:05:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1048.251280] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1048.251282] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1048.251285] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 1048.251287] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 1048.251289] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 1048.251290] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 1048.251292] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 1048.251293] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 1048.251295] amdgpu 0000:05:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 1048.251297] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1048.251299] amdgpu 0000:05:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 1048.251300] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 1048.251302] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 1048.251304] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 1048.251305] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 1048.255110] amdgpu 0000:05:00.0: amdgpu: Using BOCO for runtime pm
[ 1048.257697] amdgpu 0000:05:00.0: [drm] Registered 6 planes with drm panic
[ 1048.257704] [drm] Initialized amdgpu 3.64.0 for 0000:05:00.0 on minor 2
[ 1048.263330] amdgpu 0000:05:00.0: [drm] Cannot find any crtc or sizes
[ 1048.263726] pci 0000:05:00.1: D0 power state depends on 0000:05:00.0
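
Side note: in a log this long it is easy to miss something, so here is a small sketch for filtering dmesg text down to the lines that look suspicious. The pattern list (ERROR, "timed out", "expired", "fail") is purely my own guess at what is worth a second look, not anything official from amdgpu, and the function name suspicious_lines is hypothetical. With these patterns the two "Fence fallback timer expired on ring sdma0" lines above would be flagged; whether they matter here I can't say.

```python
import re

# Assumed heuristics, not an official list of amdgpu error strings.
SUSPICIOUS = re.compile(r"\*ERROR\*|timed out|expired|fail", re.IGNORECASE)


def suspicious_lines(dmesg_text, device="0000:05:00.0"):
    """Return dmesg lines that mention amdgpu (or the given PCI address)
    and match one of the assumed warning patterns."""
    hits = []
    for line in dmesg_text.splitlines():
        relevant = "amdgpu" in line or device in line
        if relevant and SUSPICIOUS.search(line):
            hits.append(line.strip())
    return hits
```

To use it, capture the output of dmesg into a string (for example with subprocess.run(["dmesg"], capture_output=True, text=True).stdout, which may need root) and pass it in.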

However, after I restarted ollama it failed to pick up the eGPU:

Jan 04 15:54:10 fw13 ollama[9976]: time=2026-01-04T15:54:10.341+05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
Jan 04 15:54:10 fw13 ollama[9976]: ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
Jan 04 15:54:10 fw13 ollama[9976]: load_backend: loaded ROCm backend from /usr/local/lib/ollama/rocm/libggml-hip.so

rocminfo still does not work:

korvin@fw13:~$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument
korvin is member of render group

So, in the end, reconnect still fails, but the sysfs node IDs no longer leak.

Ok that’s great progress at least. Would you be able to trace where the EINVAL is happening from the ioctl call?

Hm, that’s strange.

I ran strace rocminfo. The only calls that return EINVAL are:

openat(AT_FDCWD, "/dev/kfd", O_RDWR)    = -1 EINVAL (Invalid argument)
write(1, "\33[31mUnable to open /dev/kfd rea"..., 62Unable to open /dev/kfd read-write: Invalid argument
...
prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE) = 1
prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, CAP_CHECKPOINT_RESTORE) = 1
prctl(PR_CAPBSET_READ, 0x2c /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x2a /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x29 /* CAP_??? */) = -1 EINVAL (Invalid argument)
munmap(0x708ba1dc6000, 114647)          = 0

The file itself exists and is writable:

korvin@fw13:~$ ls -l /dev/kfd
crw-rw---- 1 root render 509, 0 Jan  4 19:52 /dev/kfd
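
The mode bits only say who is allowed to try; the open itself can still be rejected further down. For anyone who wants to check this without strace, here is a minimal sketch that attempts the same read-write open and returns the symbolic errno name. My assumption is that EACCES would point at permissions, while EINVAL (as in the strace output above) suggests the open is being refused somewhere past the permission check, presumably in the driver. The helper name probe_open is made up.

```python
import errno
import os


def probe_open(path="/dev/kfd"):
    """Try to open path read-write; return "ok" or the errno name."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 22 -> "EINVAL".
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    print("/dev/kfd:", probe_open())
```

Run it as the same user that runs rocminfo so the group membership is the same.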

AppArmor is now disabled.