Anyone else also experiencing weird complete freezes on hot unplug?
Found this similar issue: same behavior, but only on hot-unplug.
You mean on Linux, I presume? Generally, before unplugging your eGPU you should first make sure no processes are running on it (on Nvidia cards you can use the nvidia-smi command; I’m not sure what AMD’s counterpart is), otherwise it will crash badly. With no processes running it should unplug cleanly. If it doesn’t, please post some logs and more detailed info here and on egpu.io.
Yes, I am on Linux. In fact, I am the author of the last comment on the issue I linked.
I updated it with what seems to be a reproducible way to trigger the bug.
I couldn’t find a similar tool for AMD, though (there’s rocm-smi, but it doesn’t seem to have that feature).
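In the absence of an AMD counterpart, one rough way to see which processes hold a GPU device node open is to scan the file descriptors under /proc. The sketch below is only an illustration: the function name pids_using_device and the node /dev/dri/renderD129 are hypothetical, and the right node for an eGPU has to be looked up on the machine in question. It also cannot attribute ROCm compute work to one GPU, because /dev/kfd is a single node shared by all of them.

```shell
# List PIDs that hold the given device node open, by scanning <proc_root>/<pid>/fd.
# Usage: pids_using_device /dev/dri/renderD129 [proc_root]
# Run as root to see processes owned by other users.
pids_using_device() {
    dev=$1
    proc_root=${2:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file the process has open.
        [ "$(readlink "$fd" 2>/dev/null)" = "$dev" ] || continue
        pid=${fd#"$proc_root"/}   # strip the leading proc root
        echo "${pid%%/*}"         # keep only the PID component
    done | sort -un
}
```

For example, `sudo sh -c '. ./gpu-pids.sh; pids_using_device /dev/dri/renderD129'` (file name hypothetical) prints one PID per line, or nothing if the node is idle.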
AMD GPUs will go into runtime PM when not in use. You can use the standard runtime PM API to check if they’re there. You can read it like this:
❯ cat /sys/class/drm/card0/device/power/runtime_status
active
Swap that card0 out for whatever card identification it got.
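If it is not obvious which card number the eGPU received, the same file can be read for every card at once. This is a minimal sketch; the function name gpu_runtime_status is made up for illustration, and it assumes the usual /sys/class/drm layout where connector entries (card0-DP-1 and similar) sit next to the cards themselves.

```shell
# Print "<card> <runtime_status>" for each DRM card under the given sysfs root.
# Usage: gpu_runtime_status [drm_root]
gpu_runtime_status() {
    drm_root=${1:-/sys/class/drm}
    for card in "$drm_root"/card*; do
        name=${card##*/}
        # Skip connector entries such as card0-DP-1; keep card0, card1, ...
        case $name in *-*) continue ;; esac
        status_file=$card/device/power/runtime_status
        [ -r "$status_file" ] || continue
        echo "$name $(cat "$status_file")"
    done
}
```

A card reporting suspended rather than active is in runtime PM and is not currently being used.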
For the sake of it, here’s my 50/50 experience.
I am using a Framework 13” with an AMD Ryzen AI 9 HX 370, a TB4 dock, and a rather old Radeon 5700 XT with 8GB VRAM. Running Kubuntu 24.04.3 LTS with the mainline 6.18.0-061800-generic kernel.
What works
qwen3:30b. Though, every time I need to do service ollama restart for it to discover the newly connected eGPU.
What kinda works
What doesn’t
lspci. However, ollama fails to recognize it (ROCm runner crashes). vulkaninfo also does not list it.
P.S.: @Mario_Limonciello Are there any methods to gracefully unplug the PCI device without crashing the XOrg session?
Use Wayland instead
Also, if ROCm isn’t working after a hot plug, I suspect there is a kfd driver bug. This is definitely not a case that’s tested. You should file some bugs with kernel logs and tracebacks.
Freezing when sleeping with the egpu connected could be fixed by setting the egpu as the primary GPU for the gnome mutter compositor. This can be done manually or you can use my all-ways-egpu script that does this for you automatically.
Just to clarify: on a clean run it does work. However, after I hot unplug and then plug it back in, it doesn’t.
I will report the issue.
Thank you, Wayland indeed solved the hotplug issues. Session no longer crashes.
When I yank the TB cable it now recovers gracefully:
[ 2867.015654] thunderbolt 0-0:2.1: retimer disconnected
[ 2867.015681] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
[ 2867.015687] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[ 2867.017232] thunderbolt 0-2: device disconnected
[ 2867.077994] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[ 2867.078015] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[ 2867.182415] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[ 2867.286376] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 2867.286408] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[ 2867.286419] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[ 2867.587415] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[ 2867.787740] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 2867.787782] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[ 2867.787793] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[ 2867.869152] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[ 2868.136518] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[ 2868.141567] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 2868.672502] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 2868.937378] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[ 2868.937960] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 2868.937962] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[ 2868.937966] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[ 2868.938054] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938057] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2868.938143] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938145] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2868.938230] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938231] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2868.938327] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000003e6bd737; ring_buffer_end = 00000000f0ed301f; write_frame = 000000002871204e
[ 2868.938328] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 2869.225087] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[ 2869.225090] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[ 2869.999307] pci_bus 0000:05: busn_res: [bus 05] is released
[ 2870.000718] pci_bus 0000:04: busn_res: [bus 04-05] is released
[ 2870.001181] pci_bus 0000:03: busn_res: [bus 03-5f] is released
[ 2870.001989] pci_bus 0000:02: busn_res: [bus 02-5f] is released
However, on an immediate reconnect it fails to initialize. Apparently the GPU was left in an inconsistent state:
[ 2984.317898] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[ 2984.317914] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[ 2984.318014] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[ 2984.318015] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[ 2984.318094] amdgpu 0000:05:00.0: amdgpu: failed to read discovery info from memory, vram size read: 0
[ 2984.318102] amdgpu 0000:05:00.0: amdgpu: [drm] *ERROR* discovery failed: -2
[ 2984.318105] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[ 2984.318108] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 2984.318123] amdgpu 0000:05:00.0: probe with driver amdgpu failed with error -2
[ 2984.318251] pci 0000:05:00.1: D0 power state depends on 0000:05:00.0
When I power cycled the eGPU enclosure and reconnected the TB cable, it finally initialized successfully:
[ 3098.412362] thunderbolt 0-2: new device found, vendor=0x215 device=0x2
[ 3098.412373] thunderbolt 0-2: TB4 HOME TB4 eGFX
[ 3099.141023] thunderbolt 0-0:2.1: new retimer found, vendor=0x1da0 device=0x8833
[ 3099.256198] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
[ 3099.256207] pcieport 0000:00:01.1: pciehp: Slot(0): Link Up
[ 3099.381658] pci 0000:01:00.0: [8086:1576] type 01 class 0x060400 PCIe Switch Upstream Port
[ 3099.381718] pci 0000:01:00.0: PCI bridge to [bus 00]
[ 3099.381736] pci 0000:01:00.0: bridge window [io 0x0000-0x0fff]
[ 3099.381744] pci 0000:01:00.0: bridge window [mem 0x00000000-0x000fffff]
[ 3099.381763] pci 0000:01:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.381785] pci 0000:01:00.0: enabling Extended Tags
[ 3099.382022] pci 0000:01:00.0: supports D1 D2
[ 3099.382025] pci 0000:01:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 3099.382818] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 8.000 Gb/s with 2.5 GT/s PCIe x4 link)
[ 3099.383117] pci 0000:01:00.0: Adding to iommu group 29
[ 3099.383318] pcieport 0000:00:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[ 3099.384983] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.385135] pci 0000:02:01.0: [8086:1576] type 01 class 0x060400 PCIe Switch Downstream Port
[ 3099.385184] pci 0000:02:01.0: PCI bridge to [bus 00]
[ 3099.385196] pci 0000:02:01.0: bridge window [io 0x0000-0x0fff]
[ 3099.385201] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff]
[ 3099.385217] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.385238] pci 0000:02:01.0: enabling Extended Tags
[ 3099.385384] pci 0000:02:01.0: supports D1 D2
[ 3099.385385] pci 0000:02:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 3099.385591] pci 0000:02:01.0: Adding to iommu group 30
[ 3099.385787] pci 0000:01:00.0: PCI bridge to [bus 02-5f]
[ 3099.385807] pci 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.385948] pci 0000:03:00.0: [1002:1478] type 01 class 0x060400 PCIe Switch Upstream Port
[ 3099.386015] pci 0000:03:00.0: BAR 0 [mem 0x00000000-0x00003fff]
[ 3099.386028] pci 0000:03:00.0: PCI bridge to [bus 00]
[ 3099.386045] pci 0000:03:00.0: bridge window [io 0x0000-0x0fff]
[ 3099.386053] pci 0000:03:00.0: bridge window [mem 0x00000000-0x000fffff]
[ 3099.386082] pci 0000:03:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.386385] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[ 3099.386630] pci 0000:03:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 3099.386978] pci 0000:03:00.0: Adding to iommu group 30
[ 3099.388966] pci 0000:02:01.0: PCI bridge to [bus 03-5f]
[ 3099.388989] pci 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.389170] pci 0000:04:00.0: [1002:1479] type 01 class 0x060400 PCIe Switch Downstream Port
[ 3099.389243] pci 0000:04:00.0: PCI bridge to [bus 00]
[ 3099.389260] pci 0000:04:00.0: bridge window [io 0x0000-0x0fff]
[ 3099.389267] pci 0000:04:00.0: bridge window [mem 0x00000000-0x000fffff]
[ 3099.389296] pci 0000:04:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 3099.389608] pci 0000:04:00.0: PME# supported from D0 D3hot D3cold
[ 3099.390187] pci 0000:04:00.0: Adding to iommu group 30
[ 3099.390312] pci 0000:03:00.0: PCI bridge to [bus 04-5f]
[ 3099.390348] pci 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 3099.390548] pci 0000:05:00.0: [1002:731f] type 00 class 0x030000 PCIe Legacy Endpoint
[ 3099.390666] pci 0000:05:00.0: BAR 0 [mem 0x00000000-0x0fffffff 64bit pref]
[ 3099.390675] pci 0000:05:00.0: BAR 2 [mem 0x00000000-0x001fffff 64bit pref]
[ 3099.390680] pci 0000:05:00.0: BAR 4 [io 0x0000-0x00ff]
[ 3099.390684] pci 0000:05:00.0: BAR 5 [mem 0x00000000-0x0007ffff]
[ 3099.390689] pci 0000:05:00.0: ROM [mem 0x00000000-0x0001ffff pref]
[ 3099.391108] pci 0000:05:00.0: PME# supported from D1 D2 D3hot D3cold
[ 3099.391458] pci 0000:05:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 3099.391663] pci 0000:05:00.0: Adding to iommu group 30
[ 3099.391687] pci 0000:05:00.0: vgaarb: setting as boot VGA device
[ 3099.391688] pci 0000:05:00.0: vgaarb: bridge control possible
[ 3099.391689] pci 0000:05:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 3099.391791] pci 0000:05:00.1: [1002:ab38] type 00 class 0x040300 PCIe Legacy Endpoint
[ 3099.391899] pci 0000:05:00.1: BAR 0 [mem 0x00000000-0x00003fff]
[ 3099.392144] pci 0000:05:00.1: PME# supported from D1 D2 D3hot D3cold
[ 3099.392437] pci 0000:05:00.1: Adding to iommu group 30
[ 3099.392584] pci 0000:04:00.0: PCI bridge to [bus 05-5f]
[ 3099.392616] pci_bus 0000:05: busn_res: [bus 05-5f] end is updated to 05
[ 3099.392628] pci_bus 0000:04: busn_res: [bus 04-5f] end is updated to 05
[ 3099.392637] pci_bus 0000:03: busn_res: [bus 03-5f] end is updated to 5f
[ 3099.392643] pci_bus 0000:02: busn_res: [bus 02-5f] end is updated to 5f
[ 3099.392663] pci 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392666] pci 0000:01:00.0: bridge window [mem 0x98000000-0xafffffff]: assigned
[ 3099.392667] pci 0000:01:00.0: bridge window [io 0x7000-0xafff]: assigned
[ 3099.392670] pci 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392672] pci 0000:02:01.0: bridge window [mem 0x98000000-0xafffffff]: assigned
[ 3099.392673] pci 0000:02:01.0: bridge window [io 0x7000-0xafff]: assigned
[ 3099.392676] pci 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392677] pci 0000:03:00.0: bridge window [mem 0x98000000-0xafefffff]: assigned
[ 3099.392678] pci 0000:03:00.0: BAR 0 [mem 0xaff00000-0xaff03fff]: assigned
[ 3099.392686] pci 0000:03:00.0: bridge window [io 0x7000-0xafff]: assigned
[ 3099.392688] pci 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3099.392690] pci 0000:04:00.0: bridge window [mem 0x98000000-0xafefffff]: assigned
[ 3099.392691] pci 0000:04:00.0: bridge window [io 0x7000-0xafff]: assigned
[ 3099.392694] pci 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: assigned
[ 3099.392717] pci 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: assigned
[ 3099.392739] pci 0000:05:00.0: BAR 5 [mem 0x98000000-0x9807ffff]: assigned
[ 3099.392747] pci 0000:05:00.0: ROM [mem 0x98080000-0x9809ffff pref]: assigned
[ 3099.392748] pci 0000:05:00.1: BAR 0 [mem 0x980a0000-0x980a3fff]: assigned
[ 3099.392756] pci 0000:05:00.0: BAR 4 [io 0x7000-0x70ff]: assigned
[ 3099.392763] pci 0000:04:00.0: PCI bridge to [bus 05]
[ 3099.392768] pci 0000:04:00.0: bridge window [io 0x7000-0xafff]
[ 3099.392778] pci 0000:04:00.0: bridge window [mem 0x98000000-0xafefffff]
[ 3099.392785] pci 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392798] pci 0000:03:00.0: PCI bridge to [bus 04-05]
[ 3099.392802] pci 0000:03:00.0: bridge window [io 0x7000-0xafff]
[ 3099.392812] pci 0000:03:00.0: bridge window [mem 0x98000000-0xafefffff]
[ 3099.392819] pci 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392832] pci 0000:02:01.0: PCI bridge to [bus 03-5f]
[ 3099.392834] pci 0000:02:01.0: bridge window [io 0x7000-0xafff]
[ 3099.392840] pci 0000:02:01.0: bridge window [mem 0x98000000-0xafffffff]
[ 3099.392844] pci 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392851] pci 0000:01:00.0: PCI bridge to [bus 02-5f]
[ 3099.392853] pci 0000:01:00.0: bridge window [io 0x7000-0xafff]
[ 3099.392859] pci 0000:01:00.0: bridge window [mem 0x98000000-0xafffffff]
[ 3099.392869] pci 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.392889] pcieport 0000:00:01.1: PCI bridge to [bus 01-5f]
[ 3099.392891] pcieport 0000:00:01.1: bridge window [io 0x7000-0xafff]
[ 3099.392893] pcieport 0000:00:01.1: bridge window [mem 0x98000000-0xafffffff]
[ 3099.392896] pcieport 0000:00:01.1: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3099.393149] pcieport 0000:01:00.0: enabling device (0000 -> 0003)
[ 3099.393365] pcieport 0000:02:01.0: enabling device (0000 -> 0003)
[ 3099.393531] pcieport 0000:02:01.0: pciehp: Slot #1 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
[ 3099.394106] pcieport 0000:03:00.0: enabling device (0000 -> 0003)
[ 3099.394335] pcieport 0000:04:00.0: enabling device (0000 -> 0003)
[ 3099.394688] pci 0000:05:00.0: disabling ATS
[ 3099.394799] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[ 3099.394816] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[ 3099.394904] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[ 3099.394906] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[ 3101.362812] amdgpu 0000:05:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
[ 3101.362824] amdgpu 0000:05:00.0: amdgpu: detected ip block number 1 <gmc_v10_0_0> (gmc_v10_0)
[ 3101.362827] amdgpu 0000:05:00.0: amdgpu: detected ip block number 2 <ih_v5_0_0> (navi10_ih)
[ 3101.362829] amdgpu 0000:05:00.0: amdgpu: detected ip block number 3 <psp_v11_0_0> (psp)
[ 3101.362832] amdgpu 0000:05:00.0: amdgpu: detected ip block number 4 <smu_v11_0_0> (smu)
[ 3101.362834] amdgpu 0000:05:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[ 3101.362836] amdgpu 0000:05:00.0: amdgpu: detected ip block number 6 <gfx_v10_0_0> (gfx_v10_0)
[ 3101.362838] amdgpu 0000:05:00.0: amdgpu: detected ip block number 7 <sdma_v5_0_0> (sdma_v5_0)
[ 3101.362840] amdgpu 0000:05:00.0: amdgpu: detected ip block number 8 <vcn_v2_0_0> (vcn_v2_0)
[ 3101.362842] amdgpu 0000:05:00.0: amdgpu: detected ip block number 9 <jpeg_v2_0_0> (jpeg_v2_0)
[ 3101.362881] amdgpu 0000:05:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[ 3101.493791] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 3101.493799] amdgpu: ATOM BIOS: xxx-xxx-xxx
[ 3101.496058] amdgpu 0000:05:00.0: vgaarb: deactivate vga console
[ 3101.496063] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 3101.496081] amdgpu 0000:05:00.0: amdgpu: PCIE atomic ops is not supported
[ 3101.496087] amdgpu 0000:05:00.0: amdgpu: GPU posting now...
[ 3101.496185] amdgpu 0000:05:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 3101.496216] amdgpu 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: releasing
[ 3101.496221] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: releasing
[ 3101.496243] pcieport 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496245] pcieport 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496246] pcieport 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496248] pcieport 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496249] pcieport 0000:00:01.1: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: releasing
[ 3101.496261] pcieport 0000:00:01.1: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496264] pcieport 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496266] pcieport 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496268] pcieport 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496269] pcieport 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[ 3101.496271] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x39ffffffff 64bit pref]: assigned
[ 3101.496287] amdgpu 0000:05:00.0: BAR 2 [mem 0x3a00000000-0x3a001fffff 64bit pref]: assigned
[ 3101.496303] pcieport 0000:00:01.1: PCI bridge to [bus 01-5f]
[ 3101.496305] pcieport 0000:00:01.1: bridge window [io 0x7000-0xafff]
[ 3101.496308] pcieport 0000:00:01.1: bridge window [mem 0x98000000-0xafffffff]
[ 3101.496311] pcieport 0000:00:01.1: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496314] pcieport 0000:01:00.0: PCI bridge to [bus 02-5f]
[ 3101.496316] pcieport 0000:01:00.0: bridge window [io 0x7000-0xafff]
[ 3101.496321] pcieport 0000:01:00.0: bridge window [mem 0x98000000-0xafffffff]
[ 3101.496325] pcieport 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496331] pcieport 0000:02:01.0: PCI bridge to [bus 03-5f]
[ 3101.496333] pcieport 0000:02:01.0: bridge window [io 0x7000-0xafff]
[ 3101.496338] pcieport 0000:02:01.0: bridge window [mem 0x98000000-0xafffffff]
[ 3101.496342] pcieport 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496348] pcieport 0000:03:00.0: PCI bridge to [bus 04-05]
[ 3101.496351] pcieport 0000:03:00.0: bridge window [io 0x7000-0xafff]
[ 3101.496358] pcieport 0000:03:00.0: bridge window [mem 0x98000000-0xafefffff]
[ 3101.496363] pcieport 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496372] pcieport 0000:04:00.0: PCI bridge to [bus 05]
[ 3101.496375] pcieport 0000:04:00.0: bridge window [io 0x7000-0xafff]
[ 3101.496382] pcieport 0000:04:00.0: bridge window [mem 0x98000000-0xafefffff]
[ 3101.496387] pcieport 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[ 3101.496401] amdgpu 0000:05:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 3101.496404] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 3101.496425] [drm] Detected VRAM RAM=8176M, BAR=8192M
[ 3101.496426] [drm] RAM width 256bits GDDR6
[ 3101.496567] amdgpu 0000:05:00.0: amdgpu: amdgpu: 8176M of VRAM memory ready
[ 3101.496570] amdgpu 0000:05:00.0: amdgpu: amdgpu: 31787M of GTT memory ready.
[ 3101.496601] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 3101.496776] [drm] PCIE GART of 512M enabled (table at 0x00000081FEE00000).
[ 3101.498032] amdgpu 0000:05:00.0: amdgpu: [VCN instance 0] Found VCN firmware Version ENC: 1.21 DEC: 7 VEP: 0 Revision: 2
[ 3101.553873] amdgpu 0000:05:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 3101.597877] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 3101.603756] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 3101.603758] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 3101.603853] amdgpu 0000:05:00.0: amdgpu: use vbios provided pptable
[ 3101.603856] amdgpu 0000:05:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 3101.640541] amdgpu 0000:05:00.0: amdgpu: SMU is initialized successfully!
[ 3101.640967] amdgpu 0000:05:00.0: amdgpu: [drm] Display Core v3.2.351 initialized on DCN 2.0
[ 3101.640970] amdgpu 0000:05:00.0: amdgpu: [drm] DP-HDMI FRL PCON supported
[ 3101.648040] amdgpu 0000:05:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[ 3101.686201] amdgpu: HMM registered 8176MB device memory
[ 3102.188920] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 3102.189044] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 3102.189120] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 3102.693189] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 3102.693576] amdgpu: Virtual CRAT table created for GPU
[ 3102.694375] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[ 3102.694382] kfd kfd: amdgpu: added device 1002:731f
[ 3102.694416] amdgpu 0000:05:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 40
[ 3102.694423] amdgpu 0000:05:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 3102.694426] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 3102.694428] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 3102.694429] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 3102.694430] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 3102.694431] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 3102.694433] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 3102.694434] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 3102.694435] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 3102.694436] amdgpu 0000:05:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 3102.694437] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 3102.694439] amdgpu 0000:05:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 3102.694440] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 3102.694441] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 3102.694442] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 3102.694443] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 3102.698096] amdgpu 0000:05:00.0: amdgpu: Using BOCO for runtime pm
[ 3102.700204] amdgpu 0000:05:00.0: [drm] Registered 6 planes with drm panic
[ 3102.700208] [drm] Initialized amdgpu 3.64.0 for 0000:05:00.0 on minor 2
[ 3102.703170] amdgpu 0000:05:00.0: [drm] Cannot find any crtc or sizes
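The three dmesg excerpts above (unplug, failed reconnect, successful reconnect after a power cycle) can be told apart by a handful of marker lines. The sketch below reads kernel log text on stdin and reports the last marker it saw. The function name amdgpu_last_state is hypothetical, and the matched strings are copied from the logs in this post, so other kernel versions may word them differently.

```shell
# Read kernel log text on stdin and print the last amdgpu state marker seen:
# lost, probe-failed, initialized, or unknown if none of them appears.
amdgpu_last_state() {
    awk '
        /amdgpu: device lost from bus!/   { state = "lost" }
        /probe with driver amdgpu failed/ { state = "probe-failed" }
        /\[drm\] Initialized amdgpu /     { state = "initialized" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

Typical use would be `sudo dmesg | amdgpu_last_state` right after replugging the cable.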
However, the thing is… now I am unable to get ollama to use ROCm.
Dec 18 14:10:28 fw13 ollama[9888]: time=2025-12-18T14:10:28.997+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed"
rocminfo initially showed this:
korvin@fw13:~$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
After I power cycled the eGPU, it started doing this:
korvin@fw13:~$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument
korvin is member of render group
OLLAMA_VULKAN=1 mode works, though. So… partial success?
Can you please check /sys/class/kfd after you unplug to see if the eGPU node is still present? It should have removed one entry.
Indeed, that’s exactly what happens.
After clean boot I have two entries in /sys/class/kfd/topology/nodes/:
0 with gpu_id 0
1 with gpu_id 35022, which is probably the eGPU
When I yank the cable and plug it back in, node 2 appears, again with gpu_id 35022. Every time I plug the eGPU in, a new node appears.
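The stale-node symptom can be checked mechanically by listing every topology node with its gpu_id and flagging ids that show up more than once. This is a sketch only: the function names are hypothetical, and the default path assumes the topology is exposed under /sys/class/kfd/kfd/topology/nodes, so pass the directory explicitly if it lives elsewhere on your system.

```shell
# Print "<node> <gpu_id>" for every KFD topology node under the given directory.
# Usage: kfd_nodes [nodes_dir]
kfd_nodes() {
    nodes_dir=${1:-/sys/class/kfd/kfd/topology/nodes}
    for node in "$nodes_dir"/*; do
        [ -r "$node/gpu_id" ] || continue
        echo "${node##*/} $(cat "$node/gpu_id")"
    done
}

# Print each non-zero gpu_id that is claimed by more than one node.
# A non-empty result suggests a node was left behind by an earlier unplug.
kfd_duplicate_gpu_ids() {
    kfd_nodes "$@" | awk '$2 != 0 { seen[$2]++ } END { for (id in seen) if (seen[id] > 1) print id }'
}
```

Running kfd_duplicate_gpu_ids before and after a replug shows whether the old node was cleaned up.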
I noticed that rocminfo succeeds before the disconnect and then fails like this:
korvin@fw13:/sys/class/kfd$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Apparently it accesses node 1 which is now defunct.
P.S.: Just in case, here’s the clean rocminfo output:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.15
Runtime Ext Version: 1.7
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Uuid: CPU-XX
Marketing Name: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5157
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65101096(0x3e15d28) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 65101096(0x3e15d28) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65101096(0x3e15d28) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65101096(0x3e15d28) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1150
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5390(0x150e)
ASIC Revision: 4(0x4)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2900
BDFID: 49408
Internal Node ID: 1
Compute Unit: 16
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 31
SDMA engine uCode:: 14
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32550548(0x1f0ae94) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32550548(0x1f0ae94) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1150
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
OK - I think what might be going on is that with a “surprise hotplug” we never get a chance to clean up the kfd software nodes.
See if this helps:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4c8b03b6bf5..f40a83be2cac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5263,6 +5263,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);
+ /* surprise hotplug */
+ if (pci_dev_is_disconnected(adev->pdev))
+ amdgpu_amdkfd_device_fini_sw(adev);
}
void amdgpu_device_fini_sw(struct amdgpu_device *adev)
Quite a quest here.
I am trying to patch and build the amdgpu module for my kernel. However, the kernel requires gcc-15, which is not present in my Kubuntu 24.04, so I have to build the compiler from source.
To save time I am trying to patch and build only one module instead of building the whole kernel. Not sure if it will work.
In the end I managed to build the kernel module successfully, but when I try to make modules_install it fails:
korvin@fw13:~/work/kernel/mainline-crack/drivers/gpu/drm/amd/amdgpu$ sudo make -C /lib/modules/$(uname -r)/build M=$PWD modules_install
[sudo] password for korvin:
make: Entering directory '/usr/src/linux-headers-6.18.0-061800-generic'
make[1]: Entering directory '/home/korvin/work/kernel/mainline-crack/drivers/gpu/drm/amd/amdgpu'
INSTALL /lib/modules/6.18.0-061800-generic/updates/amdgpu.ko
SIGN /lib/modules/6.18.0-061800-generic/updates/amdgpu.ko
At main.c:171:
- SSL error:FFFFFFFF80000002:system library::No such file or directory: ../crypto/bio/bss_file.c:67
- SSL error:10000080:BIO routines::no such file: ../crypto/bio/bss_file.c:75
sign-file: /usr/src/linux-headers-6.18.0-061800-generic/certs/signing_key.pem
DEPMOD /lib/modules/6.18.0-061800-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[1]: Leaving directory '/home/korvin/work/kernel/mainline-crack/drivers/gpu/drm/amd/amdgpu'
make: Leaving directory '/usr/src/linux-headers-6.18.0-061800-generic'
Could you give me a hint on what to do next, please?
P.S.: The last time I built a kernel myself was quite a while ago, in the era of kernels 2.4.18 and 2.5.65. A lot has changed since then.
I wouldn’t build using the Ubuntu packaging. It’s a pain in the butt to get right. Upstream kernel source has Deb packaging you can use.
Clone the mainline or stable tree.
Apply patch
Add the kernel config in place. If you use the one from your current kernel you’ll need to turn off some of the cert options.
Then build using make bindeb-pkg -j$(nproc).
You’ll get a deb to install.
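The “cert options” step is not spelled out above. Here is a minimal sketch, assuming the options meant are `CONFIG_SYSTEM_TRUSTED_KEYS` and `CONFIG_SYSTEM_REVOCATION_KEYS`, which distro configs commonly point at certificate files that do not exist in an upstream tree. The helper name `blank_cert_options` and the sample values are hypothetical:

```python
import re

# Assumption: these are the "cert options" referred to above. Distro configs
# often point them at .pem files that are absent from an upstream tree.
CERT_OPTIONS = ("CONFIG_SYSTEM_TRUSTED_KEYS", "CONFIG_SYSTEM_REVOCATION_KEYS")


def blank_cert_options(config_text: str) -> str:
    """Return config_text with each cert option set to an empty string."""
    out = []
    for line in config_text.splitlines():
        for opt in CERT_OPTIONS:
            if re.match(rf"^{opt}=", line):
                line = f'{opt}=""'
                break
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical sample; a real .config has thousands of lines.
    sample = (
        "CONFIG_MODULE_SIG=y\n"
        'CONFIG_SYSTEM_TRUSTED_KEYS="some/distro-certs.pem"\n'
        'CONFIG_SYSTEM_REVOCATION_KEYS="some/distro-revoked.pem"\n'
    )
    print(blank_cert_options(sample), end="")
```

The kernel tree's own `scripts/config` helper can likely make the same change in place. The Python version is only meant to show which lines are affected.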
Any progress with testing that?
Hi Mario,
I tried doing it that day, but failed to build the kernel due to a missing dwarf.h. Every tutorial I’ve seen mentions libdwarf-dev, but it was already installed and the issue remained. Then I was busy with my discrete ML research.
In the meantime I used OLLAMA_VULKAN, which worked decently on my setup, apart from very slow load times on the iGPU.
Finally, today I found libdw-dev, which apparently fixes the missing header issue. I will try testing your patch today.
Ok, I’ve just built vanilla 6.18.0 with your patch applied.
After a clean boot with no eGPU attached:
korvin@fw13:~$ uname -r
6.18.0
$ ls /sys/class/kfd/kfd/topology/nodes/
0 1
Here’s ROCm output:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.15
Runtime Ext Version: 1.7
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Uuid: CPU-XX
Marketing Name: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5157
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 49112696(0x2ed6678) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 49112696(0x2ed6678) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 49112696(0x2ed6678) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 49112696(0x2ed6678) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1150
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5390(0x150e)
ASIC Revision: 4(0x4)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2900
BDFID: 49408
Internal Node ID: 1
Compute Unit: 16
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 31
SDMA engine uCode:: 14
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 24556348(0x176b33c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 24556348(0x176b33c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1150
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Relevant ollama log:
янв 04 15:39:29 fw13 ollama[6588]: time=2026-01-04T15:39:29.029+05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
янв 04 15:39:29 fw13 ollama[6588]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: found 1 ROCm devices:
янв 04 15:39:29 fw13 ollama[6588]: ggml_cuda_init: initializing rocBLAS on device 0
янв 04 15:39:29 fw13 ollama[6588]: rocBLAS error: Cannot read /usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1150
янв 04 15:39:29 fw13 ollama[6588]: List of available TensileLibrary Files :
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx908.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1151.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx942.dat"
янв 04 15:39:29 fw13 ollama[6588]: "/usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" devices=[]
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=1.02787478s OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=DEBUG source=runner.go:153 msg="filtering device which didn't fully initialize" id=0 libdir=/usr/local/lib/ollama/rocm pci_id=0000:c1:00.0 library=ROCm
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=TRACE source=runner.go:174 msg="supported GPU library combinations before filtering" supported=map[]
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=TRACE source=runner.go:183 msg="removing unsupported or overlapping GPU combination" libDir=/usr/local/lib/ollama/rocm description="AMD Radeon Graphics" compute=gfx1150 pci_id=0000:c1:00.0
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=2.274042056s
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=INFO source=types.go:60 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="46.8 GiB" available="40.1 GiB"
янв 04 15:39:30 fw13 ollama[6588]: time=2026-01-04T15:39:30.029+05:00 level=INFO source=routes.go:1648 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
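In the log above, rocBLAS reports a read failure for gfx1150, and no gfx1150 entry appears among the listed TensileLibrary files (gfx1151 does). A minimal sketch for checking this automatically in a saved log follows. The function names are hypothetical, and the regular expressions assume the message format shown above:

```python
import re
from typing import Optional, Set

# Matches the per-architecture file names printed by rocBLAS in the log above.
TENSILE_RE = re.compile(r"TensileLibrary_lazy_(gfx[0-9a-f]+)\.dat")
# Matches the architecture named in the rocBLAS error line.
ARCH_RE = re.compile(r"for GPU arch : (gfx[0-9a-f]+)")


def tensile_archs(log_text: str) -> Set[str]:
    """Collect the GPU architectures that have a TensileLibrary file listed."""
    return set(TENSILE_RE.findall(log_text))


def missing_arch(log_text: str) -> Optional[str]:
    """Return the arch rocBLAS complained about if no file is listed for it."""
    m = ARCH_RE.search(log_text)
    if m and m.group(1) not in tensile_archs(log_text):
        return m.group(1)
    return None


if __name__ == "__main__":
    import sys

    # Usage: save the journal output to a file and pipe it in on stdin.
    arch = missing_arch(sys.stdin.read())
    print(f"no TensileLibrary file listed for {arch}" if arch else "nothing missing")
```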
Now I attach the eGPU’s TB4 cable:
[ 663.694730] amdgpu: HMM registered 8176MB device memory
[ 664.195196] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 664.195930] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 664.196007] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 664.699497] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 664.699888] amdgpu: Virtual CRAT table created for GPU
[ 664.700462] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[ 664.700470] kfd kfd: amdgpu: added device 1002:731f
...
korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0 1 2
korvin@fw13:~$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
After I restarted ollama, the eGPU was recognized and I was able to run the qwen3:8b model on it.
Now I yank the TB4 cable:
[ 916.452198] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
[ 916.452206] thunderbolt 0-0:2.1: retimer disconnected
[ 916.452213] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[ 916.453612] thunderbolt 0-2: device disconnected
[ 916.514923] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[ 916.514945] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[ 916.619952] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[ 916.743173] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 916.743185] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[ 916.743194] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[ 916.743279] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 916.743281] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
...
[ 917.282084] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[ 917.334190] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[ 917.605625] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[ 917.611390] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 918.216273] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 918.517956] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[ 918.518720] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[ 918.518723] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[ 918.518730] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[ 918.518819] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[ 918.518824] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 918.518911] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[ 918.518914] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 918.519001] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[ 918.519003] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 918.519089] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000000343f0c4; ring_buffer_end = 0000000016e17b8b; write_frame = 00000000b9667d0c
[ 918.519092] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[ 918.842104] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[ 918.842111] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[ 919.591291] pci_bus 0000:05: busn_res: [bus 05] is released
[ 919.591409] pci_bus 0000:04: busn_res: [bus 04-05] is released
[ 919.591519] pci_bus 0000:03: busn_res: [bus 03-5f] is released
[ 919.592608] pci_bus 0000:02: busn_res: [bus 02-5f] is released
Node 2 was removed successfully!
korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0 1
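The before/after comparison of the topology directory can be scripted. This is a minimal sketch that reads the same sysfs path used in the `ls` commands above; the helper names `kfd_node_ids` and `leaked_nodes` are hypothetical:

```python
from pathlib import Path
from typing import List

# Same directory that is listed with `ls` above.
KFD_NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def kfd_node_ids(nodes_dir: Path = KFD_NODES) -> List[int]:
    """Return the sorted numeric node IDs present under nodes_dir."""
    if not nodes_dir.is_dir():
        return []
    return sorted(int(p.name) for p in nodes_dir.iterdir() if p.name.isdigit())


def leaked_nodes(before_plug: List[int], after_unplug: List[int]) -> List[int]:
    """Node IDs still present after unplug that were absent before plugging in."""
    return sorted(set(after_unplug) - set(before_plug))


if __name__ == "__main__":
    # Record this before attaching the eGPU and again after unplugging it,
    # then pass both lists to leaked_nodes().
    print(kfd_node_ids())
```

With the values from this thread, `leaked_nodes([0, 1], [0, 1])` is empty on the patched kernel, while a leftover node 2 would show up as `[2]`.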
However, rocminfo still does not work:
korvin@fw13:~$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument
korvin is member of render group
I reconnected the TB4 cable and the eGPU was apparently initialized successfully, with no errors in dmesg:
[ 1044.936694] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[ 1044.936717] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[ 1044.937008] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[ 1044.937011] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[ 1046.902212] amdgpu 0000:05:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
[ 1046.902227] amdgpu 0000:05:00.0: amdgpu: detected ip block number 1 <gmc_v10_0_0> (gmc_v10_0)
[ 1046.902231] amdgpu 0000:05:00.0: amdgpu: detected ip block number 2 <ih_v5_0_0> (navi10_ih)
[ 1046.902234] amdgpu 0000:05:00.0: amdgpu: detected ip block number 3 <psp_v11_0_0> (psp)
[ 1046.902237] amdgpu 0000:05:00.0: amdgpu: detected ip block number 4 <smu_v11_0_0> (smu)
[ 1046.902240] amdgpu 0000:05:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[ 1046.902243] amdgpu 0000:05:00.0: amdgpu: detected ip block number 6 <gfx_v10_0_0> (gfx_v10_0)
[ 1046.902246] amdgpu 0000:05:00.0: amdgpu: detected ip block number 7 <sdma_v5_0_0> (sdma_v5_0)
[ 1046.902249] amdgpu 0000:05:00.0: amdgpu: detected ip block number 8 <vcn_v2_0_0> (vcn_v2_0)
[ 1046.902251] amdgpu 0000:05:00.0: amdgpu: detected ip block number 9 <jpeg_v2_0_0> (jpeg_v2_0)
[ 1046.902284] amdgpu 0000:05:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[ 1047.034560] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 1047.034569] amdgpu: ATOM BIOS: xxx-xxx-xxx
[ 1047.038291] amdgpu 0000:05:00.0: vgaarb: deactivate vga console
[ 1047.038296] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 1047.038314] amdgpu 0000:05:00.0: amdgpu: PCIE atomic ops is not supported
[ 1047.038321] amdgpu 0000:05:00.0: amdgpu: GPU posting now...
[ 1047.038439] amdgpu 0000:05:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 1047.038466] amdgpu 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: releasing
[ 1047.038472] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: releasing
...
[ 1047.038539] amdgpu 0000:05:00.0: BAR 0 [mem 0x3800000000-0x39ffffffff 64bit pref]: assigned
[ 1047.038558] amdgpu 0000:05:00.0: BAR 2 [mem 0x3a00000000-0x3a001fffff 64bit pref]: assigned
...
[ 1047.038703] amdgpu 0000:05:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 1047.038707] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 1047.038731] [drm] Detected VRAM RAM=8176M, BAR=8192M
[ 1047.038733] [drm] RAM width 256bits GDDR6
[ 1047.038912] amdgpu 0000:05:00.0: amdgpu: amdgpu: 8176M of VRAM memory ready
[ 1047.038915] amdgpu 0000:05:00.0: amdgpu: amdgpu: 23980M of GTT memory ready.
[ 1047.038958] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 1047.039204] [drm] PCIE GART of 512M enabled (table at 0x00000081FEE00000).
[ 1047.040824] amdgpu 0000:05:00.0: amdgpu: [VCN instance 0] Found VCN firmware Version ENC: 1.21 DEC: 7 VEP: 0 Revision: 2
[ 1047.097013] amdgpu 0000:05:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 1047.141891] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 1047.147893] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1047.147896] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 1047.147989] amdgpu 0000:05:00.0: amdgpu: use vbios provided pptable
[ 1047.147992] amdgpu 0000:05:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 1047.184943] amdgpu 0000:05:00.0: amdgpu: SMU is initialized successfully!
[ 1047.185661] amdgpu 0000:05:00.0: amdgpu: [drm] Display Core v3.2.351 initialized on DCN 2.0
[ 1047.185665] amdgpu 0000:05:00.0: amdgpu: [drm] DP-HDMI FRL PCON supported
[ 1047.193038] amdgpu 0000:05:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[ 1047.242675] amdgpu: HMM registered 8176MB device memory
[ 1047.746506] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 1047.746892] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 1047.746973] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 1048.250264] amdgpu 0000:05:00.0: amdgpu: Fence fallback timer expired on ring sdma0
[ 1048.250668] amdgpu: Virtual CRAT table created for GPU
[ 1048.251169] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[ 1048.251177] kfd kfd: amdgpu: added device 1002:731f
[ 1048.251267] amdgpu 0000:05:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 40
[ 1048.251276] amdgpu 0000:05:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1048.251280] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1048.251282] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1048.251285] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 1048.251287] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 1048.251289] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 1048.251290] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 1048.251292] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 1048.251293] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 1048.251295] amdgpu 0000:05:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 1048.251297] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1048.251299] amdgpu 0000:05:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 1048.251300] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 1048.251302] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 1048.251304] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 1048.251305] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 1048.255110] amdgpu 0000:05:00.0: amdgpu: Using BOCO for runtime pm
[ 1048.257697] amdgpu 0000:05:00.0: [drm] Registered 6 planes with drm panic
[ 1048.257704] [drm] Initialized amdgpu 3.64.0 for 0000:05:00.0 on minor 2
[ 1048.263330] amdgpu 0000:05:00.0: [drm] Cannot find any crtc or sizes
[ 1048.263726] pci 0000:05:00.1: D0 power state depends on 0000:05:00.0
However, after I restarted ollama it failed to pick up the eGPU:
янв 04 15:54:10 fw13 ollama[9976]: time=2026-01-04T15:54:10.341+05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
янв 04 15:54:10 fw13 ollama[9976]: ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
янв 04 15:54:10 fw13 ollama[9976]: load_backend: loaded ROCm backend from /usr/local/lib/ollama/rocm/libggml-hip.so
rocminfo still does not work:
korvin@fw13:~$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument
korvin is member of render group
So, in the end, reconnecting still fails, but the sysfs node IDs no longer leak.
Ok that’s great progress at least. Would you be able to trace where the EINVAL is happening from the ioctl call?
Hm, that’s strange.
I ran strace rocminfo; the only calls that return EINVAL are:
openat(AT_FDCWD, "/dev/kfd", O_RDWR) = -1 EINVAL (Invalid argument)
write(1, "\33[31mUnable to open /dev/kfd rea"..., 62Unable to open /dev/kfd read-write: Invalid argument
...
prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE) = 1
prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, CAP_CHECKPOINT_RESTORE) = 1
prctl(PR_CAPBSET_READ, 0x2c /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x2a /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x29 /* CAP_??? */) = -1 EINVAL (Invalid argument)
munmap(0x708ba1dc6000, 114647) = 0
The file itself exists and is writable:
korvin@fw13:~$ ls -l /dev/kfd
crw-rw---- 1 root render 509, 0 янв 4 19:52 /dev/kfd
AppArmor is now disabled.
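To repeat the open check without running all of rocminfo, here is a minimal sketch that attempts the same read-write open seen in the strace output and reports the errno name. The helper name `probe_open` is hypothetical. An `EACCES` result would suggest a permissions problem, whereas the `EINVAL` seen here comes from further down the open path:

```python
import errno
import os


def probe_open(path: str = "/dev/kfd") -> str:
    """Try to open path read-write and return 'ok' or the errno name."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 22 -> "EINVAL".
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    print(probe_open())
```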