I redid the test to be sure. This time I just yanked the cable and waited (didn’t attempt to reconnect).
After the disconnect, the ollama process hung for ~2 minutes, then crashed due to a general protection fault in the same manner as before.
So it definitely wasn’t killed by your patch.
Also, I noticed that right after the disconnect one of the CPU cores was 100% busy for about 10 seconds.
Hm… it looks even weirder. The kernel reported a protection fault, but apparently the process remained.
korvin@fw13:~$ ps -A | grep ollama
2556 ? 00:00:00 ollama # This process was set up by the system
# Here I connect eGPU
korvin@fw13:~$ sudo service ollama restart # To pick up newly connected device
korvin@fw13:~$ ps -A | grep ollama
4521 ? 00:00:00 ollama # The new process
# Just to check if it works
korvin@fw13:~$ ollama run qwen3:8b
>>> hello world
Thinking...
Okay, the user said "hello world". That's a common starting point for programming, especially in the
context of learning a new language. But since^C
>>>
korvin@fw13:~$ ps -A | grep ollama
4521 ? 00:00:00 ollama
4667 ? 00:00:43 ollama
# Here I yank the cable.
# These two processes remained the same for several minutes even after the crash
korvin@fw13:~$ ps -A | grep ollama
4521 ? 00:00:00 ollama
4667 ? 00:00:43 ollama
However, the ollama log shows the following after the crash:
Jan 13 21:28:53 fw13 ollama[4521]: time=2026-01-13T21:28:53.356+05:00 level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
Jan 13 21:28:53 fw13 ollama[4521]: time=2026-01-13T21:28:53.357+05:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
Jan 13 21:28:55 fw13 ollama[4521]: time=2026-01-13T21:28:55.875+05:00 level=INFO source=server.go:1376 msg="llama runner started in 4.09 seconds"
Jan 13 21:28:55 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:28:55 | 200 | 5.582143281s | 127.0.0.1 | POST "/api/generate"
Jan 13 21:29:00 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:29:00 | 200 | 1.106192866s | 127.0.0.1 | POST "/api/chat"
*** Here I yank the cable ***
Jan 13 21:34:00 fw13 ollama[4521]: ggml_hip_get_device_memory searching for device 0000:05:00.0
Jan 13 21:34:00 fw13 ollama[4521]: ggml_hip_get_device_memory unable to find matching device
Jan 13 21:34:03 fw13 ollama[4521]: time=2026-01-13T21:34:03.599+05:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42439"
Jan 13 21:34:03 fw13 ollama[4521]: time=2026-01-13T21:34:03.758+05:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.346+05:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 41533"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.347+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs=map[ROCR_VISIBLE_DEVICES:1] error="failed to finish discovery before timeout"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.347+05:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.347+05:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 40101"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.348+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs=map[ROCR_VISIBLE_DEVICES:1] error="failed to finish discovery before timeout"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.348+05:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
Jan 13 21:36:34 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:36:34 | 200 | 1.890426ms | 127.0.0.1 | GET "/api/tags"
Jan 13 21:36:34 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:36:34 | 200 | 84.57µs | 127.0.0.1 | GET "/api/ps"
Note that PID 4521 remains. It looks like it stalled for some time, then resumed logging.
However, that does not help at all. Even after I yank the cable, lspci keeps showing the same entry, so device enumeration was totally broken at that point. Apparently the device wasn’t removed the very first time I disconnected the cable and just stayed like that.
It sounds like the PCIe hotplug thread got totally wedged. That roughly matches the warning you saw above, where it was trying to release resources.
[ 369.859595] INFO: task irq/34-pciehp:200 blocked for more than 122 seconds.
This is getting deep into MM territory; it’s not immediately clear whether it was caused by my patch or is an existing issue. I’ll wait for code review on the patch to get more comments. In the meantime, I’d say make sure you stop ollama before unplugging.
I see. Let’s hope it gets resolved eventually. Thank you anyway!
P.S.: Just speculation on my part, but could it be that the pdd->dev->dqm->ops.process_termination call in your patch triggers normal resource deallocation, which in turn tries to acquire the already-locked kfd_processes_srcu?