I redid the test to be sure. This time I just yanked the cable and waited (didn’t attempt to reconnect).
After the disconnect, the ollama process hung for ~2 minutes, then crashed due to a general protection fault in the same manner as before.
So it definitely wasn’t killed by your patch.
Also, I noticed that right after the disconnect one of the CPU cores was 100% busy for about 10 seconds.
Hm… it looks even weirder. The kernel reported a protection fault, but apparently the process remained.
korvin@fw13:~$ ps -A | grep ollama
2556 ? 00:00:00 ollama # This process was set up by the system
# Here I connect eGPU
korvin@fw13:~$ sudo service ollama restart # To pick up newly connected device
korvin@fw13:~$ ps -A | grep ollama
4521 ? 00:00:00 ollama # The new process
# Just to check if it works
korvin@fw13:~$ ollama run qwen3:8b
>>> hello world
Thinking...
Okay, the user said "hello world". That's a common starting point for programming, especially in the
context of learning a new language. But since^C
>>>
korvin@fw13:~$ ps -A | grep ollama
4521 ? 00:00:00 ollama
4667 ? 00:00:43 ollama
# Here I yank the cable.
# These two processes remained the same for several minutes even after the crash
korvin@fw13:~$ ps -A | grep ollama
4521 ? 00:00:00 ollama
4667 ? 00:00:43 ollama
However, the ollama log shows the following after the crash:
Jan 13 21:28:53 fw13 ollama[4521]: time=2026-01-13T21:28:53.356+05:00 level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
Jan 13 21:28:53 fw13 ollama[4521]: time=2026-01-13T21:28:53.357+05:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
Jan 13 21:28:55 fw13 ollama[4521]: time=2026-01-13T21:28:55.875+05:00 level=INFO source=server.go:1376 msg="llama runner started in 4.09 seconds"
Jan 13 21:28:55 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:28:55 | 200 | 5.582143281s | 127.0.0.1 | POST "/api/generate"
Jan 13 21:29:00 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:29:00 | 200 | 1.106192866s | 127.0.0.1 | POST "/api/chat"
*** Here I yank the cable ***
Jan 13 21:34:00 fw13 ollama[4521]: ggml_hip_get_device_memory searching for device 0000:05:00.0
Jan 13 21:34:00 fw13 ollama[4521]: ggml_hip_get_device_memory unable to find matching device
Jan 13 21:34:03 fw13 ollama[4521]: time=2026-01-13T21:34:03.599+05:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42439"
Jan 13 21:34:03 fw13 ollama[4521]: time=2026-01-13T21:34:03.758+05:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.346+05:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 41533"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.347+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs=map[ROCR_VISIBLE_DEVICES:1] error="failed to finish discovery before timeout"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.347+05:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.347+05:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 40101"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.348+05:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/rocm]" extra_envs=map[ROCR_VISIBLE_DEVICES:1] error="failed to finish discovery before timeout"
Jan 13 21:34:05 fw13 ollama[4521]: time=2026-01-13T21:34:05.348+05:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
Jan 13 21:36:34 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:36:34 | 200 | 1.890426ms | 127.0.0.1 | GET "/api/tags"
Jan 13 21:36:34 fw13 ollama[4521]: [GIN] 2026/01/13 - 21:36:34 | 200 | 84.57µs | 127.0.0.1 | GET "/api/ps"
Note that PID 4521 remains. It looks like it stalled for some time, then resumed logging.
However, that does not help at all. Even after I yank the cable, lspci keeps showing the same entry, so device enumeration was totally broken at that point. Apparently the device wasn’t removed the very first time I disconnected the cable and just stayed like that.
It sounds like the PCIe hotplug thread got totally wedged. That roughly matches the warning you saw above, where it was trying to release resources.
[ 369.859595] INFO: task irq/34-pciehp:200 blocked for more than 122 seconds.
This is getting deep into MM territory; it’s not immediately clear whether it was caused by my patch or is an existing issue. I’ll wait for code review on the patch to get more comments. In the meantime, I’d say make sure you stop ollama before unplugging.
I see. Let’s hope it gets resolved eventually. Thank you anyway!
P.S.: Just speculation on my part, but could it be that the pdd->dev->dqm->ops.process_termination call in your patch triggers normal resource deallocation, which in turn tries to acquire the already-locked kfd_processes_srcu?