Ollama - Model Runner Unexpectedly Stopped (GPU Hang)

Anyone else experiencing issues running Ollama on Ubuntu? I'm sporadically getting the following error when running models of various sizes:

Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details
Oct 01 20:23:05 llama ollama[2180]: HW Exception by GPU node-1 (Agent handle: 0x723f9c692ba0) reason :GPU Hang

FWIW, I’m using Ubuntu 24.04 with the HWE kernel, and I installed the ROCm drivers via AMD’s guide:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.3 LTS
Release:	24.04
Codename:	noble

$ uname -a
Linux llama 6.14.0-33-generic #33~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 19 17:02:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep rocm-core
ii  rocm-core                             7.0.1.70001-42~24.04                    amd64        ROCm Runtime software stack

Can you attach a full “sudo dmesg” output? Then we can see what sort of failure it was.
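If it helps, here's a quick way to pull just the GPU-related lines out of the kernel log, plus Ollama's own server logs (the grep pattern is just my suggestion, and the `ollama` systemd unit name assumes the standard install script was used):

```shell
# Kernel messages related to the amdgpu driver / GPU resets.
# (sudo is usually required to read the kernel ring buffer.)
sudo dmesg --ctime | grep -iE 'amdgpu|kfd|gpu (hang|reset)'

# Ollama's server logs for the current boot, if it runs as a systemd service:
journalctl -u ollama -b --no-pager | tail -n 100
```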

Here’s what I got:

[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: Failed to evict queue 1
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: Failed to evict process queues
[Sat Oct 11 22:19:13 2025] amdgpu: Failed to quiesce KFD
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: GPU reset begin!
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: Dumping IP State
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: Dumping IP State Completed
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: MODE2 reset
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: GPU reset succeeded, trying to resume
[Sat Oct 11 22:19:13 2025] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: SMU is resuming...
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: SMU is resumed successfully!
[Sat Oct 11 22:19:13 2025] amdgpu: Freeing queue vital buffer 0x721f2c000000, queue evicted
[Sat Oct 11 22:19:13 2025] amdgpu: Freeing queue vital buffer 0x721f32c00000, queue evicted
[Sat Oct 11 22:19:13 2025] amdgpu: Freeing queue vital buffer 0x721f4b200000, queue evicted
[Sat Oct 11 22:19:13 2025] amdgpu: Freeing queue vital buffer 0x721f69400000, queue evicted
[Sat Oct 11 22:19:13 2025] amdgpu: Freeing queue vital buffer 0x721f6aa00000, queue evicted
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09002600
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 4 on hub 8
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 6 on hub 8
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: ring vpe uses VM inv eng 7 on hub 8
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: amdgpu: GPU reset(88) succeeded!
[Sat Oct 11 22:19:13 2025] amdgpu 0000:c2:00.0: [drm] device wedged, but recovered through reset

I also have the coredump from /sys/class/drm/card1/device/devcoredump/data if that helps. I’d have to throw it in Google Drive or something.
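In case anyone else wants to grab theirs, something like this should copy the dump out and compress it for upload (card1 matches the dmesg above; the node is root-only, and note that writing anything *to* it discards the dump):

```shell
# Copy the device coredump somewhere readable, then compress it.
sudo sh -c 'cat /sys/class/drm/card1/device/devcoredump/data > /tmp/amdgpu-coredump.bin'
gzip -9 /tmp/amdgpu-coredump.bin   # produces /tmp/amdgpu-coredump.bin.gz
```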

I have a workaround that helped on my FW16, which showed a similar (but not identical) error in dmesg.
You're on a FW Desktop, so it might not help you there, but it's worth a try.

It is a bit of a workaround really, and it might cause some games to stop working.
But it seems to fix ROCm / LLM problems for me.

I’ll check it out, thanks!

This looks like the long-running compute issue. There is a fix in the 6.14 OEM proposed kernel (-1014 is the kernel revision). Enable -proposed and try that kernel.

Note: the fix is NOT in HWE. It’s only in OEM.
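For anyone else following along, something like this should get you onto the proposed OEM kernel on 24.04 ("noble"). The `linux-oem-24.04c` metapackage name is my best guess for the one tracking the 6.14 OEM series, so double-check in `apt` before installing:

```shell
# Enable the -proposed pocket for noble.
echo 'deb http://archive.ubuntu.com/ubuntu noble-proposed main restricted universe multiverse' \
  | sudo tee /etc/apt/sources.list.d/noble-proposed.list
sudo apt update

# Pull the OEM kernel from -proposed (metapackage name is an assumption;
# installing the versioned image/modules packages directly also works).
sudo apt install -t noble-proposed linux-oem-24.04c
sudo reboot

# After the reboot, confirm you booted it:
uname -r   # should report 6.14.0-1014-oem
```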

Also, the optimizations for Ryzen AI and Ryzen AI Max are only in ROCm 6.4.4 or 7.0.2. I think you followed the Instinct instructions. Correct link:
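A quick way to check which ROCm build you actually ended up with (7.0.1, as in the dpkg output above, would predate those optimizations):

```shell
# Print the installed rocm-core version, trimmed to major.minor.patch.
ver=$(dpkg-query -W -f '${Version}' rocm-core 2>/dev/null | cut -d. -f1-3)
echo "rocm-core: ${ver:-not installed}"
```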

Thanks for the info!

On a plane right now but I’ll try it out as soon as I can 👍

@Mario_Limonciello you were exactly right. Got it working with:

  • Kernel: 6.14.0-1014-oem
  • ROCm: 6.4.4

Thanks a bunch!