AMD eGPU on Linux

It’s probably coming from the amdkfd driver. Finding out exactly where might require ftrace (or, alternatively, adding a bunch of printk/pr_info calls).

If I’m not mistaken, /dev/kfd is handled by kfd_open in drivers/gpu/drm/amd/amdkfd/kfd_chardev.c, but there are no lines returning EINVAL there.

Yeah, it’s going to be deeper in one of the calls made in the ioctl. It’s not obvious to me where, which is why I was hoping you could find it on your setup, since you can hotplug like that.

One hint that might help: turn on dynamic debugging for kfd_process.c.

There are a bunch of messages in failure paths that might show where it happened.
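A minimal sketch of that hint, assuming the kernel was built with CONFIG_DYNAMIC_DEBUG and debugfs is mounted at /sys/kernel/debug. Dynamic debug takes queries of the form `file <name> +p` written to its control file. The helper names `build_query` and `apply_query` are made up for illustration.

```python
from pathlib import Path

# Usual location of the dynamic debug control file. This assumes debugfs is
# mounted at /sys/kernel/debug and CONFIG_DYNAMIC_DEBUG is enabled.
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def build_query(source_file: str, enable: bool = True) -> str:
    """Build a query that toggles every pr_debug call site in one source file."""
    flag = "+p" if enable else "-p"
    return f"file {source_file} {flag}"


def apply_query(query: str, control: Path = CONTROL) -> None:
    """Write the query to the control file (needs root on a real system)."""
    with open(control, "w") as fh:
        fh.write(query + "\n")


if __name__ == "__main__":
    # Only print the query here; call apply_query(query) as root to enable it.
    query = build_query("kfd_process.c")
    print(query)
```

Once the query is applied, the extra messages should appear in the kernel log (dmesg) the next time the failure path is hit. Passing `enable=False` builds the query that turns them off again.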

For now, I’ve sent out the patch that you’ve tested so far, so we can get that half reviewed and merged.

Once you get some dynamic debug output, or something else that hints more at the second part of the problem, we can figure out a patch for that.

Thanks a lot! Will try to test it this week.

I’ve got an updated patch, can you try this instead?

Spoiler alert: it almost works!

Ok. I’ve removed the previous patch and applied the most recent version where amdgpu_amdkfd_device_fini_sw is called between amdgpu_ttm_set_buffer_funcs_status and amdgpu_device_ip_fini_early.

Before attaching the eGPU, I checked whether everything else was OK.

The first thing I noticed is that rocminfo now fails to display info about the integrated GPU; only the CPU node is shown. Note the “The GPU node has an unrecognized id.” line:

korvin@fw13:~$ uname -r
6.18.0

korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1

korvin@fw13:~$ rocminfo 
ROCk module is loaded
Warning: Agent creation failed.
The GPU node has an unrecognized id.

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      49152(0xc000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5157                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*** Done ***             

I’m not sure whether it’s caused by the recent patch, but that was the only change I made in the meantime.
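To see which topology node rocminfo is tripping over, it can help to read the nodes directly from sysfs. The sketch below assumes each node directory under /sys/class/kfd/kfd/topology/nodes/ exposes `gpu_id` and `name` files. As far as I know, a `gpu_id` of 0 denotes a CPU-only node. The function names are made up for illustration.

```python
from pathlib import Path

# KFD topology root, as listed in the session above.
NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def read_text(path: Path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def list_nodes(root: Path = NODES):
    """Return (node index, gpu_id, name) tuples sorted by node index."""
    entries = [p for p in root.iterdir() if p.name.isdigit()]
    entries.sort(key=lambda p: int(p.name))
    return [
        (int(p.name), read_text(p / "gpu_id"), read_text(p / "name"))
        for p in entries
    ]


if __name__ == "__main__":
    if NODES.is_dir():
        for index, gpu_id, name in list_nodes():
            print(f"node {index}: gpu_id={gpu_id} name={name}")
    else:
        print(f"{NODES} not found (is amdkfd loaded?)")
```

Comparing this listing with what rocminfo reports shows whether the kernel side still exposes the iGPU node and only the userspace tool fails to recognize it.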

After connecting the eGPU:

korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1  2

korvin@fw13:~$ rocminfo 
ROCk module is loaded
Warning: Agent creation failed.
The GPU node has an unrecognized id.

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      49152(0xc000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5157                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5700 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 29471(0x731f)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   1280                               
  Internal Node ID:        2                                  
  Compute Unit:            40                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 149                                
  SDMA engine uCode::      35                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             
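A side check on the output above: assuming rocminfo’s BDFID uses the conventional PCI packing (bus in bits 15:8, device in bits 7:3, function in bits 2:0), the value 1280 can be decoded to a bus address and compared with what lspci or the kernel log shows for the eGPU. The helper name `decode_bdf` is made up for illustration.

```python
def decode_bdf(bdfid: int) -> str:
    """Decode a packed PCI bus/device/function id into 'bb:dd.f' form.

    Assumes the conventional layout: bus in bits 15:8, device in bits 7:3,
    function in bits 2:0.
    """
    bus = (bdfid >> 8) & 0xFF
    device = (bdfid >> 3) & 0x1F
    function = bdfid & 0x7
    return f"{bus:02x}:{device:02x}.{function}"


if __name__ == "__main__":
    # 1280 is the BDFID reported for the RX 5700 XT agent above.
    print(decode_bdf(1280))  # prints 05:00.0
```

Under that assumption, 1280 (0x500) corresponds to bus 05, device 00, function 0.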

After disconnecting the eGPU:

korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1

rocminfo still works! As expected, it shows info on the CPU node.

After reconnecting the eGPU:

korvin@fw13:~$ ls /sys/class/kfd/kfd/topology/nodes/
0  1  2

korvin@fw13:~$ rocminfo 
ROCk module is loaded
Warning: Agent creation failed.
The GPU node has an unrecognized id.

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      49152(0xc000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5157                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49112696(0x2ed6678) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5700 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 29471(0x731f)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   1280                               
  Internal Node ID:        2                                  
  Compute Unit:            40                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 149                                
  SDMA engine uCode::      35                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

I set up ollama to use ROCm rather than Vulkan and checked that, even after a reconnect, it does indeed run on the eGPU. LACT shows VRAM allocated on the eGPU, and I can clearly see that the inference happens there as well.

P.S.: The kernel log was normal at all times. I haven’t seen any errors, just the normal reaction to sudden link loss when I yanked the cable.

The only issue I see so far is the missing rocminfo entry for the iGPU.

Update:

The issue with the iGPU not being recognized is probably because, while doing the strace, I replaced the snap package of rocminfo with the older apt package.

I reinstalled the snap package again and…

korvin@fw13:~$ rocminfo
ROCk module is loaded
/usr/share/libdrm/amdgpu.ids: No such file or directory
/usr/share/libdrm/amdgpu.ids: No such file or directory
hsa api call failure at: /build/rocminfo/parts/rocminfo/src/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

The older apt version, however, still works fine. So it’s probably a tool issue, not a kernel one.

But the most important thing remains: ollama is now able to run on the eGPU even after a cable reconnect.

Great, that sounds solid now. I’ll get the patch landed in the kernel.

Awesome. Thank you so much for spending your time on this! I really appreciate it. The Framework community is privileged to have you as a member.

I have some new info. I’m not sure it’s related to the recent changes, but I’ve just experienced it twice in a row.

I just connect the TB cable and this happens:

[15863.720391] thunderbolt 0-2: new device found, vendor=0x215 device=0x2
[15863.720402] thunderbolt 0-2: TB4 HOME TB4 eGFX
[15864.438885] thunderbolt 0-0:2.1: new retimer found, vendor=0x1da0 device=0x8833
[15864.553230] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
[15864.553239] pcieport 0000:00:01.1: pciehp: Slot(0): Link Up
[15864.674561] pci 0000:01:00.0: [8086:1576] type 01 class 0x060400 PCIe Switch Upstream Port
[15864.674621] pci 0000:01:00.0: PCI bridge to [bus 00]
[15864.674640] pci 0000:01:00.0:   bridge window [io  0x0000-0x0fff]
[15864.674647] pci 0000:01:00.0:   bridge window [mem 0x00000000-0x000fffff]
[15864.674666] pci 0000:01:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[15864.674687] pci 0000:01:00.0: enabling Extended Tags
[15864.674925] pci 0000:01:00.0: supports D1 D2
[15864.674927] pci 0000:01:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[15864.675654] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 8.000 Gb/s with 2.5 GT/s PCIe x4 link)
[15864.676004] pci 0000:01:00.0: Adding to iommu group 29
[15864.676209] pcieport 0000:00:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[15864.677770] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[15864.677920] pci 0000:02:01.0: [8086:1576] type 01 class 0x060400 PCIe Switch Downstream Port
[15864.677961] pci 0000:02:01.0: PCI bridge to [bus 00]
[15864.677972] pci 0000:02:01.0:   bridge window [io  0x0000-0x0fff]
[15864.677977] pci 0000:02:01.0:   bridge window [mem 0x00000000-0x000fffff]
[15864.677998] pci 0000:02:01.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[15864.678020] pci 0000:02:01.0: enabling Extended Tags
[15864.678164] pci 0000:02:01.0: supports D1 D2
[15864.678166] pci 0000:02:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[15864.678539] pci 0000:02:01.0: Adding to iommu group 30
[15864.678754] pci 0000:01:00.0: PCI bridge to [bus 02-5f]
[15864.678778] pci 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[15864.678922] pci 0000:03:00.0: [1002:1478] type 01 class 0x060400 PCIe Switch Upstream Port
[15864.678985] pci 0000:03:00.0: BAR 0 [mem 0x00000000-0x00003fff]
[15864.678995] pci 0000:03:00.0: PCI bridge to [bus 00]
[15864.679011] pci 0000:03:00.0:   bridge window [io  0x0000-0x0fff]
[15864.679019] pci 0000:03:00.0:   bridge window [mem 0x00000000-0x000fffff]
[15864.679048] pci 0000:03:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[15864.679319] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[15864.679573] pci 0000:03:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[15864.680075] pci 0000:03:00.0: Adding to iommu group 30
[15864.680160] pci 0000:02:01.0: PCI bridge to [bus 03-5f]
[15864.680182] pci 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[15864.680422] pci 0000:04:00.0: [1002:1479] type 01 class 0x060400 PCIe Switch Downstream Port
[15864.680497] pci 0000:04:00.0: PCI bridge to [bus 00]
[15864.680515] pci 0000:04:00.0:   bridge window [io  0x0000-0x0fff]
[15864.680524] pci 0000:04:00.0:   bridge window [mem 0x00000000-0x000fffff]
[15864.680554] pci 0000:04:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[15864.680906] pci 0000:04:00.0: PME# supported from D0 D3hot D3cold
[15864.681365] pci 0000:04:00.0: Adding to iommu group 30
[15864.681489] pci 0000:03:00.0: PCI bridge to [bus 04-5f]
[15864.681527] pci 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[15864.682018] pci 0000:05:00.0: [1002:731f] type 00 class 0x030000 PCIe Legacy Endpoint
[15864.682141] pci 0000:05:00.0: BAR 0 [mem 0x00000000-0x0fffffff 64bit pref]
[15864.682151] pci 0000:05:00.0: BAR 2 [mem 0x00000000-0x001fffff 64bit pref]
[15864.682157] pci 0000:05:00.0: BAR 4 [io  0x0000-0x00ff]
[15864.682163] pci 0000:05:00.0: BAR 5 [mem 0x00000000-0x0007ffff]
[15864.682168] pci 0000:05:00.0: ROM [mem 0x00000000-0x0001ffff pref]
[15864.682572] pci 0000:05:00.0: PME# supported from D1 D2 D3hot D3cold
[15864.682963] pci 0000:05:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[15864.683187] pci 0000:05:00.0: Adding to iommu group 30
[15864.683209] pci 0000:05:00.0: vgaarb: setting as boot VGA device
[15864.683212] pci 0000:05:00.0: vgaarb: bridge control possible
[15864.683213] pci 0000:05:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[15864.683322] pci 0000:05:00.1: [1002:ab38] type 00 class 0x040300 PCIe Legacy Endpoint
[15864.683436] pci 0000:05:00.1: BAR 0 [mem 0x00000000-0x00003fff]
[15864.683702] pci 0000:05:00.1: PME# supported from D1 D2 D3hot D3cold
[15864.683941] pci 0000:05:00.1: Adding to iommu group 30
[15864.684084] pci 0000:04:00.0: PCI bridge to [bus 05-5f]
[15864.684119] pci_bus 0000:05: busn_res: [bus 05-5f] end is updated to 05
[15864.684131] pci_bus 0000:04: busn_res: [bus 04-5f] end is updated to 05
[15864.684142] pci_bus 0000:03: busn_res: [bus 03-5f] end is updated to 5f
[15864.684149] pci_bus 0000:02: busn_res: [bus 02-5f] end is updated to 5f
[15864.684173] pci 0000:01:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[15864.684176] pci 0000:01:00.0: bridge window [mem 0x98000000-0xafffffff]: assigned
[15864.684179] pci 0000:01:00.0: bridge window [io  0x7000-0xafff]: assigned
[15864.684183] pci 0000:02:01.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[15864.684186] pci 0000:02:01.0: bridge window [mem 0x98000000-0xafffffff]: assigned
[15864.684188] pci 0000:02:01.0: bridge window [io  0x7000-0xafff]: assigned
[15864.684191] pci 0000:03:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[15864.684193] pci 0000:03:00.0: bridge window [mem 0x98000000-0xafefffff]: assigned
[15864.684196] pci 0000:03:00.0: BAR 0 [mem 0xaff00000-0xaff03fff]: assigned
[15864.684204] pci 0000:03:00.0: bridge window [io  0x7000-0xafff]: assigned
[15864.684207] pci 0000:04:00.0: bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]: assigned
[15864.684209] pci 0000:04:00.0: bridge window [mem 0x98000000-0xafefffff]: assigned
[15864.684211] pci 0000:04:00.0: bridge window [io  0x7000-0xafff]: assigned
[15864.684215] pci 0000:05:00.0: BAR 0 [mem 0x3800000000-0x380fffffff 64bit pref]: assigned
[15864.684240] pci 0000:05:00.0: BAR 2 [mem 0x3810000000-0x38101fffff 64bit pref]: assigned
[15864.684263] pci 0000:05:00.0: BAR 5 [mem 0x98000000-0x9807ffff]: assigned
[15864.684272] pci 0000:05:00.0: ROM [mem 0x98080000-0x9809ffff pref]: assigned
[15864.684274] pci 0000:05:00.1: BAR 0 [mem 0x980a0000-0x980a3fff]: assigned
[15864.684282] pci 0000:05:00.0: BAR 4 [io  0x7000-0x70ff]: assigned
[15864.684291] pci 0000:04:00.0: PCI bridge to [bus 05]
[15864.684296] pci 0000:04:00.0:   bridge window [io  0x7000-0xafff]
[15864.684308] pci 0000:04:00.0:   bridge window [mem 0x98000000-0xafefffff]
[15864.684316] pci 0000:04:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[15864.684330] pci 0000:03:00.0: PCI bridge to [bus 04-05]
[15864.684334] pci 0000:03:00.0:   bridge window [io  0x7000-0xafff]
[15864.684345] pci 0000:03:00.0:   bridge window [mem 0x98000000-0xafefffff]
[15864.684353] pci 0000:03:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[15864.684367] pci 0000:02:01.0: PCI bridge to [bus 03-5f]
[15864.684370] pci 0000:02:01.0:   bridge window [io  0x7000-0xafff]
[15864.684376] pci 0000:02:01.0:   bridge window [mem 0x98000000-0xafffffff]
[15864.684381] pci 0000:02:01.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[15864.684390] pci 0000:01:00.0: PCI bridge to [bus 02-5f]
[15864.684393] pci 0000:01:00.0:   bridge window [io  0x7000-0xafff]
[15864.684399] pci 0000:01:00.0:   bridge window [mem 0x98000000-0xafffffff]
[15864.684404] pci 0000:01:00.0:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[15864.684412] pcieport 0000:00:01.1: PCI bridge to [bus 01-5f]
[15864.684415] pcieport 0000:00:01.1:   bridge window [io  0x7000-0xafff]
[15864.684418] pcieport 0000:00:01.1:   bridge window [mem 0x98000000-0xafffffff]
[15864.684421] pcieport 0000:00:01.1:   bridge window [mem 0x3800000000-0x57ffffffff 64bit pref]
[15864.684822] pcieport 0000:01:00.0: enabling device (0000 -> 0003)
[15864.685050] pcieport 0000:02:01.0: enabling device (0000 -> 0003)
[15864.685189] pcieport 0000:02:01.0: pciehp: Slot #1 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
[15864.685718] pcieport 0000:03:00.0: enabling device (0000 -> 0003)
[15864.685954] pcieport 0000:04:00.0: enabling device (0000 -> 0003)
[15864.686678] pci 0000:05:00.0: disabling ATS
[15864.686823] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[15864.686847] amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1458:0x2313 0xC1).
[15864.686993] amdgpu 0000:05:00.0: amdgpu: register mmio base: 0x98000000
[15864.686995] amdgpu 0000:05:00.0: amdgpu: register mmio size: 524288
[15864.687074] amdgpu 0000:05:00.0: amdgpu: failed to read discovery info from memory, vram size read: 0
[15864.687085] amdgpu 0000:05:00.0: amdgpu: [drm] *ERROR* discovery failed: -2
[15864.687087] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[15864.687090] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[15864.687106] amdgpu 0000:05:00.0: probe with driver amdgpu failed with error -2
[15864.687253] pci 0000:05:00.1: D0 power state depends on 0000:05:00.0
[15864.687308] snd_hda_intel 0000:05:00.1: enabling device (0000 -> 0002)
[15864.687371] snd_hda_intel 0000:05:00.1: Handle vga_switcheroo audio client
[15864.687374] snd_hda_intel 0000:05:00.1: Force to non-snoop mode

Also, yesterday, after all those connect/disconnect experiments, my laptop froze when I tried to put it to sleep after disconnecting the cable for the last time.

The failure to read discovery info shouldn’t be related to those changes. But I am wondering if the freeze is related to some resources not being freed upon unplugging.

Keep an eye on it over the next week and try to find a pattern in when this happens; then we can think about next steps.

I’ve just caught a crash in kfd_cleanup_nodes.

TL;DR:

[  208.207050]  <TASK>
[  208.207054]  device_queue_manager_uninit+0x26/0xb0 [amdgpu]
[  208.207226]  kfd_cleanup_nodes+0x56/0xe0 [amdgpu]
[  208.207428]  kgd2kfd_device_exit+0x4a/0xa0 [amdgpu]
[  208.207599]  amdgpu_amdkfd_device_fini_sw+0x25/0x60 [amdgpu]
[  208.207774]  amdgpu_device_fini_hw+0x190/0x45e [amdgpu]
[  208.208087]  amdgpu_driver_unload_kms+0x4f/0x60 [amdgpu]
[  208.208243]  amdgpu_pci_remove+0x4d/0x90 [amdgpu]
[  208.208398]  pci_device_remove+0x4b/0xc0
[  208.208404]  device_remove+0x43/0x80
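
For comparing a trace like this one against traces from later runs, it can help to strip the timestamps and offsets and keep only the function names. A minimal Python sketch, assuming the trace is pasted as plain dmesg text; the regular expression and the `extract_frames` helper are illustrative and not part of any kernel tooling:

```python
import re

# Matches a stack frame line such as
# "[  208.207226]  kfd_cleanup_nodes+0x56/0xe0 [amdgpu]".
# A leading "? " marks a frame the unwinder is unsure about.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def extract_frames(dmesg_text, include_unreliable=False):
    """Return a list of (function, module) tuples from pasted dmesg output.

    module is None for frames that belong to the core kernel.
    """
    frames = []
    for line in dmesg_text.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue  # not a stack frame (timestamps only, registers, etc.)
        if m.group("unreliable") and not include_unreliable:
            continue
        frames.append((m.group("func"), m.group("module")))
    return frames
```

Feeding it the lines above would give a list starting with `("device_queue_manager_uninit", "amdgpu")`, which is easy to diff between runs.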

The basic test setup is the following:

  1. Connect eGPU
  2. Set up ollama to use either Vulkan or ROCm
  3. Restart ollama service to pick up the new settings
  4. Load a model and run it, e.g. ollama run qwen3:8b
  5. Make sure it runs on eGPU
  6. Interrupt the ollama output and yank the TB4 cable whilst the model is still in VRAM
  7. Put the laptop to sleep (not needed actually, the error happens before)

I tested several times and am now pretty sure that the system hard-freezes only when ollama was running on ROCm. The Vulkan setup, while slow as hell to load, still runs OK and does not cause any freezes afterwards.

Post-disconnect errors for the Vulkan scenario:

[23901.976656] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[23901.976679] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[23902.082508] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[23902.320878] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[23902.320894] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[23902.320902] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[23902.487464] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[23902.759351] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[23902.821816] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[23902.821832] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[23902.821839] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[23903.003569] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[23903.027748] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[23903.294303] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[23903.824267] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[23903.824804] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[23903.824806] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[23903.824810] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[23903.824898] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000001ff18020; ring_buffer_end = 000000004b60bbf9; write_frame = 00000000a44b2101
[23903.824902] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[23903.824987] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000001ff18020; ring_buffer_end = 000000004b60bbf9; write_frame = 00000000a44b2101
[23903.824989] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[23903.825074] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000001ff18020; ring_buffer_end = 000000004b60bbf9; write_frame = 00000000a44b2101
[23903.825076] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[23903.825161] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 000000001ff18020; ring_buffer_end = 000000004b60bbf9; write_frame = 00000000a44b2101
[23903.825162] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[23904.111091] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[23904.111095] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring

Post-disconnect errors for the ROCm scenario:

[  197.782105] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
[  197.782118] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[  197.782171] thunderbolt 0-0:2.1: retimer disconnected
[  197.783548] thunderbolt 0-2: device disconnected
[  197.844546] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[  197.844570] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[  197.948975] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[  198.011494] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  198.011508] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[  198.011515] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[  198.353818] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[  198.513442] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  198.513457] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[  198.513463] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[  198.654054] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[  198.944192] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[  198.948380] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[  207.950398] amdgpu 0000:05:00.0: amdgpu: qcm fence wait loop timeout expired
[  207.950406] amdgpu 0000:05:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  207.950429] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!. Source:  4
[  207.950438] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  207.950442] amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -19
[  208.206190] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[  208.206428] ------------[ cut here ]------------
[  208.206429] WARNING: CPU: 7 PID: 205 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:1546 uninitialize+0x63/0x80 [amdgpu]
[  208.206712] Modules linked in: ccm(E) rfcomm(E) snd_seq_dummy(E) snd_hrtimer(E) xt_conntrack(E) xt_MASQUERADE(E) xt_set(E) ip_set(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xt_addrtype(E) nft_compat(E) nf_tables(E) xfrm_user(E) xfrm_algo(E) qrtr(E) cmac(E) algif_hash(E) algif_skcipher(E) af_alg(E) bnep(E) snd_ctl_led(E) snd_acp_legacy_mach(E) snd_acp_mach(E) snd_soc_nau8821(E) snd_acp3x_rn(E) binfmt_misc(E) snd_acp70(E) snd_acp_i2s(E) snd_acp_pdm(E) snd_soc_dmic(E) snd_acp_pcm(E) snd_sof_amd_acp70(E) snd_sof_amd_acp63(E) snd_sof_amd_vangogh(E) snd_sof_amd_rembrandt(E) snd_sof_amd_renoir(E) snd_sof_amd_acp(E) intel_rapl_msr(E) amd_atl(E) nls_iso8859_1(E) snd_sof_pci(E) snd_sof_xtensa_dsp(E) intel_rapl_common(E) snd_hda_codec_alc269(E) snd_hda_scodec_component(E) snd_sof(E) snd_hda_codec_realtek_lib(E) snd_sof_utils(E) snd_hda_codec_generic(E) snd_pci_ps(E) snd_soc_acpi_amd_match(E) snd_amd_sdw_acpi(E) soundwire_amd(E) snd_hda_codec_atihdmi(E) soundwire_generic_allocation(E)
[  208.206745]  soundwire_bus(E) snd_hda_codec_hdmi(E) snd_soc_sdca(E) amdgpu(E) edac_mce_amd(E) snd_hda_intel(E) snd_hda_codec(E) snd_soc_core(E) snd_usb_audio(E) snd_hda_core(E) snd_compress(E) amdxcp(E) kvm_amd(E) snd_intel_dspcfg(E) snd_usbmidi_lib(E) drm_panel_backlight_quirks(E) ac97_bus(E) snd_intel_sdw_acpi(E) snd_hwdep(E) snd_ump(E) btusb(E) snd_pcm_dmaengine(E) drm_buddy(E) leds_cros_ec(E) cros_ec_chardev(E) cros_charge_control(E) uvcvideo(E) cros_ec_debugfs(E) gpio_cros_ec(E) cros_ec_hwmon(E) led_class_multicolor(E) cros_ec_sysfs(E) cros_kbd_led_backlight(E) snd_rpl_pci_acp6x(E) spd5118(E) btmtk(E) snd_seq_midi(E) drm_ttm_helper(E) videobuf2_vmalloc(E) hid_sensor_als(E) btrtl(E) snd_seq_midi_event(E) snd_acp_pci(E) kvm(E) uvc(E) ttm(E) rtw88_8822be(E) hid_sensor_trigger(E) videobuf2_memops(E) btbcm(E) snd_amd_acpi_mach(E) rtw88_8822b(E) drm_exec(E) industrialio_triggered_buffer(E) irqbypass(E) snd_acp_legacy_common(E) videobuf2_v4l2(E) snd_rawmidi(E) rtw88_pci(E) drm_suballoc_helper(E) btintel(E) kfifo_buf(E)
[  208.206780]  snd_pci_acp6x(E) videobuf2_common(E) cros_ec_dev(E) hid_sensor_iio_common(E) rapl(E) rtw88_core(E) videodev(E) drm_display_helper(E) snd_pcm(E) snd_seq(E) bluetooth(E) industrialio(E) snd_seq_device(E) wmi_bmof(E) mc(E) snd_pci_acp5x(E) cec(E) mac80211(E) snd_timer(E) i2c_piix4(E) snd_rn_pci_acp3x(E) k10temp(E) amdxdna(E) rc_core(E) snd_acp_config(E) i2c_smbus(E) snd(E) cfg80211(E) i2c_algo_bit(E) snd_soc_acpi(E) gpu_sched(E) snd_pci_acp3x(E) soundcore(E) libarc4(E) amd_pmf(E) amdtee(E) ccp(E) amd_sfh(E) cros_ec_lpcs(E) tee(E) cros_ec(E) amd_pmc(E) platform_profile(E) cros_ec_proto(E) joydev(E) input_leds(E) mac_hid(E) serio_raw(E) sch_fq_codel(E) overlay(E) iptable_filter(E) ip6table_filter(E) ip6_tables(E) br_netfilter(E) bridge(E) stp(E) llc(E) arp_tables(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E) efi_pstore(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) autofs4(E) cdc_ether(E) usbnet(E) uas(E) usb_storage(E) r8152(E) mii(E) dm_crypt(E) usbhid(E) nvme(E) ucsi_acpi(E) nvme_core(E)
[  208.206825]  typec_ucsi(E) nvme_keyring(E) typec(E) nvme_auth(E) hid_multitouch(E) hid_sensor_hub(E) hid_generic(E) polyval_clmulni(E) ghash_clmulni_intel(E) hkdf(E) i2c_hid_acpi(E) thunderbolt(E) video(E) i2c_hid(E) wmi(E) hid(E) aesni_intel(E)
[  208.206837] CPU: 7 UID: 0 PID: 205 Comm: irq/34-pciehp Tainted: G            E       6.18.0 #5 PREEMPT(voluntary) 
[  208.206841] Tainted: [E]=UNSIGNED_MODULE
[  208.206842] Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, BIOS 03.03 03/10/2025
[  208.206844] RIP: 0010:uninitialize+0x63/0x80 [amdgpu]
[  208.207033] Code: 73 2c 49 8b bc dc c8 00 00 00 48 83 c3 01 e8 a4 57 ab c1 48 83 fb 04 75 e0 5b 41 5c 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc <0f> 0b eb bc 48 c7 c7 00 6e 4c c2 e8 8d c5 fd c1 eb c6 66 66 2e 0f
[  208.207035] RSP: 0018:ffffcb42c09e3a70 EFLAGS: 00010202
[  208.207037] RAX: ffffffffc1b15e10 RBX: ffff89b2247bd800 RCX: 0000000000000000
[  208.207039] RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff89b2247bd800
[  208.207040] RBP: ffffcb42c09e3a80 R08: 0000000000000000 R09: 0000000000000000
[  208.207041] R10: 0000000000000000 R11: 0000000000000000 R12: ffff89b2247bd800
[  208.207041] R13: 0000000000000000 R14: ffff89b20a97f800 R15: 0000000000000001
[  208.207043] FS:  0000000000000000(0000) GS:ffff89bdc8004000(0000) knlGS:0000000000000000
[  208.207044] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  208.207045] CR2: 00007412903c5b0c CR3: 00000001dce40000 CR4: 0000000000f50ef0
[  208.207047] PKRU: 55555554
[  208.207048] Call Trace:
[  208.207050]  <TASK>
[  208.207054]  device_queue_manager_uninit+0x26/0xb0 [amdgpu]
[  208.207226]  kfd_cleanup_nodes+0x56/0xe0 [amdgpu]
[  208.207428]  kgd2kfd_device_exit+0x4a/0xa0 [amdgpu]
[  208.207599]  amdgpu_amdkfd_device_fini_sw+0x25/0x60 [amdgpu]
[  208.207774]  amdgpu_device_fini_hw+0x190/0x45e [amdgpu]
[  208.208087]  amdgpu_driver_unload_kms+0x4f/0x60 [amdgpu]
[  208.208243]  amdgpu_pci_remove+0x4d/0x90 [amdgpu]
[  208.208398]  pci_device_remove+0x4b/0xc0
[  208.208404]  device_remove+0x43/0x80
[  208.208408]  device_release_driver_internal+0x1fb/0x260
[  208.208411]  device_release_driver+0x12/0x20
[  208.208413]  pci_stop_bus_device+0x5f/0x80
[  208.208417]  pci_stop_bus_device+0x30/0x80
[  208.208419]  pci_stop_bus_device+0x30/0x80
[  208.208422]  pci_stop_bus_device+0x30/0x80
[  208.208424]  pci_stop_bus_device+0x30/0x80
[  208.208426]  pci_stop_and_remove_bus_device+0x16/0x30
[  208.208429]  pciehp_unconfigure_device+0x96/0x1a0
[  208.208433]  pciehp_disable_slot+0x68/0x110
[  208.208435]  pciehp_handle_presence_or_link_change+0x74/0x360
[  208.208438]  pciehp_ist+0x15b/0x1f0
[  208.208441]  irq_thread_fn+0x26/0x70
[  208.208445]  irq_thread+0x1cb/0x370
[  208.208447]  ? __pfx_irq_thread_fn+0x10/0x10
[  208.208450]  ? __pfx_irq_thread_dtor+0x10/0x10
[  208.208452]  ? __pfx_irq_thread+0x10/0x10
[  208.208454]  kthread+0x10b/0x220
[  208.208457]  ? __pfx_kthread+0x10/0x10
[  208.208459]  ret_from_fork+0x202/0x230
[  208.208463]  ? __pfx_kthread+0x10/0x10
[  208.208464]  ret_from_fork_asm+0x1a/0x30
[  208.208468]  </TASK>
[  208.208469] ---[ end trace 0000000000000000 ]---
[  208.738660] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[  208.739202] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  208.739204] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[  208.739208] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[  208.739295] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000580f693e; ring_buffer_end = 000000006ad967c3; write_frame = 00000000ea9aba57
[  208.739298] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  208.739391] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000580f693e; ring_buffer_end = 000000006ad967c3; write_frame = 00000000ea9aba57
[  208.739392] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  208.739480] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000580f693e; ring_buffer_end = 000000006ad967c3; write_frame = 00000000ea9aba57
[  208.739482] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  208.739567] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000580f693e; ring_buffer_end = 000000006ad967c3; write_frame = 00000000ea9aba57
[  208.739569] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  209.024952] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[  209.024955] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring

So yeah, no wonder it freezes, lol.

The WARN mentioned in the log is actually this:

static void uninitialize(struct device_queue_manager *dqm)
{
    int i;

    WARN_ON(dqm->active_queue_count > 0 || dqm->processes_count > 0);
    kfree(dqm->allocated_queues);
    for (i = 0 ; i < KFD_MQD_TYPE_MAX ; i++)
        kfree(dqm->mqd_mgrs[i]);
    mutex_destroy(&dqm->lock_hidden);
}

If I’m not mistaken, that means the queues were not properly deactivated before being deallocated, which in turn can cause the crash.
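
To make the condition concrete, here is a toy model in Python of the invariant that the `WARN_ON` above checks. It is not kernel code: only the two counter names and the condition are taken from the snippet, while the class and its methods are hypothetical stand-ins for whatever creates and destroys queues in the real driver.

```python
class ToyQueueManager:
    """Toy stand-in for the two counters that the WARN_ON inspects."""

    def __init__(self):
        self.active_queue_count = 0
        self.processes_count = 0

    # Hypothetical bookkeeping helpers; the real driver updates these
    # counters in several places that are not modelled here.
    def register_process(self):
        self.processes_count += 1

    def unregister_process(self):
        self.processes_count -= 1

    def create_queue(self):
        self.active_queue_count += 1

    def destroy_queue(self):
        self.active_queue_count -= 1

    def uninitialize(self):
        # Same condition as the WARN_ON: True means the warning would fire.
        return self.active_queue_count > 0 or self.processes_count > 0


# Cable yanked while a compute process still owns a queue: warning fires.
dqm = ToyQueueManager()
dqm.register_process()
dqm.create_queue()
warned_on_unplug = dqm.uninitialize()

# Queue destroyed and process gone before teardown: no warning.
dqm.destroy_queue()
dqm.unregister_process()
warned_after_cleanup = dqm.uninitialize()
```

In this model `warned_on_unplug` is `True` and `warned_after_cleanup` is `False`, which matches the test scenario: ollama still holds a queue at the moment the device is removed.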

@Mario_Limonciello Could you give me a few hints on where to look next, please?

I’m not very familiar with this part of the codebase (the KFD driver), but I do think you’re on the right path.

I’m wondering if we have a DRM core bug where we never call amdgpu_drm_release or amdgpu_driver_release_kms.

I looked through and I think that amdgpu_driver_release_kms and amdgpu_drm_release both get called.

But I don’t think any queues will have been destroyed: if there is an active process, nothing could ever successfully call kfd_ioctl_destroy_queue, and thus pqm_destroy_queue doesn’t get called either.

Have a try with this (paired with the other one)

Ok, I’ve just applied the second patch.

The situation has improved. I no longer see any crashes in the log.

[  120.976169] amdgpu: Freeing queue vital buffer 0x7dc351600000, queue evicted
[  135.620847] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
[  135.620860] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[  135.620912] thunderbolt 0-0:2.1: retimer disconnected
[  135.622301] thunderbolt 0-2: device disconnected
[  135.683337] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[  135.683361] pcieport 0000:00:01.1: PME: Spurious native interrupt!
[  135.787618] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[  136.030695] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  136.030709] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[  136.030716] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[  136.192756] snd_hda_intel 0000:05:00.1: CORB reset timeout#2, CORBRP = 65535
[  136.492180] snd_hda_intel 0000:05:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist
[  136.531831] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  136.531847] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000006 message:TransferTableSmu2Dram?
[  136.531854] amdgpu 0000:05:00.0: amdgpu: Failed to export SMU metrics table!
[  136.762923] amdgpu 0000:05:00.0: amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
[  136.765863] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[  136.767360] kfd kfd: amdgpu: Terminating queues for process 4757 on unplugged device
[  145.766275] amdgpu 0000:05:00.0: amdgpu: qcm fence wait loop timeout expired
[  145.766282] amdgpu 0000:05:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  145.766286] amdgpu: Resetting wave fronts (cpsch) on dev 0000000080cb9960
[  145.766306] amdgpu 0000:05:00.0: amdgpu: Didn't find vmid for process pid 4757
[  145.766309] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!. Source:  4
[  145.766313] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  145.766315] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:36 param:0x00000001 message:SetWorkloadMask?
[  145.766318] amdgpu 0000:05:00.0: amdgpu: Failed to set workload mask 0x00000001
[  145.766318] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  145.766324] amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -19
[  154.766317] amdgpu 0000:05:00.0: amdgpu: qcm fence wait loop timeout expired
[  154.766325] amdgpu 0000:05:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  154.766351] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!. Source:  4
[  154.766361] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  154.766366] amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -19
[  155.035488] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[  155.595972] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[  155.596594] amdgpu 0000:05:00.0: amdgpu: device lost from bus!
[  155.596596] amdgpu 0000:05:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[  155.596600] amdgpu 0000:05:00.0: amdgpu: Failed to disable smu features.
[  155.596690] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000025ad415; ring_buffer_end = 00000000f46e136c; write_frame = 00000000a3f9b5a4
[  155.596694] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  155.596780] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000025ad415; ring_buffer_end = 00000000f46e136c; write_frame = 00000000a3f9b5a4
[  155.596782] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  155.596867] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000025ad415; ring_buffer_end = 00000000f46e136c; write_frame = 00000000a3f9b5a4
[  155.596868] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  155.596954] amdgpu 0000:05:00.0: amdgpu: ring_buffer_start = 00000000025ad415; ring_buffer_end = 00000000f46e136c; write_frame = 00000000a3f9b5a4
[  155.596956] amdgpu 0000:05:00.0: amdgpu: write_frame is pointing to address out of bounds
[  155.893374] amdgpu 0000:05:00.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[  155.893379] [drm:psp_v11_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
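
The log above contains two pauses of roughly nine seconds (136.767360 to 145.766275, and 145.766324 to 154.766317), each ending with `qcm fence wait loop timeout expired`. A minimal Python sketch for spotting such pauses in pasted dmesg output; the `find_gaps` helper and its 5-second default threshold are assumptions chosen for illustration:

```python
import re

# dmesg timestamp prefix, e.g. "[  145.766275]"
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_gaps(dmesg_text, threshold=5.0):
    """Return (gap_seconds, previous_line, next_line) for every pause
    between consecutive timestamped lines that is longer than threshold."""
    gaps = []
    prev_ts = None
    prev_line = None
    for line in dmesg_text.splitlines():
        m = TS_RE.match(line)
        if m is None:
            continue  # skip lines without a timestamp
        ts = float(m.group(1))
        if prev_ts is not None and ts - prev_ts > threshold:
            gaps.append((round(ts - prev_ts, 6), prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps
```

Run over the whole paste, it would also report the quiet period before the cable was pulled, so the line that follows each gap is the interesting part.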

However, when I reattach the cable, it fails to initialize:

[  182.455315] thunderbolt 0-2: new device found, vendor=0x215 device=0x2
[  182.455324] thunderbolt 0-2: TB4 HOME TB4 eGFX
[  183.187284] thunderbolt 0-0:2.1: new retimer found, vendor=0x1da0 device=0x8833

And that’s basically it.

Also, one CPU core is 100% busy. Looks like a spin lock that takes forever to be acquired.

Oh, after a couple of minutes it recovered from the stall:

[  369.859595] INFO: task irq/34-pciehp:200 blocked for more than 122 seconds.
[  369.859616]       Tainted: G            E       6.18.0 #6
[  369.859620] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.859623] task:irq/34-pciehp   state:D stack:0     pid:200   tgid:200   ppid:2      task_flags:0x208140 flags:0x00080000
[  369.859633] Call Trace:
[  369.859637]  <TASK>
[  369.859644]  __schedule+0x428/0x1700
[  369.859658]  schedule+0x27/0xf0
[  369.859661]  schedule_timeout+0xcf/0x110
[  369.859668]  wait_for_completion+0x82/0x160
[  369.859672]  memunmap_pages+0x91/0x2b0
[  369.859679]  ? remove_files+0x30/0x70
[  369.859692]  devm_memremap_pages_release+0xe/0x20
[  369.859696]  devm_action_release+0x15/0x30
[  369.859702]  release_nodes+0x3d/0xd0
[  369.859707]  devres_release_all+0x95/0xd0
[  369.859713]  device_unbind_cleanup+0x12/0x90
[  369.859718]  device_release_driver_internal+0x220/0x260
[  369.859723]  device_release_driver+0x12/0x20
[  369.859726]  pci_stop_bus_device+0x5f/0x80
[  369.859733]  pci_stop_bus_device+0x30/0x80
[  369.859737]  pci_stop_bus_device+0x30/0x80
[  369.859740]  pci_stop_bus_device+0x30/0x80
[  369.859744]  pci_stop_bus_device+0x30/0x80
[  369.859747]  pci_stop_and_remove_bus_device+0x16/0x30
[  369.859752]  pciehp_unconfigure_device+0x96/0x1a0
[  369.859758]  pciehp_disable_slot+0x68/0x110
[  369.859762]  pciehp_handle_presence_or_link_change+0x74/0x360
[  369.859767]  pciehp_ist+0x15b/0x1f0
[  369.859772]  irq_thread_fn+0x26/0x70
[  369.859779]  irq_thread+0x1cb/0x370
[  369.859782]  ? __pfx_irq_thread_fn+0x10/0x10
[  369.859786]  ? __pfx_irq_thread_dtor+0x10/0x10
[  369.859790]  ? __pfx_irq_thread+0x10/0x10
[  369.859794]  kthread+0x10b/0x220
[  369.859799]  ? __pfx_kthread+0x10/0x10
[  369.859802]  ret_from_fork+0x202/0x230
[  369.859808]  ? __pfx_kthread+0x10/0x10
[  369.859810]  ret_from_fork_asm+0x1a/0x30
[  369.859818]  </TASK>
[  459.478280] Oops: general protection fault, probably for non-canonical address 0xafda4ad75fa0f245: 0000 [#1] SMP NOPTI
[  459.478296] CPU: 23 UID: 998 PID: 4791 Comm: ollama Tainted: G            E       6.18.0 #6 PREEMPT(voluntary) 
[  459.478301] Tainted: [E]=UNSIGNED_MODULE
[  459.478303] Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, BIOS 03.03 03/10/2025
[  459.478306] RIP: 0010:kfd_process_evict_queues+0x6c/0xe0 [amdgpu]
[  459.478651] Code: 44 89 ea 48 8b 3b 48 8b 47 08 4c 8b 30 49 8b 47 60 8b b0 10 0a 00 00 e8 22 c6 01 00 48 8b 03 48 8d 73 10 48 8b b8 80 00 00 00 <48> 8b 47 78 2e 2e 2e ff d0 89 c2 85 c0 74 09 83 f8 fb 0f 85 fd 84
[  459.478654] RSP: 0018:ffffd3804a5037e0 EFLAGS: 00010246
[  459.478659] RAX: ffff8f32e7749300 RBX: ffff8f321aeffc00 RCX: 0000000000000000
[  459.478661] RDX: 0000000000000000 RSI: ffff8f321aeffc10 RDI: afda4ad75fa0f1cd
[  459.478662] RBP: ffffd3804a503810 R08: 0000000000000000 R09: 0000000000000000
[  459.478664] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[  459.478664] R13: 0000000000000001 R14: ffff8f32f12370c8 R15: ffff8f320122a000
[  459.478666] FS:  0000000000000000(0000) GS:ffff8f3d90a04000(0000) knlGS:0000000000000000
[  459.478668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  459.478670] CR2: 0000789227b88ac0 CR3: 00000001ee800000 CR4: 0000000000f50ef0
[  459.478672] PKRU: 55555554
[  459.478673] Call Trace:
[  459.478676]  <TASK>
[  459.478680]  kgd2kfd_quiesce_mm+0x41/0x80 [amdgpu]
[  459.478897]  amdgpu_amdkfd_evict_userptr+0xc6/0x100 [amdgpu]
[  459.479118]  amdgpu_hmm_invalidate_hsa+0x35/0x50 [amdgpu]
[  459.479322]  __mmu_notifier_release+0x185/0x200
[  459.479331]  exit_mmap+0x3ba/0x400
[  459.479338]  __mmput+0x41/0x150
[  459.479343]  mmput+0x31/0x40
[  459.479345]  do_exit+0x25f/0xa20
[  459.479351]  do_group_exit+0x2d/0xb0
[  459.479354]  get_signal+0x8b7/0x8f0
[  459.479358]  arch_do_signal_or_restart+0x3a/0x270
[  459.479363]  exit_to_user_mode_loop+0x8b/0x190
[  459.479369]  do_syscall_64+0x222/0xd80
[  459.479375]  ? do_syscall_64+0xba/0xd80
[  459.479377]  ? futex_wait+0x7b/0x140
[  459.479382]  ? sched_clock_noinstr+0x9/0x10
[  459.479386]  ? sched_clock+0x10/0x30
[  459.479390]  ? update_curr+0x38/0x1b0
[  459.479393]  ? place_entity+0xad/0x170
[  459.479396]  ? reweight_entity+0x219/0x240
[  459.479398]  ? nohz_balance_exit_idle+0xac/0xf0
[  459.479403]  ? update_cfs_group+0xa8/0xc0
[  459.479405]  ? sched_balance_trigger+0x160/0x4c0
[  459.479408]  ? sched_tick+0x100/0x280
[  459.479411]  ? __note_gp_changes+0x1f3/0x270
[  459.479416]  ? update_process_times+0xa5/0xe0
[  459.479420]  ? note_gp_changes+0x93/0xa0
[  459.479422]  ? rcu_core+0x1b6/0x3b0
[  459.479425]  ? clockevents_program_event+0xba/0x140
[  459.479429]  ? rcu_core_si+0xe/0x20
[  459.479431]  ? handle_softirqs+0xdf/0x330
[  459.479434]  ? irqentry_exit_to_user_mode+0x2e/0x2a0
[  459.479438]  ? irqentry_exit+0x43/0x50
[  459.479440]  ? sysvec_apic_timer_interrupt+0x54/0xd0
[  459.479442]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  459.479446] RIP: 0033:0x59923b148803
[  459.479448] Code: Unable to access opcode bytes at 0x59923b1487d9.
[  459.479450] RSP: 002b:000074782f5feb28 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[  459.479453] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 000059923b148803
[  459.479454] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c000330148
[  459.479455] RBP: 000074782f5feb70 R08: 0000000000000000 R09: 0000000000000000
[  459.479456] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
[  459.479457] R13: 0000000000000001 R14: 000000c0003321c0 R15: 0000000000000007
[  459.479459]  </TASK>
[  459.479460] Modules linked in: ccm(E) rfcomm(E) snd_seq_dummy(E) snd_hrtimer(E) xt_conntrack(E) xt_MASQUERADE(E) xt_set(E) ip_set(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xt_addrtype(E) nft_compat(E) nf_tables(E) xfrm_user(E) xfrm_algo(E) qrtr(E) cmac(E) algif_hash(E) algif_skcipher(E) af_alg(E) bnep(E) binfmt_misc(E) snd_ctl_led(E) snd_acp_legacy_mach(E) snd_acp_mach(E) snd_soc_nau8821(E) snd_acp3x_rn(E) nls_iso8859_1(E) snd_acp70(E) snd_acp_i2s(E) snd_soc_dmic(E) snd_acp_pdm(E) snd_acp_pcm(E) snd_sof_amd_acp70(E) snd_sof_amd_acp63(E) snd_sof_amd_vangogh(E) snd_sof_amd_rembrandt(E) snd_sof_amd_renoir(E) snd_sof_amd_acp(E) snd_sof_pci(E) snd_sof_xtensa_dsp(E) snd_sof(E) snd_hda_codec_alc269(E) snd_sof_utils(E) snd_hda_scodec_component(E) snd_pci_ps(E) leds_cros_ec(E) snd_hda_codec_realtek_lib(E) amd_atl(E) intel_rapl_msr(E) snd_soc_acpi_amd_match(E) cros_ec_sysfs(E) gpio_cros_ec(E) cros_ec_debugfs(E) cros_charge_control(E) cros_ec_chardev(E) led_class_multicolor(E)
[  459.479509]  cros_kbd_led_backlight(E) cros_ec_hwmon(E) intel_rapl_common(E) snd_amd_sdw_acpi(E) snd_hda_codec_generic(E) soundwire_amd(E) snd_hda_codec_atihdmi(E) soundwire_generic_allocation(E) snd_hda_codec_hdmi(E) soundwire_bus(E) snd_soc_sdca(E) snd_hda_intel(E) amdgpu(E) spd5118(E) cros_ec_dev(E) snd_soc_core(E) snd_hda_codec(E) snd_compress(E) snd_usb_audio(E) snd_hda_core(E) ac97_bus(E) snd_intel_dspcfg(E) snd_pcm_dmaengine(E) snd_usbmidi_lib(E) snd_intel_sdw_acpi(E) edac_mce_amd(E) amdxcp(E) snd_hwdep(E) uvcvideo(E) rtw88_8822be(E) snd_rpl_pci_acp6x(E) rtw88_8822b(E) snd_ump(E) drm_panel_backlight_quirks(E) videobuf2_vmalloc(E) uvc(E) drm_buddy(E) btusb(E) snd_acp_pci(E) rtw88_pci(E) snd_seq_midi(E) videobuf2_memops(E) drm_ttm_helper(E) snd_amd_acpi_mach(E) snd_seq_midi_event(E) rtw88_core(E) videobuf2_v4l2(E) btmtk(E) kvm_amd(E) ttm(E) hid_sensor_als(E) snd_acp_legacy_common(E) btrtl(E) snd_rawmidi(E) drm_exec(E) snd_pci_acp6x(E) btbcm(E) hid_sensor_trigger(E) videobuf2_common(E) drm_suballoc_helper(E)
[  459.479559]  industrialio_triggered_buffer(E) btintel(E) kvm(E) mac80211(E) snd_seq(E) snd_pci_acp5x(E) drm_display_helper(E) kfifo_buf(E) snd_pcm(E) videodev(E) irqbypass(E) snd_rn_pci_acp3x(E) hid_sensor_iio_common(E) i2c_piix4(E) cec(E) snd_seq_device(E) snd_acp_config(E) rapl(E) industrialio(E) wmi_bmof(E) bluetooth(E) mc(E) k10temp(E) snd_timer(E) i2c_smbus(E) snd_soc_acpi(E) snd(E) amdxdna(E) rc_core(E) amd_pmf(E) cfg80211(E) snd_pci_acp3x(E) i2c_algo_bit(E) gpu_sched(E) libarc4(E) amdtee(E) soundcore(E) ccp(E) amd_sfh(E) joydev(E) tee(E) platform_profile(E) cros_ec_lpcs(E) cros_ec(E) input_leds(E) amd_pmc(E) cros_ec_proto(E) mac_hid(E) serio_raw(E) sch_fq_codel(E) overlay(E) iptable_filter(E) ip6table_filter(E) ip6_tables(E) br_netfilter(E) bridge(E) stp(E) llc(E) arp_tables(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E) efi_pstore(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) autofs4(E) cdc_ether(E) usbnet(E) uas(E) r8152(E) usb_storage(E) mii(E) dm_crypt(E) usbhid(E) nvme(E) nvme_core(E)
[  459.479624]  ucsi_acpi(E) nvme_keyring(E) typec_ucsi(E) hid_sensor_hub(E) hid_multitouch(E) typec(E) video(E) polyval_clmulni(E) nvme_auth(E) hid_generic(E) ghash_clmulni_intel(E) hkdf(E) thunderbolt(E) i2c_hid_acpi(E) i2c_hid(E) wmi(E) hid(E) aesni_intel(E)
[  459.479658] ---[ end trace 0000000000000000 ]---
[  459.993348] pstore: backend (efi_pstore) writing error (-28)
[  459.993356] RIP: 0010:kfd_process_evict_queues+0x6c/0xe0 [amdgpu]
[  459.993707] Code: 44 89 ea 48 8b 3b 48 8b 47 08 4c 8b 30 49 8b 47 60 8b b0 10 0a 00 00 e8 22 c6 01 00 48 8b 03 48 8d 73 10 48 8b b8 80 00 00 00 <48> 8b 47 78 2e 2e 2e ff d0 89 c2 85 c0 74 09 83 f8 fb 0f 85 fd 84
[  459.993710] RSP: 0018:ffffd3804a5037e0 EFLAGS: 00010246
[  459.993714] RAX: ffff8f32e7749300 RBX: ffff8f321aeffc00 RCX: 0000000000000000
[  459.993716] RDX: 0000000000000000 RSI: ffff8f321aeffc10 RDI: afda4ad75fa0f1cd
[  459.993717] RBP: ffffd3804a503810 R08: 0000000000000000 R09: 0000000000000000
[  459.993718] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[  459.993719] R13: 0000000000000001 R14: ffff8f32f12370c8 R15: ffff8f320122a000
[  459.993721] FS:  0000000000000000(0000) GS:ffff8f3d90a04000(0000) knlGS:0000000000000000
[  459.993722] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  459.993724] CR2: 0000789227b88ac0 CR3: 0000000400a40000 CR4: 0000000000f50ef0
[  459.993727] PKRU: 55555554
[  459.993729] Fixing recursive fault but reboot is needed!
[  459.993732] BUG: scheduling while atomic: ollama/4791/0x00000000
[  459.993734] Modules linked in: ccm(E) rfcomm(E) snd_seq_dummy(E) snd_hrtimer(E) xt_conntrack(E) xt_MASQUERADE(E) xt_set(E) ip_set(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xt_addrtype(E) nft_compat(E) nf_tables(E) xfrm_user(E) xfrm_algo(E) qrtr(E) cmac(E) algif_hash(E) algif_skcipher(E) af_alg(E) bnep(E) binfmt_misc(E) snd_ctl_led(E) snd_acp_legacy_mach(E) snd_acp_mach(E) snd_soc_nau8821(E) snd_acp3x_rn(E) nls_iso8859_1(E) snd_acp70(E) snd_acp_i2s(E) snd_soc_dmic(E) snd_acp_pdm(E) snd_acp_pcm(E) snd_sof_amd_acp70(E) snd_sof_amd_acp63(E) snd_sof_amd_vangogh(E) snd_sof_amd_rembrandt(E) snd_sof_amd_renoir(E) snd_sof_amd_acp(E) snd_sof_pci(E) snd_sof_xtensa_dsp(E) snd_sof(E) snd_hda_codec_alc269(E) snd_sof_utils(E) snd_hda_scodec_component(E) snd_pci_ps(E) leds_cros_ec(E) snd_hda_codec_realtek_lib(E) amd_atl(E) intel_rapl_msr(E) snd_soc_acpi_amd_match(E) cros_ec_sysfs(E) gpio_cros_ec(E) cros_ec_debugfs(E) cros_charge_control(E) cros_ec_chardev(E) led_class_multicolor(E)
[  459.993776]  cros_kbd_led_backlight(E) cros_ec_hwmon(E) intel_rapl_common(E) snd_amd_sdw_acpi(E) snd_hda_codec_generic(E) soundwire_amd(E) snd_hda_codec_atihdmi(E) soundwire_generic_allocation(E) snd_hda_codec_hdmi(E) soundwire_bus(E) snd_soc_sdca(E) snd_hda_intel(E) amdgpu(E) spd5118(E) cros_ec_dev(E) snd_soc_core(E) snd_hda_codec(E) snd_compress(E) snd_usb_audio(E) snd_hda_core(E) ac97_bus(E) snd_intel_dspcfg(E) snd_pcm_dmaengine(E) snd_usbmidi_lib(E) snd_intel_sdw_acpi(E) edac_mce_amd(E) amdxcp(E) snd_hwdep(E) uvcvideo(E) rtw88_8822be(E) snd_rpl_pci_acp6x(E) rtw88_8822b(E) snd_ump(E) drm_panel_backlight_quirks(E) videobuf2_vmalloc(E) uvc(E) drm_buddy(E) btusb(E) snd_acp_pci(E) rtw88_pci(E) snd_seq_midi(E) videobuf2_memops(E) drm_ttm_helper(E) snd_amd_acpi_mach(E) snd_seq_midi_event(E) rtw88_core(E) videobuf2_v4l2(E) btmtk(E) kvm_amd(E) ttm(E) hid_sensor_als(E) snd_acp_legacy_common(E) btrtl(E) snd_rawmidi(E) drm_exec(E) snd_pci_acp6x(E) btbcm(E) hid_sensor_trigger(E) videobuf2_common(E) drm_suballoc_helper(E)
[  459.993808]  industrialio_triggered_buffer(E) btintel(E) kvm(E) mac80211(E) snd_seq(E) snd_pci_acp5x(E) drm_display_helper(E) kfifo_buf(E) snd_pcm(E) videodev(E) irqbypass(E) snd_rn_pci_acp3x(E) hid_sensor_iio_common(E) i2c_piix4(E) cec(E) snd_seq_device(E) snd_acp_config(E) rapl(E) industrialio(E) wmi_bmof(E) bluetooth(E) mc(E) k10temp(E) snd_timer(E) i2c_smbus(E) snd_soc_acpi(E) snd(E) amdxdna(E) rc_core(E) amd_pmf(E) cfg80211(E) snd_pci_acp3x(E) i2c_algo_bit(E) gpu_sched(E) libarc4(E) amdtee(E) soundcore(E) ccp(E) amd_sfh(E) joydev(E) tee(E) platform_profile(E) cros_ec_lpcs(E) cros_ec(E) input_leds(E) amd_pmc(E) cros_ec_proto(E) mac_hid(E) serio_raw(E) sch_fq_codel(E) overlay(E) iptable_filter(E) ip6table_filter(E) ip6_tables(E) br_netfilter(E) bridge(E) stp(E) llc(E) arp_tables(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E) efi_pstore(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) autofs4(E) cdc_ether(E) usbnet(E) uas(E) r8152(E) usb_storage(E) mii(E) dm_crypt(E) usbhid(E) nvme(E) nvme_core(E)
[  459.993849]  ucsi_acpi(E) nvme_keyring(E) typec_ucsi(E) hid_sensor_hub(E) hid_multitouch(E) typec(E) video(E) polyval_clmulni(E) nvme_auth(E) hid_generic(E) ghash_clmulni_intel(E) hkdf(E) thunderbolt(E) i2c_hid_acpi(E) i2c_hid(E) wmi(E) hid(E) aesni_intel(E)
[  459.993861] CPU: 23 UID: 998 PID: 4791 Comm: ollama Tainted: G      D     E       6.18.0 #6 PREEMPT(voluntary) 
[  459.993865] Tainted: [D]=DIE, [E]=UNSIGNED_MODULE
[  459.993866] Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, BIOS 03.03 03/10/2025
[  459.993868] Call Trace:
[  459.993870]  <TASK>
[  459.993872]  dump_stack_lvl+0x5f/0x90
[  459.993879]  dump_stack+0x10/0x18
[  459.993880]  __schedule_bug.cold+0x46/0x62
[  459.993883]  __schedule+0x11c3/0x1700
[  459.993886]  ? vprintk+0x18/0x50
[  459.993891]  do_task_dead+0x4d/0xb0
[  459.993893]  make_task_dead.cold+0x107/0x113
[  459.993897]  rewind_stack_and_make_dead+0x16/0x20
[  459.993899] RIP: 0033:0x59923b148803
[  459.993902] Code: Unable to access opcode bytes at 0x59923b1487d9.
[  459.993902] RSP: 002b:000074782f5feb28 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[  459.993904] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 000059923b148803
[  459.993905] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c000330148
[  459.993906] RBP: 000074782f5feb70 R08: 0000000000000000 R09: 0000000000000000
[  459.993907] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
[  459.993907] R13: 0000000000000001 R14: 000000c0003321c0 R15: 0000000000000007
[  459.993910]  </TASK>

Update:

Actually, the post-disconnect log already showed problems with the qcm fence:

[  136.767360] kfd kfd: amdgpu: Terminating queues for process 4757 on unplugged device
[  145.766275] amdgpu 0000:05:00.0: amdgpu: qcm fence wait loop timeout expired
[  145.766282] amdgpu 0000:05:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
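
In case it helps anyone else reproducing this, here is a rough sketch of how these lines could be pulled out of a longer kernel log. The function name `kfd_unplug_events` is made up for illustration, and the patterns are copied from the messages quoted above, so they may need adjusting on other kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: keep only the KFD/amdgpu messages seen around the unplug.
# Reads the files given as arguments, or stdin when called without arguments.
kfd_unplug_events() {
    grep -E 'Terminating queues for process|qcm fence wait loop timeout|unsuccessful queues preemption' "$@"
}
```

Typical use would be something like `sudo dmesg | kfd_unplug_events`, or passing a saved log file as the argument.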

Did ollama get killed on unplug? Or was it still “running”?
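
One way to check, as a minimal sketch: list the PIDs that still have `/dev/kfd` open by walking the file descriptor symlinks under `/proc`. The function name `kfd_holders` and the optional procfs-root argument are assumptions for illustration (the argument exists only to make the function easy to test). It needs to run as root to see file descriptors of processes owned by other users.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs that currently hold /dev/kfd open.
# $1 is the procfs root; it defaults to /proc.
kfd_holders() {
    proc_root="${1:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # readlink fails for fds we may not inspect or that already vanished; skip them.
        target=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$target" = "/dev/kfd" ]; then
            pid=${fd#"$proc_root"/}   # strip the procfs root prefix
            pid=${pid%%/*}            # keep only the leading PID component
            printf '%s\n' "$pid"
        fi
    done | sort -u
}
```

If a PID shows up, `ps -o pid,stat,comm -p <pid>` shows its state: `D` means uninterruptible sleep and `Z` means a zombie, either of which would suggest the process did not exit cleanly on unplug.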