Edit: Happened again within 5 minutes. Switching to the full-card view in Google Photos seems to trigger it quite aggressively. I'll look into rocm-gdb. I have no idea how to use it yet, but if I figure it out I may be able to gather more data.
Edit2: Or maybe not… rocm-gdb appears to be a tool for debugging ROCm HIP kernels, unless I'm missing something.
Edit3: Stability is awful: a hang every 5-10 minutes. I rolled back the kernel to 6.18.3 to see whether this is a matter of my current workload or the kernel itself. linux-firmware also got updated in the meantime, but I assume the new release contains all the up-to-date firmware.
Edit4: The same thing is happening with 6.18.3. It may just be an impression, but it seems to trigger much faster on 6.18.4 - I managed to make it happen 3 times in a row within a 15-minute window, whereas it took me quite a while on 6.18.3 under a similar workload.
Happened again, so no, things are not resolved. This was while running the 6.18.4-arch kernel, browsing Google Photos with Brave.
The monitors didn't recover and remained black/disabled. This time I let it sit for a while before rebooting and took a look remotely. The hang log doesn't stand out, but this time I also caught hung tasks. I preserved the coredump and can provide it if needed.
[155552.381837] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32788)
[155552.381844] amdgpu 0000:c1:00.0: amdgpu: Process brave pid 3237 thread brave:cs0 pid 3262
[155552.381846] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x000000003f800000 from client 10
[155552.381848] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601430
[155552.381849] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[155552.381850] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x0
[155552.381851] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
[155552.381852] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[155552.381852] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[155552.381853] amdgpu 0000:c1:00.0: amdgpu: RW: 0x0
[155562.733993] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[155562.734971] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[155562.735061] amdgpu 0000:c1:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[155562.735063] amdgpu 0000:c1:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[155562.735064] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=5790839, emitted seq=5790841
[155562.735066] amdgpu 0000:c1:00.0: amdgpu: Process brave pid 3237 thread brave:cs0 pid 3262
[155562.735068] amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[155564.738873] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[155564.738884] amdgpu 0000:c1:00.0: amdgpu: failed to reset legacy queue
[155564.738886] amdgpu 0000:c1:00.0: amdgpu: reset via MES failed and try pipe reset -110
[155564.738888] amdgpu 0000:c1:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
[155564.738889] amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[155564.738891] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!. Source: 1
[155566.887593] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[155566.887599] amdgpu 0000:c1:00.0: amdgpu: failed to unmap legacy queue
[155567.076743] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[155567.078035] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[155567.104173] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[155567.104310] [drm] PCIE GART of 512M enabled (table at 0x000000801FB00000).
[155567.104324] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[155567.107817] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[155567.116905] amdgpu 0000:c1:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003600
[155567.123374] thunderbolt 0000:c3:00.6: 0: failed to allocate DP resource for port 7
[155577.582467] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155579.172466] thunderbolt 0000:c3:00.6: 0:6 <-> 702:10 (DP): not active, tearing down
[155587.822561] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155598.062430] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155608.302723] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155618.542853] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155628.783083] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155639.023064] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155649.263201] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
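For anyone hitting the same issue: this is roughly how I preserved the devcoredump mentioned in the log above before rebooting. The `card1` index comes from the dmesg line; it may differ on other systems. A sketch, assuming the standard devcoredump sysfs interface:

```shell
# Save the GPU coredump before rebooting -- the node disappears on reboot
# (and devcoredumps also expire on their own after a few minutes).
# card1 matches the path printed in dmesg above; adjust if needed.
cat /sys/class/drm/card1/device/devcoredump/data > amdgpu-devcoredump.bin

# Once saved, writing anything to the data node discards the dump.
echo 1 > /sys/class/drm/card1/device/devcoredump/data
```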
And the hung-task logs. A few kworkers are hanging, but all the traces are the same:
[155725.040900] INFO: task kworker/9:2:70120 blocked for more than 122 seconds.
[155725.040909] Tainted: G W OE 6.18.4-arch1-1 #1
[155725.040911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[155725.040912] task:kworker/9:2 state:D stack:0 pid:70120 tgid:70120 ppid:2 task_flags:0x4208060 flags:0x00080000
[155725.040918] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[155725.041145] Call Trace:
[155725.041146] <TASK>
[155725.041150] __schedule+0x418/0x1320
[155725.041159] ? ttwu_queue_wakelist+0xfe/0x120
[155725.041164] schedule+0x27/0xd0
[155725.041166] schedule_timeout+0xbd/0x100
[155725.041170] dma_fence_default_wait+0x196/0x270
[155725.041175] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[155725.041176] dma_fence_wait_timeout+0x129/0x150
[155725.041178] amdgpu_tlb_fence_work+0x2c/0xe0 [amdgpu 6422097874d6b256c402231ccda3be13871c9e72]
[155725.041274] process_one_work+0x193/0x350
[155725.041279] worker_thread+0x2d7/0x410
[155725.041281] ? __pfx_worker_thread+0x10/0x10
[155725.041282] kthread+0xfc/0x240
[155725.041285] ? __pfx_kthread+0x10/0x10
[155725.041286] ? __pfx_kthread+0x10/0x10
[155725.041286] ret_from_fork+0x1c2/0x1f0
[155725.041291] ? __pfx_kthread+0x10/0x10
[155725.041292] ret_from_fork_asm+0x1a/0x30
[155725.041297] </TASK>
I've seen some TLB fence changes in 6.18.4, not sure how related these are. Before this event I'd been using 6.18.3 for quite a while without issues, but gosh, I honestly feel that I was just lucky… /sad face/
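Next time it hangs I'll try to dump all blocked tasks over SSH with sysrq, to catch more than the one trace the hung-task detector printed. A sketch, assuming sysrq is available (the default mask may not allow the `w` function, hence the first line):

```shell
# Allow all sysrq functions (or use a mask that includes 'w')
echo 1 | sudo tee /proc/sys/kernel/sysrq

# Dump all blocked (D-state) tasks to the kernel log, then read the result
echo w | sudo tee /proc/sysrq-trigger
sudo dmesg | tail -n 200
```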