Outstanding problems with LG monitor Thunderbolt stability / GPU hangs

Thread summary

Since this post is evolving quickly, here is a short tl;dr of my findings so far:

  1. The LG monitor’s Thunderbolt port is unstable and this is not GPU related - it works well when daisy-chained via a TB dock
  2. After the flickering/instability got fixed, I started to get GPU page faults
    a) I’m using Arch - the distro failed to pull the firmware fix for Strix Point
    b) @Mario_Limonciello made me realize that for some people it may not be obvious that the initramfs has to be regenerated after updating firmware; keep this in mind while trying to remediate the problem
    c) It’s always good to verify the running firmware version by inspecting sysfs (helpful commands below)
  3. Still getting hangs on the older MES firmware (0x80); I’m currently experimenting with the amdgpu.cwsr_enable=0 kernel command-line parameter
    a) didn’t help, problem still occurs
    b) apparently this may not be the best course of action for Strix Point; if the problem reoccurs I’ll attempt to debug it more properly
  4. Attempted to decrease the VRAM assigned to the iGPU to the default 512MB and rely on GTT + decreased refresh rate, just to change something; didn’t help
  5. Plugged the monitor in via DP instead of Thunderbolt, didn’t help either
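
Regarding point 2b, here is a minimal sketch of how one could check whether the initramfs was built before the firmware files were installed. The paths are assumptions (Arch defaults for the stock `linux` kernel) and `initramfs_is_stale` is a hypothetical helper name. It compares the install time (ctime) of each firmware file against the modification time of the image, which is only a heuristic.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when at least one file under the
# firmware directory was installed after the initramfs image was last written,
# i.e. the image most likely still contains the old firmware.
initramfs_is_stale() {
    image=$1
    fw_dir=$2
    # -cnewer compares each file's status-change time (set when the package
    # manager writes the file) with the image's modification time.
    # -quit stops at the first match.
    [ -n "$(find "$fw_dir" -type f -cnewer "$image" -print -quit)" ]
}

# Assumed Arch defaults; override via the environment for other setups.
IMAGE=${IMAGE:-/boot/initramfs-linux.img}
FW_DIR=${FW_DIR:-/usr/lib/firmware/amdgpu}

if [ -e "$IMAGE" ] && [ -d "$FW_DIR" ]; then
    if initramfs_is_stale "$IMAGE" "$FW_DIR"; then
        echo "firmware was installed after the initramfs was built; regenerate it"
    else
        echo "initramfs was built after the last firmware install"
    fi
fi
```

On Arch the image can be regenerated with `mkinitcpio -P`; other distros use their own tooling.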

Checking running firmware version

A big gun for all the AMD cards in the system:

# grep . /sys/module/amdgpu/drivers/pci\:amdgpu/*/fw_version/*

Alternatively, with nicer formatting:

# grep . /sys/kernel/debug/dri/0000:*/amdgpu_firmware_info

The wildcards can be replaced with the PCIe address of a specific device.
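
To pull out only the MES firmware version (the 0x80 vs 0x83 question from this thread), something like the sketch below may help. It assumes the debugfs file contains a line of the form `MES feature version: N, firmware version: 0x...`; that format is an assumption based on my own output, and `mes_fw_version` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads amdgpu_firmware_info-style text on stdin and
# prints the firmware version field of the MES line (e.g. 0x00000080).
# Only lines whose first word is exactly "MES" match, so MES_KIQ is skipped.
mes_fw_version() {
    awk '$1 == "MES" && /firmware version:/ { print $NF }'
}

# Reading debugfs needs root; unreadable or non-matching paths are skipped.
for f in /sys/kernel/debug/dri/0000:*/amdgpu_firmware_info; do
    if [ -r "$f" ]; then
        printf '%s: %s\n' "$f" "$(mes_fw_version < "$f")"
    fi
done
```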


Original post

I’m starting this exploratory thread as I still have no full understanding of the problem and its scope. Curious if anyone else has similar observations.

TL;DR
After swapping the motherboard for a new one with Ryzen AI 370, using my LG monitor w/ Thunderbolt (or USB3 mode, as it also supports it) is almost impossible due to frequent connection drops / reconnections, even though it worked perfectly fine on the previous generation of the motherboard.

What doesn’t work

After the motherboard upgrade I immediately noticed different behavior when connecting an external screen (LG 38WN95C-W). My first impression was that connecting goes in two phases: the device is enumerated, then it disconnects for 1s and reconnects. It was stable for a few seconds and then went into disconnect/reconnect flapping for good, and I could break the loop only by disconnecting the screen completely. Even though nothing changed in my setup, I tried different cables and got the most bizarre results:

  • the cable I had used so far resulted in reproducible flapping
  • a short, passive TB4 (certified!) cable seemed to work, but PD negotiated only 60W instead of the expected 94W
  • an active TB4 cable finally worked as expected and also seemed stable

I continued to dig and attempted multiple things to rule out software problems: bumped the kernel to the newest version, made sure that I run the newest Linux firmware, and made sure that the laptop firmware is up to date as well. Then I shut the laptop down and restarted it. At this point all the cables I tried started to work.

What works, so far

This morning I attempted to use my setup for a longer time and all of the problems came back. I couldn’t get a stable session for longer than 1 minute. Up to this point I had been connecting only to the ports on the left side of the device (both Thunderbolt and USB3) with the same result, so I tried the right side with the long, active cable. I heard the disconnect/reconnect sounds 4-5 times, but then it clicked and has been stable for ~10-15 minutes so far. It’s not my preferred side of the device though, and frankly I’m only waiting for the problems to reappear.

My thoughts so far / observations

There are a few things worth mentioning:

  • the charging status LED also blinks during the “flapping”
  • when booting the laptop with the monitor connected, the problems start only after starting the windowed session, so after loading the amdgpu driver
  • unloading the ucsi_acpi module didn’t help (I suspected some kernel-PD interaction for a while)
  • I’m using the newest available kernel (6.18.2, also the zen variant available in the Arch repository), which was reported to be problematic; I haven’t tried an older one
  • lastly, it’s worth mentioning that I had problems with this monitor in the past, on a certain ThinkPad model, so it also falls into the suspects bucket

None of the above explains why the connection seems more stable on the right-side port of the laptop, nor why it worked just fine on the older generation of the motherboard.

I’ll continue my observations and experiment a bit with the kernels. I’ll also update this thread once I have more data.

And it exploded on the right-side port now, exactly at the moment when I was attaching a USB-C device to the middle port. It kept flapping over and over until the moment I pressed something on the laptop keyboard (!). It seemed surreal, but it was so weirdly correlated that I’m willing to believe it somehow “unstuck” it…

Snippet of EC logs:

[90148.454900 cypd_write_reg8_wait_ack timeout on interrupt]
[90148.456100 cypd_cfet_vbus_control:0 fail:5]
[90148.462000 cypd_write_reg8_wait_ack C:0 0x2032 response 0x0]
[90148.465000 cypd_cfet_vbus_control:1 fail:5]
[90148.477700 cypd_write_reg8_wait_ack C:1 0x2032 response 0x0]
[90148.479600 cypd_cfet_vbus_control:3 fail:5]
[90148.481600 AC off]
[90148.633400 PMF: SPL 20000mW, sPPT 20000mW, fPPT 30000mW, p3T 60000mW, ao_sppt 0mW]
[90148.637600 events = 0, pre_events = 2]
[90148.638600 set AP throttling type 1 to off (0x00000000)]
[90148.639900 event set 0x0000000000000010]
[90148.651500 Updating charger with EPR correction: ma 490, 3lvl_buck ma 490]
[90148.656500 CL: p-1 s-1 i500 v0]
[90148.657600 event set 0x0100000000000000]
[90148.669200 PMF: SPL 20000mW, sPPT 20000mW, fPPT 30000mW, p3T 60000mW, ao_sppt 0mW]
[90148.666200 3lv-buck update! V:0mV,W:0mW]
PORT80: AA8E
[90148.735000 Battery 98% (Display 100.0 %) / 6h:0 to empty, not accepting current]
PORT80: 0008
[90149.301600 Battery 98% (Display 97.9 %) / 5h:58 to empty, not accepting current]
[90149.305200 event set 0x0100000000000000]
PORT80: AA8F
[90149.416600 CYPD_RESPONSE_PORT_CONNECT 0]
[90149.421300 board_set_active_charge_port port 0, prev:-1]
[90149.439300 cypd_write_reg8_wait_ack C:1 0x1032 response 0x0]
[90149.441900 cypd_cfet_vbus_control:2 fail:5]
[90149.448400 cypd_write_reg8_wait_ack C:1 0x2032 response 0x0]
[90149.450600 cypd_cfet_vbus_control:3 fail:5]
[90149.452200 Updating charger with EPR correction: ma 1470, 3lvl_buck ma 1470]
[90149.461300 CL: p0 s1 i1500 v5000]
[90149.463600 AC on]
[90149.489200 event set 0x0000000000000008]
[90149.498200 board_set_active_charge_port port 0, prev:0]
[90149.500400 cypd_write_reg8_wait_ack pre 0x2 ]
[90149.504500 3lv-buck update! V:5000mV,W:7500mW]
[90149.577800 Battery 98% (Display 97.9 %) / 5h:58 to empty, not accepting current]
[90149.612000 3Level-Buck is PTM mode]
PORT80: 0020
PORT80: 3F30
PORT80: 0020
[90149.688000 cypd_write_reg8_wait_ack pre 0x2 ]
[90149.706500 cypd_write_reg8_wait_ack pre 0x2 ]
[90149.711200 CCG_RESPONSE_ACCEPT_MSG_RX 0]
[90149.712700 Updating charger with EPR correction: ma 490, 3lvl_buck ma 400]
[90149.708400 Battery 98% (Display 100.0 %) / 5h:58 to empty, not accepting current]
[90149.716400 CL: p0 s0 i500 v5000 (forced)]
[90149.784600 CYPD_RESPONSE_PD_CONTRACT_NEGOTIATION_COMPLETE 0]
Port:0 Unknown data type: 0x03 Hdr:0x8b83 ExtHdr:0x0001 Data:0x00
[90149.789500 board_set_active_charge_port port 0, prev:0]
Port:0 Unknown data type: 0x04 Hdr:0x8d84 ExtHdr:0x0001 Data:0x00
Port:0 Unknown data type: 0x06 Hdr:0x8186 ExtHdr:0x0002 Data:0x0000
[90149.801800 cypd_write_reg8_wait_ack pre 0x2 ]
[90149.816500 CCG_RESPONSE_VDM_RX]
[90149.824100 CCG_RESPONSE_ACCEPT_MSG_RX 0]
[90149.816200 event set 0x0100000000000000]
[90149.840500 Updating charger with EPR correction: ma 4606, 3lvl_buck ma 4606]
[90149.845000 CL: p0 s0 i4700 v20000]
[90149.848000 board_set_active_charge_port port 0, prev:0]
[90149.854400 cypd_write_reg8_wait_ack C:0 0x1032 response 0x0]
[90149.856100 cypd_cfet_vbus_control:0 fail:5]
[90149.857300 3lv-buck update! V:20000mV,W:94000mW]
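
To get a quick overview of a longer EC log, a small filter like the one below could count the charger drop-outs and list the power levels the port settled on. It is only a sketch that assumes the line formats seen in the snippet above (`AC off]`, `AC on]` and `3lv-buck update! V:...mV,W:...mW`); `ec_flap_summary` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: summarises an EC console log read from stdin.
# Prints every power level reported by "3lv-buck update!" lines (in mW),
# followed by the number of "AC off" and "AC on" events.
ec_flap_summary() {
    awk '
        / AC off\]/ { off++ }
        / AC on\]/  { on++ }
        /3lv-buck update!/ {
            # The field looks like "W:94000mW"; strip the "W:" and "mW" parts.
            if (match($0, /W:[0-9]+mW/)) {
                printf "power: %d mW\n", substr($0, RSTART + 2, RLENGTH - 4)
            }
        }
        END { printf "AC off: %d, AC on: %d\n", off, on }
    '
}
```

Usage would be `ec_flap_summary < ec.log`, where `ec.log` is whatever file the EC console output was saved to.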

Now that’s interesting. I ran a Steam game on my Thunderbolt eGPU for ~1h to check the port stability and everything was just fine. Not a single glitch.

I guess that the culprit here is the LG monitor and its port after all. As mentioned in one of the previous posts, it was acting up on another machine anyway. I’ve also seen another thread on this forum with someone complaining about an LG Thunderbolt monitor.

It’s a pity that it worked on the previous generation and refuses to work on the current one though.

I’ll keep this thread maintained for a while to check if anyone else shares my rough experience.

This is very similar to, and probably the same as, the issue I am having. I don’t believe it’s the monitor DISPLAY so much as it is the Thunderbolt hub/controller inside it.

I have a Thunderbolt 5 hub at my desk that is primarily for my work laptop and secondarily my FW16. I don’t use TB itself for carrying the display signal and instead have a USB-C to DisplayPort adapter. It worked flawlessly with my previous mainboard.

Now, however, it does not. If I take that USB-C to DP cable and plug it directly into the FW16, it works fine. However once it gets attached to a Thunderbolt controller, the clock is ticking until the entire thing disconnects and reconnects.


If you haven’t looked already; I would look for a firmware update for the monitor.


Good call. I used to check it regularly, but indeed haven’t done so in a few months. It requires a Windows laptop, so it always takes me a while to find a box that I could use. The last check was merely a few months ago, whereas I’ve had the monitor for way over 2 years, so I don’t believe anything changed here.

Also mind the irony - updating the firmware requires a stable USB connection, which is exactly the problem.

Interesting fact - after buying the monitor, when it was acting up on my old ThinkPad, I requested service. The motherboard got replaced, but the problem persisted.

@Mario_Limonciello on another note, I’m experiencing regular GPU hangs when connected directly via DP (well, it’s a DP→USB-C adapter, so I guess alternate mode kicks in). One even happened while I was writing this post. I’ve seen the reports about problematic firmware/kernel combinations, but I have the newest of both installed. Could it still be related? I’m particularly looking at this. The grim thing is that the recovery/reset is only partial - my external screen works, but the laptop screen remains frozen. If I attempt to change any of the screen-related settings I end up w/ a permanent freeze on both and I’m forced to power cycle.

With this state the only viable action I can think of is to go back to my old motherboard… :frowning:

[23271.469659] ------------[ cut here ]------------
[23271.469665] WARNING: CPU: 3 PID: 10961 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_replay.c:90 dmub_replay_enable+0xf2/0x160 [amdgpu]
[23271.469984] Modules linked in: ccm snd_seq_dummy rfcomm snd_hrtimer snd_seq snd_seq_device uhid cmac algif_hash algif_skcipher af_alg bnep vfat fat snd_acp_legacy_mach snd_acp_mach snd_soc_nau8821 snd_acp3x_rn snd_acp70 snd_acp_i2s snd_acp_pdm snd_soc_dmic snd_acp_pcm snd_sof_amd_acp70 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_pci_ps snd_soc_acpi_amd_match snd_amd_sdw_acpi mt7921e soundwire_amd mt7921_common soundwire_generic_allocation mt792x_lib amd_atl soundwire_bus intel_rapl_msr mt76_connac_lib intel_rapl_common snd_hda_codec_alc269 snd_soc_sdca hid_sensor_als snd_hda_scodec_component snd_soc_core mt76 hid_sensor_trigger snd_hda_codec_realtek_lib snd_hda_codec_atihdmi industrialio_triggered_buffer snd_compress snd_hda_codec_generic snd_hda_codec_hdmi kfifo_buf snd_hda_intel ac97_bus hid_sensor_iio_common snd_pcm_dmaengine snd_hda_codec mac80211 btusb industrialio snd_rpl_pci_acp6x btmtk snd_acp_pci
[23271.470023]  snd_hda_core uvcvideo snd_amd_acpi_mach btrtl snd_intel_dspcfg btbcm snd_acp_legacy_common videobuf2_vmalloc snd_intel_sdw_acpi kvm_amd btintel uvc snd_pci_acp6x snd_hwdep videobuf2_memops spd5118 joydev mousedev cfg80211 bluetooth cdc_acm snd_pci_acp5x snd_pcm videobuf2_v4l2 cros_ec_hwmon cros_ec_debugfs sp5100_tco leds_cros_ec videobuf2_common cros_ec_chardev ucsi_acpi kvm snd_rn_pci_acp3x cros_ec_sysfs led_class_multicolor typec_ucsi snd_timer cros_charge_control gpio_cros_ec videodev snd_acp_config typec snd_soc_acpi snd rfkill hid_multitouch irqbypass i2c_piix4 hid_sensor_hub roles mc cros_ec_dev rapl pcspkr wmi_bmof amd_pmf amdxdna k10temp soundcore snd_pci_acp3x libarc4 i2c_smbus thunderbolt amdtee cros_ec_lpcs i2c_hid_acpi cros_ec i2c_hid amd_sfh platform_profile cros_ec_proto amd_pmc 8250_dw mac_hid i2c_dev crypto_user ntsync acpi_call(OE) nfnetlink dm_crypt encrypted_keys trusted asn1_encoder tee dm_mod amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm nvme drm_exec drm_panel_backlight_quirks
[23271.470069]  gpu_sched nvme_core drm_suballoc_helper drm_buddy polyval_clmulni nvme_keyring ghash_clmulni_intel aesni_intel drm_display_helper nvme_auth video cec ccp hkdf wmi
[23271.470082] CPU: 3 UID: 0 PID: 10961 Comm: kworker/u96:1 Tainted: G           OE       6.18.2-zen2-1-zen #1 PREEMPT(full)  817688afc19ca15a22737742591535351aba70f8
[23271.470087] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[23271.470088] Hardware name: Framework Laptop 16 (AMD Ryzen AI 300 Series)/FRANMHCP09, BIOS 03.04 11/06/2025
[23271.470091] Workqueue: dm_vblank_control_workqueue amdgpu_dm_crtc_vblank_control_worker [amdgpu]
[23271.470378] RIP: 0010:dmub_replay_enable+0xf2/0x160 [amdgpu]
[23271.470579] Code: 00 00 00 3d ff 00 00 00 74 c7 45 84 f6 74 66 85 c0 75 57 bf ac c4 20 00 41 83 c5 01 e8 17 0e b2 e8 41 81 fd e9 03 00 00 75 a5 <0f> 0b 48 8b 44 24 48 65 48 2b 05 17 a5 a1 ea 75 55 48 83 c4 50 5b
[23271.470580] RSP: 0018:ffffc9b10f1cfd18 EFLAGS: 00010246
[23271.470582] RAX: 00002a4505724eac RBX: 0000000000000001 RCX: 0000000000000003
[23271.470583] RDX: 00000000000f3ea8 RSI: 00000000000f3ac3 RDI: 00002a4505631004
[23271.470584] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000036aa
[23271.470584] R10: 00000000000036b8 R11: ffff896f81073be0 R12: ffff896f81073be0
[23271.470585] R13: 00000000000003e9 R14: 0000000000000001 R15: 0000000000000000
[23271.470586] FS:  0000000000000000(0000) GS:ffff897d1a36f000(0000) knlGS:0000000000000000
[23271.470587] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23271.470587] CR2: 0000125c01a6d000 CR3: 000000010b170000 CR4: 0000000000f50ef0
[23271.470588] PKRU: 55555554
[23271.470590] Call Trace:
[23271.470594]  <TASK>
[23271.470596]  edp_set_replay_allow_active+0x17b/0x1d0 [amdgpu 0b518e18b5e767aa3221d10620eaa35a3fe1cfbd]
[23271.470787]  amdgpu_dm_replay_enable+0xcc/0x100 [amdgpu 0b518e18b5e767aa3221d10620eaa35a3fe1cfbd]
[23271.470972]  amdgpu_dm_crtc_vblank_control_worker+0xf9/0x2c0 [amdgpu 0b518e18b5e767aa3221d10620eaa35a3fe1cfbd]
[23271.471134]  process_one_work+0x193/0x350
[23271.471141]  worker_thread+0x254/0x3a0
[23271.471143]  ? __pfx_worker_thread+0x10/0x10
[23271.471145]  kthread+0xfc/0x240
[23271.471149]  ? schedule_tail+0xa0/0x360
[23271.471153]  ? __pfx_kthread+0x10/0x10
[23271.471155]  ret_from_fork+0x1c2/0x1f0
[23271.471160]  ? __pfx_kthread+0x10/0x10
[23271.471162]  ret_from_fork_asm+0x1a/0x30
[23271.471168]  </TASK>
[23271.471169] ---[ end trace 0000000000000000 ]---
[23272.838370] cros-ec-dev cros-ec-dev.1.auto: Some logs may have been dropped...
[23352.267671] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32787)
[23352.267681] amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 2965 thread brave:cs0 pid 2990
[23352.267683] amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
[23352.267686] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701430
[23352.267687] amdgpu 0000:c1:00.0: amdgpu:      Faulty UTCL2 client ID: SQC (data) (0xa)
[23352.267689] amdgpu 0000:c1:00.0: amdgpu:      MORE_FAULTS: 0x0
[23352.267690] amdgpu 0000:c1:00.0: amdgpu:      WALKER_ERROR: 0x0
[23352.267691] amdgpu 0000:c1:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[23352.267691] amdgpu 0000:c1:00.0: amdgpu:      MAPPING_ERROR: 0x0
[23352.267692] amdgpu 0000:c1:00.0: amdgpu:      RW: 0x0
[23362.436496] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[23362.437831] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[23362.437944] amdgpu 0000:c1:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[23362.437946] amdgpu 0000:c1:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[23362.437948] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=253330, emitted seq=253332
[23362.437950] amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 2965 thread brave:cs0 pid 2990
[23362.437953] amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[23364.441760] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[23364.441771] amdgpu 0000:c1:00.0: amdgpu: failed to reset legacy queue
[23364.441773] amdgpu 0000:c1:00.0: amdgpu: reset via MES failed and try pipe reset -110
[23364.441775] amdgpu 0000:c1:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
[23364.441776] amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[23364.441780] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!. Source:  1
[23364.833264] amdgpu 0000:c1:00.0: amdgpu: Register(0) [regVPEC_QUEUE_RESET_REQ] failed to reach value 0x00000000 != 0x00000001n
[23364.833272] amdgpu 0000:c1:00.0: amdgpu: VPE queue reset failed
[23366.838372] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[23366.838383] amdgpu 0000:c1:00.0: amdgpu: failed to unmap legacy queue
[23367.097984] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[23367.099335] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[23367.124630] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[23367.124952] [drm] PCIE GART of 512M enabled (table at 0x00000081FFB00000).
[23367.124996] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[23367.129359] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[23367.139416] amdgpu 0000:c1:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003500
[23367.384886] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[23367.384892] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[23367.384893] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[23367.384894] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[23367.384895] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[23367.384896] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[23367.384896] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[23367.384897] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[23367.384898] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[23367.384898] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[23367.384899] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[23367.384900] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 1 on hub 8
[23367.384900] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[23367.384901] amdgpu 0000:c1:00.0: amdgpu: ring vpe uses VM inv eng 4 on hub 8
[23368.453055] amdgpu 0000:c1:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vpe (-110).
[23368.453396] amdgpu 0000:c1:00.0: amdgpu: ib ring test failed (-110).
[23368.603584] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[23368.629214] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[23368.629469] [drm] PCIE GART of 512M enabled (table at 0x00000081FFB00000).
[23368.629524] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[23368.634450] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[23368.644348] amdgpu 0000:c1:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003500
[23369.323683] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[23369.323689] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[23369.323690] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[23369.323691] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[23369.323692] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[23369.323693] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[23369.323693] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[23369.323694] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[23369.323695] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[23369.323695] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[23369.323696] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[23369.323697] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 1 on hub 8
[23369.323697] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[23369.323698] amdgpu 0000:c1:00.0: amdgpu: ring vpe uses VM inv eng 4 on hub 8
[23369.327409] amdgpu 0000:c1:00.0: amdgpu: GPU reset(1) succeeded!
[23369.327417] amdgpu 0000:c1:00.0: [drm] device wedged, but recovered through reset
[23369.335327] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* Failed to initialize parser -125!
[23379.844603] amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:86:crtc-0] flip_done timed out
[23379.844603] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* [CRTC:86:crtc-0] hw_done or flip_done timed out

Can you please double check you have the latest upstream amdgpu firmware? There is a revert I want to make sure you picked up.

Absolutely no doubts about it. The only thing that comes to my mind that might have contributed to the wedges I experienced is the multiple display reconnections, as so far I haven’t encountered a hang without those.

Pardon me for taking the easy way here, but if you could tell me whether there are flags / params that I could enable to make the debugging easier, it would be awesome. I’m suffering from a chronic, parental lack of time, but would love to give my 5 cents to those investigations without recapping tons of posts from random sources.

My box:

$ yay -Si linux-firmware-amdgpu
Repository      : core
Name            : linux-firmware-amdgpu
Version         : 20251125-2
Description     : Firmware files for Linux - Firmware for AMD Radeon GPUs
Architecture    : any
URL             : https://gitlab.com/kernel-firmware/linux-firmware
Licenses        : LicenseRef-WHENCE  LicenseRef-amdgpu  MIT
Groups          : None
Provides        : None
Depends On      : linux-firmware-whence
Optional Deps   : None
Conflicts With  : None
Replaces        : None
Download Size   : 25.32 MiB
Installed Size  : 26.04 MiB
Packager        : Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Build Date      : Tue 02 Dec 2025 12:00:56 AM CET
Validated By    : SHA-256 Sum  Signature

Changelog:

Boom, it exploded again. This time no cable mangling, but I did start more apps - as previously I’ve been using only brave browser, this time I also had vscodium and kicad running. As I recall, each of the previous hangs also happened when those were running.

This time I was also running a kernel with those patches applied and the system never recovered. It just hung dead. The logs differ as well:

Dec 30 16:51:11 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:11 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:14 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:14 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:17 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:17 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:20 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:20 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:22 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:22 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:25 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:25 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:28 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:28 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:31 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:31 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:34 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:34 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:36 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:36 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:39 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:39 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:42 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:42 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:44 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:44 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:47 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:47 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:52 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
Dec 30 16:51:52 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:51:52 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reg_write_reg_wait
Dec 30 16:51:55 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:51:57 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:00 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:02 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:05 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:08 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:10 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:13 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
Dec 30 16:52:15 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES ring buffer is full.
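
Since these logs get very repetitive, a rough sketch for condensing them: count how often each MES message type failed. It assumes the kernel log wording shown above (`MES failed to respond to msg=...`); `mes_failure_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints one line per
# MES message type that went unanswered, most frequent first, prefixed by the
# number of occurrences.
mes_failure_summary() {
    sed -n 's/.*MES failed to respond to msg=\(.*\)$/\1/p' |
        sort | uniq -c | sort -rn
}
```

For the boot that hung, the input could come from `journalctl -k -b -1 | mes_failure_summary`, assuming a persistent journal is kept.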

Edit: it also manifested weirdly - first I noticed that the laptop screen was frozen (it was playing some web video); when I moved my screen over it the external screen also froze. Sound kept playing, and I was even able to blindly pause the video.
Edit2: I was still running the -zen kernel, trying the vanilla one now.

Ok, so I decided to go big(ger) and bought myself an HP Thunderbolt 4 Ultra docking station. Having the monitor daisy-chained through it works 100% fine. I also don’t experience any sort of rendering lockups / GPU page faults.

The solution is more pricey than I’d assumed, but at least I have plenty of ports now. This dock can also be easily recommended - it supports PD3.1 (up to 180W, which is perfect), native USB4 + updates via LVFS, so it’s even Linux friendly.

And I must take it back. The same type of hang happened again (the “MES failed to respond to msg” kind). It seems unrecoverable.

It’s followed by a bunch of i2c errors that come from powerdevil; I bet they are related to attempts to talk to the monitor to control its brightness. Nevertheless, this is the first time I see those.

Edit: Happened again, second time in 20 minutes. In both cases I was running a YouTube window on the laptop’s internal screen. I’m now attempting to run with MES disabled (amdgpu.mes=0 in the kernel command-line options).
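
When experimenting with options like amdgpu.mes=0 or amdgpu.cwsr_enable=0, it is worth confirming after reboot that they actually reached the kernel. A minimal sketch, with the hypothetical helper `cmdline_param`; the assumption is that the module also exposes both parameters under /sys/module/amdgpu/parameters/:

```shell
#!/bin/sh
# Hypothetical helper: prints the value of a kernel command-line parameter
# (e.g. amdgpu.mes) found in the given file (default /proc/cmdline).
# Prints nothing and returns 1 when the parameter is absent.
cmdline_param() {
    name=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | awk -F= -v n="$name" '
        $1 == n { print $2; found = 1 }
        END { exit !found }
    '
}

for p in amdgpu.mes amdgpu.cwsr_enable; do
    if v=$(cmdline_param "$p"); then
        echo "$p=$v (from /proc/cmdline)"
    else
        echo "$p is not set on the command line"
    fi
    # What the loaded module reports, if it exposes the parameter at all.
    sys=/sys/module/amdgpu/parameters/${p#amdgpu.}
    if [ -r "$sys" ]; then
        echo "  module reports: $(cat "$sys")"
    fi
done
```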

This time it is a page fault though:

Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32787)
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 4557 thread brave:cs0 pid 4585
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501430
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          MORE_FAULTS: 0x0
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          WALKER_ERROR: 0x0
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jan 02 11:16:54 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          RW: 0x0
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=251910, emitted seq=251912
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 4557 thread brave:cs0 pid 4585
Jan 02 11:17:04 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reset legacy queue
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: reset via MES failed and try pipe reset -110
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!. Source:  1
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Register(0) [regVPEC_QUEUE_RESET_REQ] failed to reach value 0x00000000 != 0x00000001n
Jan 02 11:17:06 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: VPE queue reset failed
Jan 02 11:17:08 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Jan 02 11:17:08 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to unmap legacy queue
Jan 02 11:17:08 fw16 kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
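
For comparing fault status words across hangs, the raw GCVM_L2_PROTECTION_FAULT_STATUS value can also be split by hand. A rough sketch only: the bit positions are an assumption inferred from the decoded lines in the log above (they reproduce those values), not taken from register documentation, and decode_fault_status is a made-up name.

```shell
# Split a raw GCVM_L2_PROTECTION_FAULT_STATUS value into fields.
# ASSUMPTION: the bit layout is inferred from the log above, where
# 0x00501430 is printed as PERMISSION_FAULTS 0x3, client ID 0xa, vmid 5.
decode_fault_status() {
    printf 'MORE_FAULTS: 0x%x\n'       $(( $1 & 0x1 ))
    printf 'WALKER_ERROR: 0x%x\n'      $(( ($1 >> 1) & 0x7 ))
    printf 'PERMISSION_FAULTS: 0x%x\n' $(( ($1 >> 4) & 0xf ))
    printf 'MAPPING_ERROR: 0x%x\n'     $(( ($1 >> 8) & 0x1 ))
    printf 'CLIENT_ID: 0x%x\n'         $(( ($1 >> 9) & 0x1ff ))
    printf 'RW: 0x%x\n'                $(( ($1 >> 18) & 0x1 ))
    printf 'VMID: %d\n'                $(( ($1 >> 20) & 0xf ))
}

# The status word from the log above.
decode_fault_status 0x00501430
```

For 0x00501430 this yields PERMISSION_FAULTS 0x3 and client ID 0xa, which matches what the driver printed.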

Edit 2: Possible explanation: ROCm/ROCm issue #5724 on GitHub, “amdgpu firmware (MES 0x83) causing GPU Hang / Memory access fault w/ Strix Halo”

@Mario_Limonciello thanks for your help and patience across so many complaints so far. I’m sadly joining the group that overly trusted the distro releases - Arch, similarly to Fedora, failed to pull all the reverts you landed. In fact it seems that they cherry picked only the ones for Strix Halo, not Point.

After a manual downgrade things seem to be stable. I’ll give it a bit more time to cook within my setup and will update the first post here with a very descriptive warning regarding the Arch firmware package.

And let me just continue my monologue here, as the problem reoccurred, it just took a while.

The scenario is the same: using brave, with kicad and vscodium running alongside. Freeze, then no signal to the monitors and no recovery. This time I’m definitely running the reverted firmware.

Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32790)
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 5445 thread brave:cs0 pid 5460
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x0000d00041400000 from client 10
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601430
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          MORE_FAULTS: 0x0
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          WALKER_ERROR: 0x0
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jan 03 22:13:23 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:          RW: 0x0
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=354229, emitted seq=354231
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 5445 thread brave:cs0 pid 5460
Jan 03 22:13:33 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reset legacy queue
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: reset via MES failed and try pipe reset -110
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!. Source:  1
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: Register(0) [regVPEC_QUEUE_RESET_REQ] failed to reach value 0x00000000 != 0x00000001n
Jan 03 22:13:35 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: VPE queue reset failed
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to unmap legacy queue
Jan 03 22:13:37 fw16 kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 03 22:13:37 fw16 kernel: [drm] PCIE GART of 512M enabled (table at 0x00000081FFB00000).
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
Jan 03 22:13:37 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003500
Jan 03 22:13:37 fw16 kernel: thunderbolt 0000:c3:00.6: 0: failed to allocate DP resource for port 7
Jan 03 22:13:48 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
Jan 03 22:13:49 fw16 kernel: thunderbolt 0000:c3:00.6: 0:6 <-> 702:10 (DP): not active, tearing down

And firmware versions:

asd_fw_version     0x21000104
dmcub_fw_version   0x09003500
imu_fw_version     0x0b332000
mec_fw_version     0x00000020
me_fw_version      0x00000020
mes_fw_version     0x00000080
mes_kiq_fw_version 0x0000006f
pfp_fw_version     0x0000002e
rlc_fw_version     0x11510546
sdma_fw_version    0x0000000e
smc_fw_version     0x0b5d0a00
vcn_fw_version     0x0911801b
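
To pull a single entry out of a listing like the one above, for example to double check which MES firmware is really running after a downgrade, something along these lines should do. It is a sketch that assumes the two-column "name value" layout shown above; fw_version_of is a hypothetical helper name.

```shell
# Hypothetical helper: print the version of one firmware component from a
# "name value" listing read on standard input.
fw_version_of() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Example with two lines copied from the listing above.
printf '%s\n' \
    'mes_fw_version     0x00000080' \
    'mes_kiq_fw_version 0x0000006f' | fw_version_of mes_fw_version
```

The match is on the exact name, so mes_fw_version does not also pick up mes_kiq_fw_version.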

FWIW I have a 7900xtx connected via Thunderbolt, which I presume may be making the internal driver state more complex. I’m about to update to 6.18.3 in case that makes any difference and will continue to observe.

Edit: brave seems to be the trigger in most cases. Some pages trigger it much faster than others. I just had two hangs in a row (within 6 minutes) while having a ChatGPT tab open. This time the 7900XTX was completely disabled, so it’s definitely not a contributing factor here.

Edit 2: Pulled whole amdgpu firmware directory from linux-firmware, DMCUB got a small bump. I bet it won’t make any difference though, wouldn’t make much sense. I’m also experimenting with amdgpu.cwsr_enable=0, as I’ve seen reports that it’s needed for MES 0x80.


Literally as I was writing the last updates it hung again, so it seems that none of the known remediations work for me now.

Anything electron-based seems to trigger the hang after a few moments of use. The laptop in this form is literally unusable.

There are no page fault logs from the last crash, probably because I power-cycled the laptop quickly and they didn’t get flushed in time. I’m done waiting for a recovery that never happens.

FWIW my current setup:

  • vanilla kernel 6.18.3
  • MES firmware 0x80 (pulled all the new firmware from the amdgpu/ folder of the linux-firmware repository)
  • amdgpu.cwsr_enable=0 in commandline
  • GPU configured to have 8GB of memory (I have no idea what’s causing the page fault, but it seems memory related)

I’m using internal laptop screen and external Thunderbolt screen attached via Thunderbolt4/USB4 docking station.

Can you please drop this? I don’t believe you should need it on the Strix model.


Removed. Is there any sense in experimenting with the memory assigned to the card? I currently have 8GB, which I assume differs from the default 0.5GB that lots of systems use. I don’t mind relying on GTT.

I don’t know the nature of this problem, nor am I experienced with debugging GPU page faults, but fiddling with its memory settings would be my intuitive next step.

Edit: So, following the debugging documentation, it seems the culprit is a shader attempting to access an invalid page in the card’s address space. Again, I have no idea how this works in the GPU world, but given that shaders are configured by the application / user, they may be buggy, and I’d expect the whole system to be immune to such faults. Unless this is just a symptom and it hung internally beyond recovery, which would explain why the reset doesn’t work.

In addition to my previous question, assuming we may be experiencing weird firmware problems, I’m wondering whether it makes sense to experiment with settings like vm_update_mode and move it to CPU?

So pre-emption is how you debug these kinds of problems (you would use something like ROCgdb). If you turn off CWSR then you can no longer pre-empt workloads. I am worried that turning off CWSR actually could be causing the issue, but let’s see.

Moving it to CPU will be a very expensive hit to performance, but sometimes helps with these types of problems too.


Almost 5 days, so far so good. There are two changes I introduced though:

  • dedicated vram set to default 512MB
  • decreased screen refresh rate from 144Hz to 120Hz

From those two I’d only expect the VRAM setting to be meaningful though.

Edit: Happened again within 5 minutes. Switching full card in Google Photos seems to be triggering it quite aggressively. I’ll check on rocm-gdb. Have no idea how to use it but if I succeed maybe I’ll gather more data.

Edit2: Or won’t… rocm-gdb seems like a tool to debug rocm hip kernels. Unless I’m missing something.

Edit3: Stability is awful. Hang every 5-10 minutes. I rolled back the kernel to 6.18.3 to see if this is a matter of my current workflow or the kernel itself. linux-firmware also got updated in the meantime, but I assume that new release should contain all the up-to-date stuff.

Edit4: The same thing is happening w/ 6.18.3. It may be just a wild impression, but it seems to trigger much faster on 6.18.4 - I managed to make it happen 3 times in a row within a 15-minute window, whereas it took me quite a while on 6.18.3. Similar workload.

Happened again, so no, things are not resolved. This was while running the 6.18.4-arch kernel, browsing Google Photos w/ Brave.

Monitors didn’t recover and remained black / disabled. This time I let it cook for a moment before reboot + took a look remotely. Hang log doesn’t stand out, but this time I also caught hung tasks. I preserved the coredump and can provide it if needed.

[155552.381837] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32788)
[155552.381844] amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 3237 thread brave:cs0 pid 3262
[155552.381846] amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
[155552.381848] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601430
[155552.381849] amdgpu 0000:c1:00.0: amdgpu:     Faulty UTCL2 client ID: SQC (data) (0xa)
[155552.381850] amdgpu 0000:c1:00.0: amdgpu:     MORE_FAULTS: 0x0
[155552.381851] amdgpu 0000:c1:00.0: amdgpu:     WALKER_ERROR: 0x0
[155552.381852] amdgpu 0000:c1:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[155552.381852] amdgpu 0000:c1:00.0: amdgpu:     MAPPING_ERROR: 0x0
[155552.381853] amdgpu 0000:c1:00.0: amdgpu:     RW: 0x0
[155562.733993] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[155562.734971] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[155562.735061] amdgpu 0000:c1:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[155562.735063] amdgpu 0000:c1:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[155562.735064] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=5790839, emitted seq=5790841
[155562.735066] amdgpu 0000:c1:00.0: amdgpu:  Process brave pid 3237 thread brave:cs0 pid 3262
[155562.735068] amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[155564.738873] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[155564.738884] amdgpu 0000:c1:00.0: amdgpu: failed to reset legacy queue
[155564.738886] amdgpu 0000:c1:00.0: amdgpu: reset via MES failed and try pipe reset -110
[155564.738888] amdgpu 0000:c1:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
[155564.738889] amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[155564.738891] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!. Source:  1
[155566.887593] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[155566.887599] amdgpu 0000:c1:00.0: amdgpu: failed to unmap legacy queue
[155567.076743] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[155567.078035] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[155567.104173] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[155567.104310] [drm] PCIE GART of 512M enabled (table at 0x000000801FB00000).
[155567.104324] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[155567.107817] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[155567.116905] amdgpu 0000:c1:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003600
[155567.123374] thunderbolt 0000:c3:00.6: 0: failed to allocate DP resource for port 7
[155577.582467] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155579.172466] thunderbolt 0000:c3:00.6: 0:6 <-> 702:10 (DP): not active, tearing down
[155587.822561] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155598.062430] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155608.302723] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155618.542853] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155628.783083] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155639.023064] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
[155649.263201] amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
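
Since the log points at /sys/class/drm/card1/device/devcoredump/data, here is a small sketch for copying that dump somewhere persistent before rebooting. The helper name and the default output file name are made up; the source path is the one printed in the log and may differ on other machines (for example card0 instead of card1).

```shell
# Hypothetical helper: copy the amdgpu device coredump to a regular file.
# $1 = source (defaults to the path printed in the log), $2 = destination.
save_devcoredump() {
    src=${1:-/sys/class/drm/card1/device/devcoredump/data}
    dst=${2:-amdgpu-devcoredump-$(date +%Y%m%d-%H%M%S).txt}
    if [ ! -r "$src" ]; then
        echo "no readable coredump at $src" >&2
        return 1
    fi
    cat "$src" > "$dst" && echo "saved $dst"
}

# Typical use over SSH while the GPU is hung (not executed here):
#   save_devcoredump
```

Reading the default path likely needs root, and as far as I know the kernel discards the dump after a few minutes, so it is worth grabbing early.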

And the hung task logs. A few kworkers are hanging, but all traces are the same:

[155725.040900] INFO: task kworker/9:2:70120 blocked for more than 122 seconds.
[155725.040909]       Tainted: G        W  OE       6.18.4-arch1-1 #1
[155725.040911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[155725.040912] task:kworker/9:2     state:D stack:0     pid:70120 tgid:70120 ppid:2      task_flags:0x4208060 flags:0x00080000
[155725.040918] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[155725.041145] Call Trace:
[155725.041146]  <TASK>
[155725.041150]  __schedule+0x418/0x1320
[155725.041159]  ? ttwu_queue_wakelist+0xfe/0x120
[155725.041164]  schedule+0x27/0xd0
[155725.041166]  schedule_timeout+0xbd/0x100
[155725.041170]  dma_fence_default_wait+0x196/0x270
[155725.041175]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[155725.041176]  dma_fence_wait_timeout+0x129/0x150
[155725.041178]  amdgpu_tlb_fence_work+0x2c/0xe0 [amdgpu 6422097874d6b256c402231ccda3be13871c9e72]
[155725.041274]  process_one_work+0x193/0x350
[155725.041279]  worker_thread+0x2d7/0x410
[155725.041281]  ? __pfx_worker_thread+0x10/0x10
[155725.041282]  kthread+0xfc/0x240
[155725.041285]  ? __pfx_kthread+0x10/0x10
[155725.041286]  ? __pfx_kthread+0x10/0x10
[155725.041286]  ret_from_fork+0x1c2/0x1f0
[155725.041291]  ? __pfx_kthread+0x10/0x10
[155725.041292]  ret_from_fork_asm+0x1a/0x30
[155725.041297]  </TASK>

I’ve seen some TLB fence changes in 6.18.4, not sure how related these are. Before this event I’d been using 6.18.3 for quite a while w/ success, but gosh, I honestly feel that I was just lucky… /sad face/

Not sure if this is worth noting, but all the crashes seem to be faulting on just two addresses:

Jan 12 11:43:28 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
Jan 12 12:07:00 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
Jan 12 14:30:20 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
Jan 12 17:12:44 fw16 kernel: amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10

Googling it actually shows that these are not so uncommon though…
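
The tally of faulting addresses can be reproduced from the kernel log with a short pipeline. A minimal sketch, assuming the "in page starting at address" wording seen in the logs above; count_fault_addresses is a hypothetical helper name.

```shell
# Hypothetical helper: count how often each faulting page address appears
# in kernel log text read from standard input, most frequent first.
count_fault_addresses() {
    grep -o 'in page starting at address 0x[0-9a-f]*' |
        awk '{ print $NF }' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Typical use (not executed here); -b -1 selects the previous boot,
# which is where a hang ends up after a forced reboot:
#   journalctl -k -b -1 | count_fault_addresses
```

Each output line is a count followed by the address, so a small set of recurring addresses stands out immediately.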