[TRACKING] Graphical corruption in Fedora 39 (AMD 3.03 BIOS)

I have updated everything and still experience aggressive flickering under heavy CPU load.

OS: Fedora 39
Kernel: 6.8.8-200.fc39.x86_64
BIOS: 03.05

Things that don’t make a difference:

  • Gaming mode in BIOS
  • Trying various kernel flags, e.g. amdgpu.sg_display=0 and others recommended in this thread
  • Reseating my internal display connector/cable on the motherboard

These workarounds get rid of flickering completely:

  • Either use only external monitors
  • Or unplug the charger when using the internal monitor

I hope one day I will be able to charge + internal monitor + heavy CPU usage :frowning:

Would you be able to share a video of the flickering?

Unfortunately I can’t attach videos here. I can reproduce this reliably by running supertux while forcing software rendering. This causes intense memory and CPU usage, leading to flicker. Unplugging my default power adapter gets rid of the flicker. Plugging it back in causes it to reoccur. My machine has 64 GB of RAM.

sudo dnf install supertux
LIBGL_ALWAYS_SOFTWARE=1 supertux2
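
For comparing workloads under the same conditions (compiling and video encoding reportedly flicker much less than supertux), a small wrapper can run any command with software rendering forced for a fixed time. This is only a sketch: `run_software_rendered` is a made-up helper name, and it assumes coreutils `timeout` is installed.

```shell
# Run a command with Mesa software rendering forced, stopping it after
# a fixed number of seconds so every test run has the same duration.
# Usage: run_software_rendered SECONDS COMMAND [ARGS...]
run_software_rendered() {
    secs="$1"
    shift
    # The variable is set only for this one command, not for the shell.
    LIBGL_ALWAYS_SOFTWARE=1 timeout "$secs" "$@"
}
```

For example, `run_software_rendered 120 supertux2` gives a two-minute run that can be repeated with the charger plugged in and unplugged.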

I’ve tried this on Fedora 40 with the newest BIOS and apart from very high CPU usage and fan noise, there’s no issue with flickering. (Without LIBGL_ALWAYS_SOFTWARE=1 there is no relevant fan noise or CPU usage)

Would be cool if OpenSIL someday lent itself to Coreboot support for AMD laptops

Same as @Jonathan_Haas, couldn’t reproduce. BIOS 3.05/F39/GNOME/Wayland/7840U/64GB/no special kernel args.

Tried playing (I used to be good at platform games, sob) both with LIBGL_ALWAYS_SOFTWARE=1 and without it, laptop panel and external display, no issues other than CPU use and fan noise when enabling this variable.

Since power seems to be part of the issue, note I did this with the laptop plugged in to AC (dock provided PD) and “charged” at the BIOS-set limit of 70%. Maybe this (precise battery and AC use state) is a useful variable to explore.
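
To make the battery and AC state comparable between machines, it could be recorded from sysfs at the moment the flicker is (or is not) visible. A rough sketch: `power_state` is a hypothetical helper, and the device names under `/sys/class/power_supply` differ per machine, so it simply prints whatever standard attributes each device exposes.

```shell
# Print one line per power supply with the attributes that exist:
# type (Mains/Battery/USB), online, status, capacity.
# The sysfs root is a parameter so the function can be pointed elsewhere.
power_state() {
    root="${1:-/sys/class/power_supply}"
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        line="$(basename "$dev")"
        for attr in type online status capacity; do
            if [ -r "$dev/$attr" ]; then
                line="$line $attr=$(cat "$dev/$attr")"
            fi
        done
        echo "$line"
    done
}
```

Running `power_state` once on AC and once on battery, and pasting both outputs next to the flicker observation, should make reports easier to compare.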

Installed Fedora 40. Same issue. Battery percentage makes no difference. Video: https://files.catbox.moe/o9wqb6.webm

Shot in the dark: Grounding/short issue?

Possible to test with different charger/cable combo(s)?

Edit: Also, try different power profile settings? Plugging into AC changes CPU power parameters by default, it would be useful to see if changing power profile while plugged in makes a difference.
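
One way to test the power profiles systematically while the repro is running, sketched under the assumption that power-profiles-daemon is in use and all three profiles are available on the machine. `cycle_profiles` and the `PPCTL` override are made-up names for illustration.

```shell
# Switch through the power profiles, holding each for a number of
# seconds so the flicker can be observed under each one.
# Usage: cycle_profiles [SECONDS_PER_PROFILE]
cycle_profiles() {
    ctl="${PPCTL:-powerprofilesctl}"   # override only for dry runs
    hold="${1:-60}"
    for p in power-saver balanced performance; do
        "$ctl" set "$p" || return 1
        echo "profile=$p"
        sleep "$hold"
    done
}
```

The function leaves the machine on the last profile, so `powerprofilesctl set balanced` afterwards restores the usual default.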

Yes, this looks nothing like the scatter/gather-related corruption that I (at least) had been experiencing, where the screen or a part of it would be filled with white either all the time or on every other frame, and the image would be fine otherwise (so one or both of the buffers used for double buffering in the compositor were broken and had white displayed instead of them). This looks more like a signal integrity problem. Grounding issue or bad power of some other kind? Misrouted display cable?

  • 18 watt smartphone charger: No flicker
  • 60 watt apple charger + different cable: Flicker
  • 98 watt dock connector + different cable: Flicker

Switching to Fedora’s “Power Saver” mode leads to a huge improvement, but the flicker is still there. Performance mode is just as bad as “balanced”.

In the past this used to also happen when running purely on battery. But it got fixed by some update. So I used to assume that this is a driver issue. But maybe it is the display cable. I will investigate this later today.

I don’t know how to investigate grounding issues. If you are talking about “interference”, I’ve tried to get the cables farther away from the motherboard, with the same result.

I even tried reconnecting both display cables with no improvement. My right cable looks like it got chewed up a little from opening and closing the lid.

Other CPU intensive tasks, like compiling and video encoding, cause way less flickering than supertux.

That silver grounding/shielding strap on the display cable in your first picture looks beat up:

[Image: Screenshot from 2024-05-11 18-35-07]

Does it stay stuck when you assemble everything?

(edit: should be sticking to the hinge)

Edit: This discussion probably needs a new thread btw.

It is still sticky, so I’ve reattached it to the hinge. No improvement.

I’m out of ideas. I think it’s time to get support involved if you haven’t already.

I can no longer reproduce the graphical corruption after upgrading to Fedora 40 and updating the BIOS.

I’m on Fedora 40 with kernel 6.8.9-300.fc40.x86_64 and GNOME 46.1
Latest BIOS, 7840U with 96GB of RAM.

My software packages are all up to date. I recently removed amdgpu.sg_display=0 from the kernel command line. So far it has been fine while I was using the laptop display itself. I have set the VRAM allocation to gaming (so with 96GB of RAM I get 4GB allocated).

Today, I plugged in my 4K monitor (Samsung - Odyssey Neo G7 43") running at 120Hz via the DisplayPort module (through usb-1 from the image here), with the laptop lid closed and the charger plugged in (via usb-3).

The display kept turning off momentarily and coming back on at random intervals, ranging from ~30 seconds to ~half an hour.

One of these times, I saw rainbow-like glitches on the bottom half of the screen. Fortunately, unplugging and plugging back the DisplayPort cable recovered the whole screen and I didn’t lose my work.

I checked kernel logs with dmesg and didn’t see any errors.

...
[ 1303.450636] input: Logitech USB Receiver as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.0/0003:046D:C52B.0004/input/input15
[ 1303.503494] hid-generic 0003:046D:C52B.0004: input,hidraw3: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:c1:00.3-1/input0
[ 1303.509235] input: Logitech USB Receiver Mouse as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.1/0003:046D:C52B.0005/input/input16
[ 1303.509509] input: Logitech USB Receiver Consumer Control as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.1/0003:046D:C52B.0005/input/input17
[ 1303.561695] input: Logitech USB Receiver System Control as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.1/0003:046D:C52B.0005/input/input18
[ 1303.562020] hid-generic 0003:046D:C52B.0005: input,hiddev96,hidraw4: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:c1:00.3-1/input1
[ 1303.566694] hid-generic 0003:046D:C52B.0006: hiddev97,hidraw5: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1/input2
[ 1303.704109] logitech-djreceiver 0003:046D:C52B.0006: hiddev96,hidraw3: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1/input2
[ 1303.813151] input: Logitech Wireless Device PID:400a Mouse as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.2/0003:046D:C52B.0006/0003:046D:400A.0007/input/input20
[ 1303.813336] hid-generic 0003:046D:400A.0007: input,hidraw4: USB HID v1.11 Mouse [Logitech Wireless Device PID:400a] on usb-0000:c1:00.3-1/input2:1
[ 1303.883006] input: Logitech M325 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.2/0003:046D:C52B.0006/0003:046D:400A.0007/input/input24
[ 1303.883210] logitech-hidpp-device 0003:046D:400A.0007: input,hidraw4: USB HID v1.11 Mouse [Logitech M325] on usb-0000:c1:00.3-1/input2:1
[ 1330.369395] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
[ 1377.511427] usb 1-1: USB disconnect, device number 4
[ 4108.026412] usb 1-1: new full-speed USB device number 5 using xhci_hcd
[ 4108.179471] usb 1-1: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.10
[ 4108.179480] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 4108.179483] usb 1-1: Product: USB Receiver
[ 4108.179485] usb 1-1: Manufacturer: Logitech
[ 4108.228688] logitech-djreceiver 0003:046D:C52B.000A: hiddev96,hidraw3: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1/input2
[ 4108.343949] input: Logitech M325 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.2/0003:046D:C52B.000A/0003:046D:400A.000B/input/input25
[ 4108.344221] logitech-hidpp-device 0003:046D:400A.000B: input,hidraw4: USB HID v1.11 Mouse [Logitech M325] on usb-0000:c1:00.3-1/input2:1
[ 4111.677534] logitech-hidpp-device 0003:046D:400A.000B: HID++ 2.0 device connected.
[ 4638.871682] usb 7-1: new full-speed USB device number 2 using xhci_hcd
[ 4639.036190] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[ 4639.036200] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 4639.036202] usb 7-1: Product: DisplayPort Expansion Card
[ 4639.036205] usb 7-1: Manufacturer: Framework
[ 4639.036206] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[ 4639.124500] hid-generic 0003:32AC:0003.000C: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[ 4672.780897] usb 7-1: USB disconnect, device number 2
[ 4674.766847] usb 7-1: new full-speed USB device number 3 using xhci_hcd
[ 4674.930522] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[ 4674.930531] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 4674.930534] usb 7-1: Product: DisplayPort Expansion Card
[ 4674.930536] usb 7-1: Manufacturer: Framework
[ 4674.930538] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[ 4674.995859] hid-generic 0003:32AC:0003.000D: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[ 4706.679767] usb 1-1: USB disconnect, device number 5
[ 5064.421269] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 5064.698042] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 5073.140594] usb 1-1: new high-speed USB device number 6 using xhci_hcd
[ 5073.269387] usb 1-1: New USB device found, idVendor=214b, idProduct=7250, bcdDevice= 1.00
[ 5073.269394] usb 1-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[ 5073.269396] usb 1-1: Product: USB2.0 HUB
[ 5073.307561] hub 1-1:1.0: USB hub found
[ 5073.307885] hub 1-1:1.0: 4 ports detected
[ 5073.580793] usb 1-1.3: new full-speed USB device number 7 using xhci_hcd
[ 5073.665076] usb 1-1.3: not running at top speed; connect to a high speed hub
[ 5073.679387] usb 1-1.3: New USB device found, idVendor=320f, idProduct=5064, bcdDevice= 1.01
[ 5073.679394] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 5073.679397] usb 1-1.3: Product: USB DEVICE
[ 5073.679399] usb 1-1.3: Manufacturer: SONIX
[ 5073.760661] input: SONIX USB DEVICE as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.0/0003:320F:5064.000E/input/input26
[ 5073.813125] hid-generic 0003:320F:5064.000E: input,hidraw3: USB HID v1.11 Keyboard [SONIX USB DEVICE] on usb-0000:c1:00.3-1.3/input0
[ 5073.818170] input: SONIX USB DEVICE Keyboard as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.1/0003:320F:5064.000F/input/input27
[ 5073.871531] input: SONIX USB DEVICE as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.1/0003:320F:5064.000F/input/input28
[ 5073.871810] input: SONIX USB DEVICE Mouse as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.1/0003:320F:5064.000F/input/input29
[ 5073.872263] hid-generic 0003:320F:5064.000F: input,hiddev96,hidraw4: USB HID v1.11 Keyboard [SONIX USB DEVICE] on usb-0000:c1:00.3-1.3/input1
[ 5073.937091] usb 1-1.4: new full-speed USB device number 8 using xhci_hcd
[ 5074.038019] usb 1-1.4: New USB device found, idVendor=046d, idProduct=c53f, bcdDevice=44.01
[ 5074.038027] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 5074.038029] usb 1-1.4: Product: USB Receiver
[ 5074.038031] usb 1-1.4: Manufacturer: Logitech
[ 5074.148613] logitech-djreceiver 0003:046D:C53F.0010: hidraw6: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:c1:00.3-1.4/input0
[ 5074.205451] logitech-djreceiver 0003:046D:C53F.0011: hiddev98,hidraw7: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:c1:00.3-1.4/input1
[ 5074.260911] logitech-djreceiver 0003:046D:C53F.0012: hiddev99,hidraw8: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1.4/input2
[ 5074.318089] logitech-djreceiver 0003:046D:C53F.0012: device of type eQUAD Lightspeed 1.1 (0x0d) connected on slot 1
[ 5074.323757] input: Logitech G305 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.4/1-1.4:1.2/0003:046D:C53F.0012/0003:046D:4074.0013/input/input30
[ 5074.324195] logitech-hidpp-device 0003:046D:4074.0013: input,hidraw9: USB HID v1.11 Keyboard [Logitech G305] on usb-0000:c1:00.3-1.4/input2:1
[ 5091.192576] logitech-hidpp-device 0003:046D:4074.0013: HID++ 4.2 device connected.
[ 5119.346472] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ 6341.557015] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 6341.833518] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 7558.399327] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ 7738.585866] warning: `ThreadPoolForeg' uses wireless extensions which will stop working for Wi-Fi 7 hardware; use nl80211
[ 8851.038368] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 8851.301359] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[11714.333622] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[15211.154442] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[15211.417760] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
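
Most of that log is USB enumeration noise. A filter that keeps only the display-related subsystems makes it quicker to confirm that nothing relevant was logged. This is a sketch, and the pattern list is a guess at what matters for this issue, not an exhaustive set; `gpu_log_filter` is a made-up name.

```shell
# Keep only kernel log lines from subsystems that are plausibly
# involved in display problems on this laptop: amdgpu/drm, the USB-C
# policy driver (ucsi_acpi) and thunderbolt.
# Reads the log on stdin, e.g.:  sudo dmesg | gpu_log_filter
gpu_log_filter() {
    grep -E 'amdgpu|\[drm|ucsi_acpi|thunderbolt'
}
```

`grep` exits non-zero when nothing matches, which for the log above is the expected result: none of its lines come from these subsystems. `journalctl -k -b | gpu_log_filter` does the same for the current boot's journal.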

I have a couple of Firefox and Brave tabs open as well as KiCad, yet amdgpu_top says my VRAM usage is at 3.7GB/4.0GB.
Could this be the low-VRAM issue? If so, is there an official way to dedicate more VRAM? I’m aware of the unofficial solutions mentioned here, but I don’t want to risk messing around with the BIOS.
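
The figure amdgpu_top reports can be cross-checked against the amdgpu driver's sysfs counters. A minimal sketch, assuming the GPU is `card0` (it may be `card1` on some systems) and that `mem_info_vram_used` and `mem_info_vram_total` report bytes; `vram_usage` is a made-up helper name.

```shell
# Print dedicated VRAM usage in MiB from the amdgpu sysfs counters.
# Usage: vram_usage [DEVICE_DIR]
vram_usage() {
    dev="${1:-/sys/class/drm/card0/device}"
    used=$(cat "$dev/mem_info_vram_used") || return 1
    total=$(cat "$dev/mem_info_vram_total") || return 1
    # Both counters are byte values; 1048576 bytes = 1 MiB.
    echo "$((used / 1048576)) MiB / $((total / 1048576)) MiB"
}
```

Logging this every few seconds while the monitor is attached would show whether the blackouts line up with VRAM being nearly full.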

Or, do I have to use the scatter-gather workaround still?
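
Before re-testing the scatter-gather workaround, it helps to confirm what the running kernel was actually booted with. A small sketch: `sg_display_arg` is a hypothetical helper that only inspects the kernel command line. It does not tell you what the driver chose by default when the argument is absent.

```shell
# Report whether amdgpu.sg_display was passed on the kernel command line.
# Usage: sg_display_arg [CMDLINE_FILE]
sg_display_arg() {
    cmdline_file="${1:-/proc/cmdline}"
    # One argument per line, then look for the amdgpu.sg_display key.
    tr ' ' '\n' < "$cmdline_file" | grep '^amdgpu\.sg_display=' \
        || echo "amdgpu.sg_display not set on the command line"
}

# On Fedora the argument can be added or removed with grubby
# (takes effect after a reboot):
#   sudo grubby --update-kernel=ALL --args="amdgpu.sg_display=0"
#   sudo grubby --update-kernel=ALL --remove-args="amdgpu.sg_display=0"
```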


EDIT1:
I saw a couple of other corrupted patterns:


As I’m editing this post, the 2-3 lines at the bottom of the screen flicker with the corruption similar to the pictures I added.

I think the moments when the screen randomly turns black and comes back are the driver/GPU preventing such corruption from happening? I speculate this because the corruption went back to normal after one of those moments.

Looking at dmesg now, I see some errors, though I’m not sure if they’re related to this exact issue:

[15503.451730] usb 7-1: USB disconnect, device number 3
[15505.077877] usb 7-1: new full-speed USB device number 4 using xhci_hcd
[15505.243249] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[15505.243257] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[15505.243259] usb 7-1: Product: DisplayPort Expansion Card
[15505.243261] usb 7-1: Manufacturer: Framework
[15505.243263] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[15505.329685] hid-generic 0003:32AC:0003.0014: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[15506.192355] usb 7-1: USB disconnect, device number 4
[15506.365696] ucsi_acpi USBC000:00: UCSI_GET_PDOS failed (-70)
[15506.445258] ucsi_acpi USBC000:00: UCSI_GET_PDOS failed (-70)
[15508.294420] usb 7-1: new full-speed USB device number 5 using xhci_hcd
[15508.457995] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[15508.458005] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[15508.458008] usb 7-1: Product: DisplayPort Expansion Card
[15508.458010] usb 7-1: Manufacturer: Framework
[15508.458012] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[15508.521930] hid-generic 0003:32AC:0003.0015: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[24747.900737] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[24748.163722] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[30649.271188] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[30649.834821] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[30650.458181] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[31233.767458] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[31233.781873] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[31233.810044] wlp1s0: deauthenticating from 84:17:ef:71:4b:a2 by local choice (Reason: 3=DEAUTH_LEAVING)
[31234.404999] PM: suspend entry (s2idle)
[31234.424272] Filesystems sync: 0.019 seconds
[31234.432992] Freezing user space processes
[31234.437647] Freezing user space processes completed (elapsed 0.004 seconds)
[31234.437650] OOM killer disabled.
[31234.437651] Freezing remaining freezable tasks
[31234.439275] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[31234.439278] printk: Suspending console(s) (use no_console_suspend to debug)
[31234.446370] atkbd serio0: Disabling IRQ1 wakeup source to avoid platform firmware bug
[31234.550675] PM: suspend devices took 0.112 seconds
[31234.551334] pcieport 0000:00:08.3: quirk: disabling D3cold for suspend
[31234.552319] ACPI: EC: interrupt blocked
[34344.574429] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state
[34344.646646] ACPI: EC: interrupt unblocked
[34344.896244] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[34344.896305] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[34344.897691] nvme nvme0: Shutdown timeout set to 10 seconds
[34344.899100] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[34344.901553] nvme nvme0: 16/0/0 default/read/poll queues
[34344.930833] ------------[ cut here ]------------
[34344.930834] WARNING: CPU: 8 PID: 43686 at drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1532 dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931100] Modules linked in: hid_logitech_hidpp hid_logitech_dj uinput rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep sunrpc binfmt_misc vfat fat snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof mt7921e snd_hda_codec_realtek mt7921_common snd_sof_utils mt792x_lib snd_hda_codec_hdmi snd_hda_codec_generic snd_soc_core mt76_connac_lib intel_rapl_msr snd_hda_intel mt76 snd_intel_dspcfg intel_rapl_common snd_intel_sdw_acpi snd_hda_codec snd_compress edac_mce_amd ac97_bus snd_pcm_dmaengine snd_hda_core mac80211 kvm_amd btusb snd_pci_ps snd_hwdep btrtl hid_sensor_als snd_rpl_pci_acp6x snd_seq snd_acp_pci btintel hid_sensor_trigger snd_seq_device hid_sensor_iio_common btbcm snd_acp_legacy_common libarc4 btmtk
[34344.931150]  snd_pci_acp6x kvm snd_pcm bluetooth cfg80211 industrialio_triggered_buffer irqbypass kfifo_buf cros_ec_lpcs snd_timer snd_pci_acp5x snd_rn_pci_acp3x industrialio wmi_bmof cros_ec snd rapl amd_pmf pcspkr snd_acp_config thunderbolt snd_soc_acpi amdtee soundcore i2c_piix4 snd_pci_acp3x k10temp rfkill amd_sfh tee platform_profile amd_pmc joydev loop nfnetlink zram dm_crypt amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched crct10dif_pclmul nvme crc32_pclmul drm_suballoc_helper crc32c_intel drm_buddy polyval_clmulni polyval_generic nvme_core drm_display_helper video hid_multitouch ucsi_acpi ghash_clmulni_intel hid_sensor_hub sha512_ssse3 sha256_ssse3 typec_ucsi sha1_ssse3 sp5100_tco ccp cec typec nvme_auth wmi i2c_hid_acpi i2c_hid serio_raw ip6_tables ip_tables fuse
[34344.931200] CPU: 8 PID: 43686 Comm: kworker/u32:69 Not tainted 6.8.9-300.fc40.x86_64 #1
[34344.931203] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.05 03/29/2024
[34344.931204] Workqueue: events_unbound async_run_entry_fn
[34344.931209] RIP: 0010:dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931435] Code: 48 21 c8 48 c1 e2 38 48 09 d0 48 89 85 98 02 00 00 f6 85 c4 02 00 00 02 74 42 e8 7a eb ff ff 84 c0 75 39 48 8b 85 d8 01 00 00 <0f> 0b c6 85 9c 02 00 00 80 48 8b 40 10 48 8b 30 48 85 f6 74 04 48
[34344.931437] RSP: 0018:ffffb8ae0ac67bb0 EFLAGS: 00010246
[34344.931440] RAX: ffff8cf8818e0800 RBX: 00000000ffffffff RCX: 00ffffffffffffff
[34344.931441] RDX: 0000000000000007 RSI: ffffb8ae0ac67bb0 RDI: 0000000000000000
[34344.931443] RBP: ffff8cf88773f000 R08: ffff8cf880f60d20 R09: 00000000000f0000
[34344.931444] R10: 0000000000000000 R11: ffff8d0ec1e21780 R12: ffff8cf88773f000
[34344.931445] R13: ffff8cf89a940000 R14: ffff8cf89a940018 R15: ffffb8ae0ac67be7
[34344.931446] FS:  0000000000000000(0000) GS:ffff8d0ec1e00000(0000) knlGS:0000000000000000
[34344.931447] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34344.931449] CR2: 00007fbccced6146 CR3: 00000009f0428000 CR4: 0000000000f50ef0
[34344.931450] PKRU: 55555554
[34344.931451] Call Trace:
[34344.931454]  <TASK>
[34344.931456]  ? dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931656]  ? __warn+0x81/0x130
[34344.931661]  ? dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931853]  ? report_bug+0x16f/0x1a0
[34344.931858]  ? handle_bug+0x3c/0x80
[34344.931860]  ? exc_invalid_op+0x17/0x70
[34344.931862]  ? asm_exc_invalid_op+0x1a/0x20
[34344.931868]  ? dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.932054]  link_blank_all_dp_displays+0x9b/0x1a0 [amdgpu]
[34344.932259]  dcn31_init_hw+0x1e0/0x990 [amdgpu]
[34344.932476]  dc_set_power_state+0x67/0xb0 [amdgpu]
[34344.932663]  dm_resume+0x10f/0xb00 [amdgpu]
[34344.932806]  ? srso_alias_return_thunk+0x5/0xfbef5
[34344.932808]  ? _dev_info+0x77/0xa0
[34344.932811]  amdgpu_device_ip_resume_phase2+0xa0/0x1d0 [amdgpu]
[34344.932903]  amdgpu_device_resume+0xa0/0x2c0 [amdgpu]
[34344.932995]  ? __pfx_pci_pm_resume+0x10/0x10
[34344.932998]  amdgpu_pmops_resume+0x4a/0x80 [amdgpu]
[34344.933088]  ? __pfx_pci_pm_resume+0x10/0x10
[34344.933089]  dpm_run_callback+0x89/0x1e0
[34344.933092]  device_resume+0xb3/0x300
[34344.933094]  async_resume+0x1d/0x30
[34344.933095]  async_run_entry_fn+0x31/0x130
[34344.933097]  process_one_work+0x16f/0x330
[34344.933100]  worker_thread+0x273/0x3c0
[34344.933102]  ? __pfx_worker_thread+0x10/0x10
[34344.933104]  kthread+0xe5/0x120
[34344.933106]  ? __pfx_kthread+0x10/0x10
[34344.933107]  ret_from_fork+0x31/0x50
[34344.933109]  ? __pfx_kthread+0x10/0x10
[34344.933111]  ret_from_fork_asm+0x1b/0x30
[34344.933114]  </TASK>
[34344.933115] ---[ end trace 0000000000000000 ]---
[34345.059691] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[34345.060598] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[34345.060896] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[34345.060899] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[34345.060901] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[34345.060903] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[34345.060905] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[34345.060906] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[34345.060908] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[34345.060909] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[34345.060911] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[34345.060913] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[34345.060914] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[34345.060916] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[34345.060917] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[34345.066058] [drm] ring gfx_32792.1.1 was added
[34345.066647] [drm] ring compute_32792.2.2 was added
[34345.067178] [drm] ring sdma_32792.3.3 was added
[34345.067204] [drm] ring gfx_32792.1.1 ib test pass
[34345.067231] [drm] ring compute_32792.2.2 ib test pass
[34345.067343] [drm] ring sdma_32792.3.3 ib test pass
[34345.086065] usb 1-1: reset high-speed USB device number 6 using xhci_hcd
[34345.492127] [drm:retrieve_link_cap [amdgpu]] *ERROR* retrieve_link_cap: Read receiver caps dpcd data failed.
[34345.566335] usb 1-1.3: reset full-speed USB device number 7 using xhci_hcd
[34345.726066] usb 1-1.4: reset full-speed USB device number 8 using xhci_hcd
[34345.848498] PM: resume devices took 0.956 seconds
[34345.848729] OOM killer enabled.
[34345.848730] Restarting tasks ... done.
[34345.852495] random: crng reseeded on system resumption
[34345.853297] PM: suspend exit
[34346.118442] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[34346.391384] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[34351.490123] wlp1s0: authenticate with 84:17:ef:71:4b:a2 (local address=1a:4a:16:51:e0:bc)
[34351.501826] wlp1s0: send auth to 84:17:ef:71:4b:a2 (try 1/3)
[34351.504435] wlp1s0: authenticated
[34351.506251] wlp1s0: associate with 84:17:ef:71:4b:a2 (try 1/3)
[34351.518292] wlp1s0: RX AssocResp from 84:17:ef:71:4b:a2 (capab=0x1511 status=0 aid=6)
[34351.542294] wlp1s0: associated
[34351.654285] wlp1s0: Limiting TX power to 30 (30 - 0) dBm as advertised by 84:17:ef:71:4b:a2
[35761.095708] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[35761.358131] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
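
For anyone comparing logs: with output this long, a tally of the distinct `*ERROR*` messages is easier to compare across machines than the raw dump. A sketch using awk; `count_drm_errors` is a made-up name, and it assumes the default dmesg timestamp format shown above.

```shell
# Count distinct kernel "*ERROR*" messages, ignoring the timestamp.
# Reads the log on stdin and prints "COUNT<TAB>MESSAGE", most frequent first.
count_drm_errors() {
    awk '/\*ERROR\*/ {
             sub(/^\[[ 0-9.]+\] /, "")   # strip the "[12345.678901] " prefix
             n[$0]++
         }
         END { for (k in n) print n[k] "\t" k }' | sort -rn
}
```

Applied to the log above, this would group the three `dp_set_fec_ready` failures into one line and list the `retrieve_link_cap` failure separately. The `dp_retrieve_lttpr_cap` entry is a WARNING rather than an ERROR, so this filter does not count it.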

I have the same issue, but only when doing GPU-intensive tasks at larger resolutions. Simple web browsing and watching videos work just fine on my external 4K@144Hz monitor.

But unlike your example with the glitches, my screen just stays off and everything hangs forever. I assume the hang is caused by a different bug related to my dock that I’ve mentioned in another thread. All of this happens with and without gaming mode in the BIOS.

[ 1788.822528] gmc_v11_0_process_interrupt: 55 callbacks suppressed
[ 1788.822545] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
[ 1788.822558] amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x0000aaab42697000 from client 10
[ 1788.822564] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[ 1788.822568] amdgpu 0000:c1:00.0: amdgpu: 	 Faulty UTCL2 client ID: SQC (data) (0xa)
[ 1788.822572] amdgpu 0000:c1:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 1788.822576] amdgpu 0000:c1:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1788.822579] amdgpu 0000:c1:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 1788.822582] amdgpu 0000:c1:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1788.822585] amdgpu 0000:c1:00.0: amdgpu: 	 RW: 0x0
[ 1788.822599] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
[ 1788.822606] amdgpu 0000:c1:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
[ 1788.822610] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1788.822614] amdgpu 0000:c1:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[ 1788.822617] amdgpu 0000:c1:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 1788.822620] amdgpu 0000:c1:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1788.822623] amdgpu 0000:c1:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 1788.822626] amdgpu 0000:c1:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1788.822629] amdgpu 0000:c1:00.0: amdgpu: 	 RW: 0x0
[ 1799.255307] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=396987, emitted seq=396989
[ 1799.255551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369
[ 1799.255832] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 1799.562697] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1799.563172] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1799.719984] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1799.720449] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1799.876983] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1799.877422] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.033315] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.033760] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.189975] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.190420] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.346518] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.346954] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.503116] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.503559] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.657708] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.658401] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.805842] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.806540] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1801.073811] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 1801.075635] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 1801.108338] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1801.108964] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[ 1801.109188] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 1801.111200] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 1801.113332] [drm] DMUB hardware initialized: version=0x08003700
[ 1806.235965] thunderbolt 0000:c3:00.6: 0:4 <-> 2:16 (USB3): failed to calculate available bandwidth
[ 1811.534132] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1821.773281] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1832.013311] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1842.253284] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1852.494275] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1855.512089] usb 7-1: USB disconnect, device number 2
[ 1855.512096] usb 7-1.1: USB disconnect, device number 4
[ 1855.512100] usb 7-1.1.1: USB disconnect, device number 6
[ 1855.512102] usb 7-1.1.1.3: USB disconnect, device number 8
[ 1855.513557] pcieport 0000:00:04.1: pciehp: Slot(0-1): Link Down
[ 1855.513563] pcieport 0000:00:04.1: pciehp: Slot(0-1): Card not present
[ 1855.513628] pcieport 0000:00:04.1: PME: Spurious native interrupt!
[ 1855.513642] pcieport 0000:00:04.1: PME: Spurious native interrupt!
[ 1855.513847] igc 0000:c0:00.0 enp192s0: PHC removed
[ 1855.513979] igc 0000:c0:00.0 enp192s0: PCIe link lost, device now detached
[ 1855.517195] thunderbolt 1-2: device disconnected
[ 1855.585544] pcieport 0000:63:03.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.587006] pcieport 0000:63:03.0: Runtime PM usage count underflow!
[ 1855.587039] pcieport 0000:63:02.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.587569] pcieport 0000:63:02.0: Runtime PM usage count underflow!
[ 1855.587589] pcieport 0000:63:01.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.588078] pcieport 0000:63:01.0: Runtime PM usage count underflow!
[ 1855.588095] pcieport 0000:63:00.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.588383] pci_bus 0000:64: busn_res: [bus 64] is released
[ 1855.589672] pci_bus 0000:65: busn_res: [bus 65-83] is released
[ 1855.590222] pci_bus 0000:84: busn_res: [bus 84-a2] is released
[ 1855.590704] pci_bus 0000:a3: busn_res: [bus a3-bf] is released
[ 1855.591129] pci_bus 0000:c0: busn_res: [bus c0] is released
[ 1855.591350] pci_bus 0000:63: busn_res: [bus 63-c0] is released
[ 1855.773422] usb 8-1: USB disconnect, device number 2
[ 1855.773433] usb 8-1.4: USB disconnect, device number 3
[ 1855.773437] usb 8-1.4.1: USB disconnect, device number 4
[ 1855.781009] usb 8-1.4.2: USB disconnect, device number 5
[ 1855.876724] usb 7-1.1.2: USB disconnect, device number 7
[ 1855.876732] usb 7-1.1.2.2: USB disconnect, device number 9
[ 1856.085155] usb 7-1.1.2.5: USB disconnect, device number 10
[ 1856.181073] usb 7-1.1.5: USB disconnect, device number 5
[ 1856.244964] usb 7-1.3: USB disconnect, device number 3
[ 1862.733333] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1872.973294] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1873.613289] pcieport 0000:00:08.3: PME: Spurious native interrupt!
[ 1883.213307] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1893.453312] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1903.693571] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1913.933455] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1924.173296] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1934.413285] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1944.653303] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1954.893484] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
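
The timeout lines in the log above are spaced roughly 10.24 s apart, and the spacing carries on across the dock disconnect at 1855 s. For anyone who wants to check their own log for the same pattern, here is a minimal Python sketch. The function names and the `SAMPLE` excerpt are only for illustration, and it assumes the default `dmesg` timestamp format (`[ seconds.micros] message`).

```python
import re

# Matches a dmesg line such as "[ 1811.534132] message" (default timestamp format assumed).
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
MARKER = "amdgpu_dm_process_dmub_aux_transfer_sync"


def aux_timeout_times(log_text):
    """Return the timestamps (in seconds) of the DMUB AUX transfer timeout lines."""
    times = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and MARKER in match.group(2) and "timeout" in match.group(2):
            times.append(float(match.group(1)))
    return times


def intervals(times):
    """Return the gaps between consecutive timestamps, rounded to milliseconds."""
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# Short excerpt copied from the log above, used only as sample input.
SAMPLE = """\
[ 1811.534132] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1821.773281] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1832.013311] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1855.512089] usb 7-1: USB disconnect, device number 2
"""

if __name__ == "__main__":
    stamps = aux_timeout_times(SAMPLE)
    print(len(stamps), "timeouts, gaps in seconds:", intervals(stamps))
```

To run it on a real log, replace `SAMPLE` with the text of a saved `dmesg` dump.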

I brought back the amdgpu.sg_display=0 workaround and have had a more stable experience. I have seen only one momentary blackout and one instance of corruption, which covered the full screen of the 4K monitor after a reboot for installing updates.

My flickering is resolved; it was caused by a faulty display kit. One issue less, now let's see how long it takes to fix the crash/hang.


AMD GPU Driver Crash

Summary: I found a way to reproduce the crash/hang reliably and wrote a guide at the bottom of this post.

Background

When running specific GPU loads, my system sometimes crashes. I managed to capture the kernel log, and every single time I get this page fault:

amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
amdgpu:   in page starting at address 0x000000003f800000 from client 10
amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
amdgpu: 	 MORE_FAULTS: 0x0
amdgpu: 	 WALKER_ERROR: 0x0
amdgpu: 	 PERMISSION_FAULTS: 0x0
amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu: 	 RW: 0x0
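
To compare faults across several crashes (same process? same address? same vmid?), the first two lines of the report can be pulled apart with a small script. This is a rough Python sketch; the regular expressions are written against the exact wording of the log above and are my own assumption, so they may need adjusting for other kernel versions.

```python
import re

# Patterns modelled on the two header lines of the fault report shown above.
HEADER_RE = re.compile(
    r"\[(?P<hub>\w+)\] page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), for process (?P<process>.+?) "
    r"pid (?P<pid>\d+) thread (?P<thread>.+?) pid (?P<tid>\d+)\)"
)
ADDRESS_RE = re.compile(
    r"in page starting at address (?P<address>0x[0-9a-fA-F]+) from client (?P<client>\d+)"
)


def parse_page_fault(log_text):
    """Extract the key fields of one amdgpu page fault report, or None if absent."""
    header = HEADER_RE.search(log_text)
    address = ADDRESS_RE.search(log_text)
    if header is None or address is None:
        return None
    return {
        "hub": header["hub"],
        "process": header["process"],
        "pid": int(header["pid"]),
        "thread": header["thread"],
        "vmid": int(header["vmid"]),
        "address": int(address["address"], 16),
        "client": int(address["client"]),
    }


# The first two lines of the report above, used as sample input.
SAMPLE_FAULT = """\
amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
amdgpu:   in page starting at address 0x000000003f800000 from client 10
"""

if __name__ == "__main__":
    print(parse_page_fault(SAMPLE_FAULT))
```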

This issue occurs frequently when running 3d workloads inside a Qemu guest VM. I was not able to reproduce it by running the same workload outside the VM. But this is not a Qemu bug, because the crash itself happens on the host system, and a guest should never be able to crash its host.

System

Framework 13
AMD Ryzen 7840U
64 GB Memory
BIOS: 03.05

| | Configuration 1 | Configuration 2 |
|---|---|---|
| Host | Fedora 40 | Ubuntu 24.04 LTS (Live Environment) |
| Host Kernel | 6.8.10-300.fc40, 6.8.11-300.fc40 | 6.8.?? (not completely sure) |
| Guest | Alpine 3.19.1 | Fedora 39 |
| Crash Behavior | Screen(s) turn black forever until poweroff is forced | Full system freeze for 10 - 20 seconds. Then it normalizes, until the still running workload triggers it again |

It makes no difference whether I run on battery or with the charger. Disabling/enabling amdgpu.sg_display=0 and/or GamingMode in the BIOS makes no difference either.

Guide To Reproduce The Issue

Time required: ~15 Minutes
Example OS: Ubuntu 24.04 LTS (Live Environment)

1. Set up a VM with 3d acceleration

  1. Install gnome-boxes through the software app
  2. Open gnome-boxes and click the :heavy_plus_sign: on the top left corner to download an OS
  3. Select Fedora 39 and do a full installation. Just testing in the live image itself is not enough
  4. Reboot Fedora 39. Gnome-boxes may not autostart the VM, but you can do this by double-clicking on the VM’s icon. Gnome-boxes sometimes crashes here, but just start it again and retry
  5. Gnome-boxes always boots into the live-image/installer, not the system which you just installed. So you have to interrupt GRUB and move down to Troubleshooting. There you can select the first partition. Note: You have to do this on every VM reboot to prevent starting the live-image/installer again
  6. Update Fedora 39 through the software center. I don’t know if this is required to reproduce the crash, but I did it anyway. Note that you don’t need a full upgrade to Fedora 40, just the basic updates are enough. Then reboot if the updater tells you to
  7. Shut down the VM
  8. Gnome-boxes has a 3d acceleration setting in the VM’s properties, but unfortunately this does not work. Therefore we will boot our freshly installed Fedora 39 image with Virt-Manager
  9. sudo apt install virt-manager
  10. Start Virt-Manager. If it complains about not connecting to Qemu’s system session, ignore it. Create a new session and select “user session”
  11. In the menu where you can create a new VM, there is an option to import an existing VM image. Select it. The image should be located somewhere in ~/snap/gnome-boxes/...
  12. It asks you for an OS name. Enter Fedora 39
  13. Select “configure VM after creation” (or something like that)
  14. You now need to enable several GPU-related settings. I don’t know which order is the right one, but Virt-Manager will tell you. Settings to enable:
  • Memory
    • Enable shared memory: :heavy_check_mark:
  • Video Virtio
    • Model: Virtio
    • 3d acceleration: :heavy_check_mark:
  • Display Spice
    • Listen type: None
    • OpenGL: :heavy_check_mark:
  15. Press the button on the top left of this window which says something like “complete installation”
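
If you want to double-check that the settings from step 14 actually ended up in the VM definition, the domain XML can be inspected (for a user session, something like `virsh --connect qemu:///session dumpxml <vm-name>` should print it). Below is a minimal Python sketch that checks for them. The element and attribute names are what I understand libvirt to use for these options, and the sample XML with the name `fedora39` is hypothetical, so treat it as an assumption rather than a reference.

```python
import xml.etree.ElementTree as ET


def check_gpu_settings(domain_xml):
    """Report whether the GPU-related settings from step 14 appear in a libvirt domain XML."""
    root = ET.fromstring(domain_xml)
    access = root.find("./memoryBacking/access")
    model = root.find("./devices/video/model")
    accel = root.find("./devices/video/model/acceleration")
    spice = root.find("./devices/graphics[@type='spice']")
    listen = spice.find("listen") if spice is not None else None
    gl = spice.find("gl") if spice is not None else None
    return {
        "shared_memory": access is not None and access.get("mode") == "shared",
        "virtio_video": model is not None and model.get("type") == "virtio",
        "accel3d": accel is not None and accel.get("accel3d") == "yes",
        "spice_listen_none": listen is not None and listen.get("type") == "none",
        "spice_opengl": gl is not None and gl.get("enable") == "yes",
    }


# Hypothetical, heavily trimmed domain definition, used only as sample input.
SAMPLE_DOMAIN = """\
<domain type='kvm'>
  <name>fedora39</name>
  <memoryBacking>
    <source type='memfd'/>
    <access mode='shared'/>
  </memoryBacking>
  <devices>
    <graphics type='spice'>
      <listen type='none'/>
      <gl enable='yes'/>
    </graphics>
    <video>
      <model type='virtio' heads='1' primary='yes'>
        <acceleration accel3d='yes'/>
      </model>
    </video>
  </devices>
</domain>
"""

if __name__ == "__main__":
    for setting, ok in check_gpu_settings(SAMPLE_DOMAIN).items():
        print(setting, "ok" if ok else "MISSING")
```

If any entry comes back as missing, the VM probably did not get the corresponding setting from Virt-Manager, and the crash may not reproduce.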

2. Trigger the crash

  1. Boot your VM and open Firefox
  2. Search for basemark webgl and it will send you to this page: https://web.basemark.com/
  3. Run the benchmark. The first 6 tests will pass, but the crash happens on test 7/20

Note: When I tested this on two different Linux distributions, there were some differences in behaviour. On the Fedora 40 host the crash happens within seconds. On the Ubuntu Live host it took around a minute.
