I can no longer reproduce the graphical corruption after upgrading to Fedora 40 and the BIOS.
I’m on Fedora 40 with kernel 6.8.9-300.fc40.x86_64 and GNOME 46.1.
Latest BIOS, 7840U with 96GB of RAM.
My software packages are all up to date as of now. I recently removed the amdgpu.sg_display=0 kernel command-line parameter. So far it has been fine, as I was using the laptop display itself. I have set the VRAM allocation to gaming (so with 96GB of RAM I get 4GB allocated).
Today, I plugged in my 4K monitor (Samsung Odyssey Neo G7 43") running at 120Hz via the DisplayPort module (in usb-1 from the image here), with the laptop lid closed and the charger plugged in (via usb-3).
The display kept turning off momentarily and coming back on at random intervals, ranging from ~30 seconds to ~half an hour.
One of these times, I saw rainbow-like glitches on the bottom half of the screen. Fortunately, unplugging and replugging the DisplayPort cable recovered the whole screen and I didn’t lose my work.
I checked kernel logs with dmesg and didn’t see any errors.
...
[ 1303.450636] input: Logitech USB Receiver as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.0/0003:046D:C52B.0004/input/input15
[ 1303.503494] hid-generic 0003:046D:C52B.0004: input,hidraw3: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:c1:00.3-1/input0
[ 1303.509235] input: Logitech USB Receiver Mouse as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.1/0003:046D:C52B.0005/input/input16
[ 1303.509509] input: Logitech USB Receiver Consumer Control as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.1/0003:046D:C52B.0005/input/input17
[ 1303.561695] input: Logitech USB Receiver System Control as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.1/0003:046D:C52B.0005/input/input18
[ 1303.562020] hid-generic 0003:046D:C52B.0005: input,hiddev96,hidraw4: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:c1:00.3-1/input1
[ 1303.566694] hid-generic 0003:046D:C52B.0006: hiddev97,hidraw5: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1/input2
[ 1303.704109] logitech-djreceiver 0003:046D:C52B.0006: hiddev96,hidraw3: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1/input2
[ 1303.813151] input: Logitech Wireless Device PID:400a Mouse as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.2/0003:046D:C52B.0006/0003:046D:400A.0007/input/input20
[ 1303.813336] hid-generic 0003:046D:400A.0007: input,hidraw4: USB HID v1.11 Mouse [Logitech Wireless Device PID:400a] on usb-0000:c1:00.3-1/input2:1
[ 1303.883006] input: Logitech M325 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.2/0003:046D:C52B.0006/0003:046D:400A.0007/input/input24
[ 1303.883210] logitech-hidpp-device 0003:046D:400A.0007: input,hidraw4: USB HID v1.11 Mouse [Logitech M325] on usb-0000:c1:00.3-1/input2:1
[ 1330.369395] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
[ 1377.511427] usb 1-1: USB disconnect, device number 4
[ 4108.026412] usb 1-1: new full-speed USB device number 5 using xhci_hcd
[ 4108.179471] usb 1-1: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.10
[ 4108.179480] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 4108.179483] usb 1-1: Product: USB Receiver
[ 4108.179485] usb 1-1: Manufacturer: Logitech
[ 4108.228688] logitech-djreceiver 0003:046D:C52B.000A: hiddev96,hidraw3: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1/input2
[ 4108.343949] input: Logitech M325 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1:1.2/0003:046D:C52B.000A/0003:046D:400A.000B/input/input25
[ 4108.344221] logitech-hidpp-device 0003:046D:400A.000B: input,hidraw4: USB HID v1.11 Mouse [Logitech M325] on usb-0000:c1:00.3-1/input2:1
[ 4111.677534] logitech-hidpp-device 0003:046D:400A.000B: HID++ 2.0 device connected.
[ 4638.871682] usb 7-1: new full-speed USB device number 2 using xhci_hcd
[ 4639.036190] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[ 4639.036200] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 4639.036202] usb 7-1: Product: DisplayPort Expansion Card
[ 4639.036205] usb 7-1: Manufacturer: Framework
[ 4639.036206] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[ 4639.124500] hid-generic 0003:32AC:0003.000C: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[ 4672.780897] usb 7-1: USB disconnect, device number 2
[ 4674.766847] usb 7-1: new full-speed USB device number 3 using xhci_hcd
[ 4674.930522] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[ 4674.930531] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 4674.930534] usb 7-1: Product: DisplayPort Expansion Card
[ 4674.930536] usb 7-1: Manufacturer: Framework
[ 4674.930538] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[ 4674.995859] hid-generic 0003:32AC:0003.000D: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[ 4706.679767] usb 1-1: USB disconnect, device number 5
[ 5064.421269] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 5064.698042] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 5073.140594] usb 1-1: new high-speed USB device number 6 using xhci_hcd
[ 5073.269387] usb 1-1: New USB device found, idVendor=214b, idProduct=7250, bcdDevice= 1.00
[ 5073.269394] usb 1-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[ 5073.269396] usb 1-1: Product: USB2.0 HUB
[ 5073.307561] hub 1-1:1.0: USB hub found
[ 5073.307885] hub 1-1:1.0: 4 ports detected
[ 5073.580793] usb 1-1.3: new full-speed USB device number 7 using xhci_hcd
[ 5073.665076] usb 1-1.3: not running at top speed; connect to a high speed hub
[ 5073.679387] usb 1-1.3: New USB device found, idVendor=320f, idProduct=5064, bcdDevice= 1.01
[ 5073.679394] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 5073.679397] usb 1-1.3: Product: USB DEVICE
[ 5073.679399] usb 1-1.3: Manufacturer: SONIX
[ 5073.760661] input: SONIX USB DEVICE as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.0/0003:320F:5064.000E/input/input26
[ 5073.813125] hid-generic 0003:320F:5064.000E: input,hidraw3: USB HID v1.11 Keyboard [SONIX USB DEVICE] on usb-0000:c1:00.3-1.3/input0
[ 5073.818170] input: SONIX USB DEVICE Keyboard as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.1/0003:320F:5064.000F/input/input27
[ 5073.871531] input: SONIX USB DEVICE as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.1/0003:320F:5064.000F/input/input28
[ 5073.871810] input: SONIX USB DEVICE Mouse as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.3/1-1.3:1.1/0003:320F:5064.000F/input/input29
[ 5073.872263] hid-generic 0003:320F:5064.000F: input,hiddev96,hidraw4: USB HID v1.11 Keyboard [SONIX USB DEVICE] on usb-0000:c1:00.3-1.3/input1
[ 5073.937091] usb 1-1.4: new full-speed USB device number 8 using xhci_hcd
[ 5074.038019] usb 1-1.4: New USB device found, idVendor=046d, idProduct=c53f, bcdDevice=44.01
[ 5074.038027] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 5074.038029] usb 1-1.4: Product: USB Receiver
[ 5074.038031] usb 1-1.4: Manufacturer: Logitech
[ 5074.148613] logitech-djreceiver 0003:046D:C53F.0010: hidraw6: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:c1:00.3-1.4/input0
[ 5074.205451] logitech-djreceiver 0003:046D:C53F.0011: hiddev98,hidraw7: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:c1:00.3-1.4/input1
[ 5074.260911] logitech-djreceiver 0003:046D:C53F.0012: hiddev99,hidraw8: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:c1:00.3-1.4/input2
[ 5074.318089] logitech-djreceiver 0003:046D:C53F.0012: device of type eQUAD Lightspeed 1.1 (0x0d) connected on slot 1
[ 5074.323757] input: Logitech G305 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.3/usb1/1-1/1-1.4/1-1.4:1.2/0003:046D:C53F.0012/0003:046D:4074.0013/input/input30
[ 5074.324195] logitech-hidpp-device 0003:046D:4074.0013: input,hidraw9: USB HID v1.11 Keyboard [Logitech G305] on usb-0000:c1:00.3-1.4/input2:1
[ 5091.192576] logitech-hidpp-device 0003:046D:4074.0013: HID++ 4.2 device connected.
[ 5119.346472] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ 6341.557015] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 6341.833518] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 7558.399327] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ 7738.585866] warning: `ThreadPoolForeg' uses wireless extensions which will stop working for Wi-Fi 7 hardware; use nl80211
[ 8851.038368] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 8851.301359] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[11714.333622] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[15211.154442] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[15211.417760] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
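The log above shows no driver errors, but it does show the DisplayPort Expansion Card on usb 7-1 dropping off the bus and re-enumerating. A minimal sketch of how those events could be pulled out of a saved dmesg dump and lined up against the moments the screen went black; the function name `usb_reconnect_gaps` and the idea of pairing each disconnect with the next "new ... USB device" line are assumptions made for illustration, not something from the original post.

```python
import re

# Matches the "[ 4672.780897] " timestamp prefix that dmesg puts on each line.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def usb_reconnect_gaps(log_text, port="7-1"):
    """Return (disconnect_time, reconnect_time) pairs for one USB port."""
    prefix = f"usb {port}: "
    pairs = []
    pending = None  # timestamp of a disconnect still waiting for its reconnect
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        stamp, message = float(m.group(1)), m.group(2)
        if not message.startswith(prefix):
            continue  # other ports (e.g. "usb 7-1.1: ") are ignored
        event = message[len(prefix):]
        if event.startswith("USB disconnect"):
            pending = stamp
        elif event.startswith("new ") and pending is not None:
            pairs.append((pending, stamp))
            pending = None
    return pairs


# Three lines copied from the log above.
SAMPLE = """\
[ 4672.780897] usb 7-1: USB disconnect, device number 2
[ 4674.766847] usb 7-1: new full-speed USB device number 3 using xhci_hcd
[ 4706.679767] usb 1-1: USB disconnect, device number 5
"""

for down, up in usb_reconnect_gaps(SAMPLE):
    print(f"port 7-1 dropped at {down:.1f}s and was back {up - down:.2f}s later")
```

Feeding it the full output of `dmesg` instead of `SAMPLE` gives one pair per drop, which can then be compared with the times the monitor blanked.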
I have a couple of Firefox and Brave tabs open, as well as KiCad. Yet amdgpu_top says my VRAM usage is at 3.7GB/4.0GB.
Could it be a low-VRAM issue? If so, is there an official way to dedicate more VRAM? I’m aware of the unofficial solutions mentioned here, but I don’t want to risk messing around with the BIOS.
Or, do I have to use the scatter-gather workaround still?
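For anyone wanting to watch the VRAM figure without amdgpu_top, here is a rough sketch that reads the counters the amdgpu driver exposes in sysfs (`mem_info_vram_used` and `mem_info_vram_total`, both in bytes). The helper names and the loop over `card[0-9]` are illustrative assumptions; the card index differs between machines.

```python
from pathlib import Path


def vram_usage(device_dir):
    """Read used/total VRAM in bytes from an amdgpu sysfs device directory."""
    base = Path(device_dir)
    used = int((base / "mem_info_vram_used").read_text().strip())
    total = int((base / "mem_info_vram_total").read_text().strip())
    return used, total


def describe(used, total):
    """Format byte counts as GiB plus a percentage."""
    gib = 1024 ** 3
    return f"{used / gib:.1f} GiB of {total / gib:.1f} GiB ({100 * used / total:.0f}%)"


if __name__ == "__main__":
    # Assumption: the iGPU shows up as card0..card9; only cards that expose
    # the amdgpu memory counters are printed.
    for card in sorted(Path("/sys/class/drm").glob("card[0-9]")):
        device = card / "device"
        if (device / "mem_info_vram_total").exists():
            print(card.name, describe(*vram_usage(device)))
```

Running it in a loop while opening browser tabs or KiCad would show whether the 4GB carve-out is really close to full when the glitches appear.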
EDIT1:
I saw a couple of other corrupted patterns:
As I’m editing this post, the 2-3 lines at the bottom of the screen flicker with the corruption similar to the pictures I added.
I think the moments when the screen randomly turns black and comes back are the driver/GPU preventing such corruption from happening? I speculate this because the corruption would go back to normal after one of those moments.
Looking at dmesg now, I see some errors, though I’m not sure whether they’re related to this exact issue:
[15503.451730] usb 7-1: USB disconnect, device number 3
[15505.077877] usb 7-1: new full-speed USB device number 4 using xhci_hcd
[15505.243249] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[15505.243257] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[15505.243259] usb 7-1: Product: DisplayPort Expansion Card
[15505.243261] usb 7-1: Manufacturer: Framework
[15505.243263] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[15505.329685] hid-generic 0003:32AC:0003.0014: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[15506.192355] usb 7-1: USB disconnect, device number 4
[15506.365696] ucsi_acpi USBC000:00: UCSI_GET_PDOS failed (-70)
[15506.445258] ucsi_acpi USBC000:00: UCSI_GET_PDOS failed (-70)
[15508.294420] usb 7-1: new full-speed USB device number 5 using xhci_hcd
[15508.457995] usb 7-1: New USB device found, idVendor=32ac, idProduct=0003, bcdDevice= 0.00
[15508.458005] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[15508.458008] usb 7-1: Product: DisplayPort Expansion Card
[15508.458010] usb 7-1: Manufacturer: Framework
[15508.458012] usb 7-1: SerialNumber: 11AD1D0083403F0D30260B00
[15508.521930] hid-generic 0003:32AC:0003.0015: hiddev97,hidraw5: USB HID v1.11 Device [Framework DisplayPort Expansion Card] on usb-0000:c3:00.4-1/input1
[24747.900737] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[24748.163722] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[30649.271188] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[30649.834821] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[30650.458181] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[31233.767458] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[31233.781873] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[31233.810044] wlp1s0: deauthenticating from 84:17:ef:71:4b:a2 by local choice (Reason: 3=DEAUTH_LEAVING)
[31234.404999] PM: suspend entry (s2idle)
[31234.424272] Filesystems sync: 0.019 seconds
[31234.432992] Freezing user space processes
[31234.437647] Freezing user space processes completed (elapsed 0.004 seconds)
[31234.437650] OOM killer disabled.
[31234.437651] Freezing remaining freezable tasks
[31234.439275] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[31234.439278] printk: Suspending console(s) (use no_console_suspend to debug)
[31234.446370] atkbd serio0: Disabling IRQ1 wakeup source to avoid platform firmware bug
[31234.550675] PM: suspend devices took 0.112 seconds
[31234.551334] pcieport 0000:00:08.3: quirk: disabling D3cold for suspend
[31234.552319] ACPI: EC: interrupt blocked
[34344.574429] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state
[34344.646646] ACPI: EC: interrupt unblocked
[34344.896244] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[34344.896305] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[34344.897691] nvme nvme0: Shutdown timeout set to 10 seconds
[34344.899100] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[34344.901553] nvme nvme0: 16/0/0 default/read/poll queues
[34344.930833] ------------[ cut here ]------------
[34344.930834] WARNING: CPU: 8 PID: 43686 at drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1532 dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931100] Modules linked in: hid_logitech_hidpp hid_logitech_dj uinput rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep sunrpc binfmt_misc vfat fat snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof mt7921e snd_hda_codec_realtek mt7921_common snd_sof_utils mt792x_lib snd_hda_codec_hdmi snd_hda_codec_generic snd_soc_core mt76_connac_lib intel_rapl_msr snd_hda_intel mt76 snd_intel_dspcfg intel_rapl_common snd_intel_sdw_acpi snd_hda_codec snd_compress edac_mce_amd ac97_bus snd_pcm_dmaengine snd_hda_core mac80211 kvm_amd btusb snd_pci_ps snd_hwdep btrtl hid_sensor_als snd_rpl_pci_acp6x snd_seq snd_acp_pci btintel hid_sensor_trigger snd_seq_device hid_sensor_iio_common btbcm snd_acp_legacy_common libarc4 btmtk
[34344.931150] snd_pci_acp6x kvm snd_pcm bluetooth cfg80211 industrialio_triggered_buffer irqbypass kfifo_buf cros_ec_lpcs snd_timer snd_pci_acp5x snd_rn_pci_acp3x industrialio wmi_bmof cros_ec snd rapl amd_pmf pcspkr snd_acp_config thunderbolt snd_soc_acpi amdtee soundcore i2c_piix4 snd_pci_acp3x k10temp rfkill amd_sfh tee platform_profile amd_pmc joydev loop nfnetlink zram dm_crypt amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched crct10dif_pclmul nvme crc32_pclmul drm_suballoc_helper crc32c_intel drm_buddy polyval_clmulni polyval_generic nvme_core drm_display_helper video hid_multitouch ucsi_acpi ghash_clmulni_intel hid_sensor_hub sha512_ssse3 sha256_ssse3 typec_ucsi sha1_ssse3 sp5100_tco ccp cec typec nvme_auth wmi i2c_hid_acpi i2c_hid serio_raw ip6_tables ip_tables fuse
[34344.931200] CPU: 8 PID: 43686 Comm: kworker/u32:69 Not tainted 6.8.9-300.fc40.x86_64 #1
[34344.931203] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.05 03/29/2024
[34344.931204] Workqueue: events_unbound async_run_entry_fn
[34344.931209] RIP: 0010:dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931435] Code: 48 21 c8 48 c1 e2 38 48 09 d0 48 89 85 98 02 00 00 f6 85 c4 02 00 00 02 74 42 e8 7a eb ff ff 84 c0 75 39 48 8b 85 d8 01 00 00 <0f> 0b c6 85 9c 02 00 00 80 48 8b 40 10 48 8b 30 48 85 f6 74 04 48
[34344.931437] RSP: 0018:ffffb8ae0ac67bb0 EFLAGS: 00010246
[34344.931440] RAX: ffff8cf8818e0800 RBX: 00000000ffffffff RCX: 00ffffffffffffff
[34344.931441] RDX: 0000000000000007 RSI: ffffb8ae0ac67bb0 RDI: 0000000000000000
[34344.931443] RBP: ffff8cf88773f000 R08: ffff8cf880f60d20 R09: 00000000000f0000
[34344.931444] R10: 0000000000000000 R11: ffff8d0ec1e21780 R12: ffff8cf88773f000
[34344.931445] R13: ffff8cf89a940000 R14: ffff8cf89a940018 R15: ffffb8ae0ac67be7
[34344.931446] FS: 0000000000000000(0000) GS:ffff8d0ec1e00000(0000) knlGS:0000000000000000
[34344.931447] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34344.931449] CR2: 00007fbccced6146 CR3: 00000009f0428000 CR4: 0000000000f50ef0
[34344.931450] PKRU: 55555554
[34344.931451] Call Trace:
[34344.931454] <TASK>
[34344.931456] ? dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931656] ? __warn+0x81/0x130
[34344.931661] ? dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.931853] ? report_bug+0x16f/0x1a0
[34344.931858] ? handle_bug+0x3c/0x80
[34344.931860] ? exc_invalid_op+0x17/0x70
[34344.931862] ? asm_exc_invalid_op+0x1a/0x20
[34344.931868] ? dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34344.932054] link_blank_all_dp_displays+0x9b/0x1a0 [amdgpu]
[34344.932259] dcn31_init_hw+0x1e0/0x990 [amdgpu]
[34344.932476] dc_set_power_state+0x67/0xb0 [amdgpu]
[34344.932663] dm_resume+0x10f/0xb00 [amdgpu]
[34344.932806] ? srso_alias_return_thunk+0x5/0xfbef5
[34344.932808] ? _dev_info+0x77/0xa0
[34344.932811] amdgpu_device_ip_resume_phase2+0xa0/0x1d0 [amdgpu]
[34344.932903] amdgpu_device_resume+0xa0/0x2c0 [amdgpu]
[34344.932995] ? __pfx_pci_pm_resume+0x10/0x10
[34344.932998] amdgpu_pmops_resume+0x4a/0x80 [amdgpu]
[34344.933088] ? __pfx_pci_pm_resume+0x10/0x10
[34344.933089] dpm_run_callback+0x89/0x1e0
[34344.933092] device_resume+0xb3/0x300
[34344.933094] async_resume+0x1d/0x30
[34344.933095] async_run_entry_fn+0x31/0x130
[34344.933097] process_one_work+0x16f/0x330
[34344.933100] worker_thread+0x273/0x3c0
[34344.933102] ? __pfx_worker_thread+0x10/0x10
[34344.933104] kthread+0xe5/0x120
[34344.933106] ? __pfx_kthread+0x10/0x10
[34344.933107] ret_from_fork+0x31/0x50
[34344.933109] ? __pfx_kthread+0x10/0x10
[34344.933111] ret_from_fork_asm+0x1b/0x30
[34344.933114] </TASK>
[34344.933115] ---[ end trace 0000000000000000 ]---
[34345.059691] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[34345.060598] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[34345.060896] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[34345.060899] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[34345.060901] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[34345.060903] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[34345.060905] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[34345.060906] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[34345.060908] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[34345.060909] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[34345.060911] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[34345.060913] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[34345.060914] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[34345.060916] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[34345.060917] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[34345.066058] [drm] ring gfx_32792.1.1 was added
[34345.066647] [drm] ring compute_32792.2.2 was added
[34345.067178] [drm] ring sdma_32792.3.3 was added
[34345.067204] [drm] ring gfx_32792.1.1 ib test pass
[34345.067231] [drm] ring compute_32792.2.2 ib test pass
[34345.067343] [drm] ring sdma_32792.3.3 ib test pass
[34345.086065] usb 1-1: reset high-speed USB device number 6 using xhci_hcd
[34345.492127] [drm:retrieve_link_cap [amdgpu]] *ERROR* retrieve_link_cap: Read receiver caps dpcd data failed.
[34345.566335] usb 1-1.3: reset full-speed USB device number 7 using xhci_hcd
[34345.726066] usb 1-1.4: reset full-speed USB device number 8 using xhci_hcd
[34345.848498] PM: resume devices took 0.956 seconds
[34345.848729] OOM killer enabled.
[34345.848730] Restarting tasks ... done.
[34345.852495] random: crng reseeded on system resumption
[34345.853297] PM: suspend exit
[34346.118442] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[34346.391384] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[34351.490123] wlp1s0: authenticate with 84:17:ef:71:4b:a2 (local address=1a:4a:16:51:e0:bc)
[34351.501826] wlp1s0: send auth to 84:17:ef:71:4b:a2 (try 1/3)
[34351.504435] wlp1s0: authenticated
[34351.506251] wlp1s0: associate with 84:17:ef:71:4b:a2 (try 1/3)
[34351.518292] wlp1s0: RX AssocResp from 84:17:ef:71:4b:a2 (capab=0x1511 status=0 aid=6)
[34351.542294] wlp1s0: associated
[34351.654285] wlp1s0: Limiting TX power to 30 (30 - 0) dBm as advertised by 84:17:ef:71:4b:a2
[35761.095708] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[35761.358131] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
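With a log this long it is easy to miss the handful of amdgpu lines among the USB and Wi-Fi noise. The sketch below tallies `*ERROR*` lines and `WARNING:` splats by the function that emitted them. The two regular expressions are assumptions based only on the line formats visible in the log above, so other kernels or drivers may need different patterns.

```python
import re
from collections import Counter

# "[drm:dp_set_fec_ready [amdgpu]] *ERROR* ..." -> captures the function name,
# ignoring compiler suffixes such as ".constprop.0".
DRM_ERROR_RE = re.compile(r"\[drm:(\w+)(?:\.[\w.]+)? \[amdgpu\]\] \*ERROR\*")
# "WARNING: CPU: 8 PID: 43686 at <file>:<line> <func>+0x..." -> captures <func>.
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at \S+ (\w+)\+0x")


def tally_amdgpu_problems(log_text):
    """Count amdgpu errors and kernel warnings per emitting function."""
    counts = Counter()
    for line in log_text.splitlines():
        m = DRM_ERROR_RE.search(line)
        if m:
            counts[("ERROR", m.group(1))] += 1
            continue
        m = WARN_RE.search(line)
        if m:
            counts[("WARNING", m.group(1))] += 1
    return counts


# Lines copied from the log above.
SAMPLE = """\
[30649.271188] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[30649.834821] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[30650.458181] [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
[34344.930834] WARNING: CPU: 8 PID: 43686 at drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1532 dp_retrieve_lttpr_cap+0x121/0x1e0 [amdgpu]
[34345.492127] [drm:retrieve_link_cap [amdgpu]] *ERROR* retrieve_link_cap: Read receiver caps dpcd data failed.
[34346.118442] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
"""

for (kind, func), n in tally_amdgpu_problems(SAMPLE).most_common():
    print(f"{n:3d}  {kind:7s}  {func}")
```

In this excerpt every hit is in DisplayPort link code (FEC setup, LTTPR and receiver capability reads), which is worth noting when reporting the issue.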
I have the same issue, but only when doing GPU-intensive tasks at larger resolutions. Simple web browsing and watching videos work just fine on my external 4K@144Hz monitor.
But unlike your example with the glitches, my screen just stays off and everything hangs forever. I assume the hang is caused by a different bug related to my dock, which I’ve mentioned in another thread. All of this happens with and without gaming mode in the BIOS.
[ 1788.822528] gmc_v11_0_process_interrupt: 55 callbacks suppressed
[ 1788.822545] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
[ 1788.822558] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x0000aaab42697000 from client 10
[ 1788.822564] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[ 1788.822568] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[ 1788.822572] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x1
[ 1788.822576] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
[ 1788.822579] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 1788.822582] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 1788.822585] amdgpu 0000:c1:00.0: amdgpu: RW: 0x0
[ 1788.822599] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
[ 1788.822606] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x000000003f800000 from client 10
[ 1788.822610] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1788.822614] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 1788.822617] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1788.822620] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
[ 1788.822623] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 1788.822626] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 1788.822629] amdgpu 0000:c1:00.0: amdgpu: RW: 0x0
[ 1799.255307] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=396987, emitted seq=396989
[ 1799.255551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369
[ 1799.255832] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 1799.562697] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1799.563172] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1799.719984] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1799.720449] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1799.876983] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1799.877422] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.033315] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.033760] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.189975] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.190420] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.346518] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.346954] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.503116] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.503559] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.657708] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.658401] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1800.805842] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1800.806540] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1801.073811] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 1801.075635] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 1801.108338] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1801.108964] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[ 1801.109188] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 1801.111200] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 1801.113332] [drm] DMUB hardware initialized: version=0x08003700
[ 1806.235965] thunderbolt 0000:c3:00.6: 0:4 <-> 2:16 (USB3): failed to calculate available bandwidth
[ 1811.534132] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1821.773281] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1832.013311] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1842.253284] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1852.494275] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1855.512089] usb 7-1: USB disconnect, device number 2
[ 1855.512096] usb 7-1.1: USB disconnect, device number 4
[ 1855.512100] usb 7-1.1.1: USB disconnect, device number 6
[ 1855.512102] usb 7-1.1.1.3: USB disconnect, device number 8
[ 1855.513557] pcieport 0000:00:04.1: pciehp: Slot(0-1): Link Down
[ 1855.513563] pcieport 0000:00:04.1: pciehp: Slot(0-1): Card not present
[ 1855.513628] pcieport 0000:00:04.1: PME: Spurious native interrupt!
[ 1855.513642] pcieport 0000:00:04.1: PME: Spurious native interrupt!
[ 1855.513847] igc 0000:c0:00.0 enp192s0: PHC removed
[ 1855.513979] igc 0000:c0:00.0 enp192s0: PCIe link lost, device now detached
[ 1855.517195] thunderbolt 1-2: device disconnected
[ 1855.585544] pcieport 0000:63:03.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.587006] pcieport 0000:63:03.0: Runtime PM usage count underflow!
[ 1855.587039] pcieport 0000:63:02.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.587569] pcieport 0000:63:02.0: Runtime PM usage count underflow!
[ 1855.587589] pcieport 0000:63:01.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.588078] pcieport 0000:63:01.0: Runtime PM usage count underflow!
[ 1855.588095] pcieport 0000:63:00.0: Unable to change power state from D3hot to D0, device inaccessible
[ 1855.588383] pci_bus 0000:64: busn_res: [bus 64] is released
[ 1855.589672] pci_bus 0000:65: busn_res: [bus 65-83] is released
[ 1855.590222] pci_bus 0000:84: busn_res: [bus 84-a2] is released
[ 1855.590704] pci_bus 0000:a3: busn_res: [bus a3-bf] is released
[ 1855.591129] pci_bus 0000:c0: busn_res: [bus c0] is released
[ 1855.591350] pci_bus 0000:63: busn_res: [bus 63-c0] is released
[ 1855.773422] usb 8-1: USB disconnect, device number 2
[ 1855.773433] usb 8-1.4: USB disconnect, device number 3
[ 1855.773437] usb 8-1.4.1: USB disconnect, device number 4
[ 1855.781009] usb 8-1.4.2: USB disconnect, device number 5
[ 1855.876724] usb 7-1.1.2: USB disconnect, device number 7
[ 1855.876732] usb 7-1.1.2.2: USB disconnect, device number 9
[ 1856.085155] usb 7-1.1.2.5: USB disconnect, device number 10
[ 1856.181073] usb 7-1.1.5: USB disconnect, device number 5
[ 1856.244964] usb 7-1.3: USB disconnect, device number 3
[ 1862.733333] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1872.973294] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1873.613289] pcieport 0000:00:08.3: PME: Spurious native interrupt!
[ 1883.213307] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1893.453312] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1903.693571] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1913.933455] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1924.173296] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1934.413285] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1944.653303] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
[ 1954.893484] [drm:amdgpu_dm_process_dmub_aux_transfer_sync [amdgpu]] *ERROR* wait_for_completion_timeout timeout!
I brought back the amdgpu.sg_display=0 workaround and have had a more stable experience. I have seen only one momentary blackout and one instance of corruption, which covered the full screen of the 4K monitor after a reboot for installing updates.
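When comparing runs with and without the workaround, it helps to confirm that the parameter really is on the running kernel's command line. A small sketch, assuming a simple whitespace-separated command line (quoted values containing spaces are not handled); the helper name `kernel_param` is made up for this example.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A bare flag without "=" yields an empty string. If the parameter appears
    more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the currently running kernel.
    with open("/proc/cmdline") as fh:
        current = fh.read()
    print("amdgpu.sg_display =", kernel_param(current, "amdgpu.sg_display"))
```

A result of `None` means the workaround is not active for this boot. On Fedora the parameter is usually made persistent with something like `sudo grubby --update-kernel=ALL --args="amdgpu.sg_display=0"`, though the original post does not say how it was set.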
My flickering is resolved; it was caused by a faulty display kit. One issue fewer, now let’s see how long it takes to fix the crash/hang.
AMD GPU Driver Crash
Summary: I found a way to reliably reproduce the crash/hang and wrote a guide at the bottom of this post.
Background
When running specific GPU loads, my system sometimes crashes. I managed to capture the kernel log, and every single time I get this page fault:
amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32772, for process qemu-system-x86 pid 3344 thread qemu-syste:cs0 pid 3369)
amdgpu: in page starting at address 0x000000003f800000 from client 10
amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
amdgpu: MORE_FAULTS: 0x0
amdgpu: WALKER_ERROR: 0x0
amdgpu: PERMISSION_FAULTS: 0x0
amdgpu: MAPPING_ERROR: 0x0
amdgpu: RW: 0x0
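The MORE_FAULTS, WALKER_ERROR and similar lines are just bit fields of the GCVM_L2_PROTECTION_FAULT_STATUS value. A hedged sketch of that decoding follows; the bit positions are an assumption on my part, chosen because they reproduce the values the driver printed for status 0x00201431 in the earlier log in this thread, and they may differ on other GPU generations.

```python
# Assumed layout of GCVM_L2_PROTECTION_FAULT_STATUS: name -> (shift, width).
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # the "Faulty UTCL2 client ID"
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a fault status register value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value taken from the first page fault in the earlier log.
    for name, value in decode_fault_status(0x00201431).items():
        print(f"{name}: {value:#x}")
```

Under this layout an all-zero status, as in the fault quoted above, decodes to every field being 0, so the second fault carries no extra detail beyond the faulting address.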
This issue occurs frequently when running 3D workloads inside a QEMU guest VM. I was not able to reproduce it by running the same workload outside the VM. But this is not a QEMU bug, because the crash itself happens on the host system, and guests should never be able to crash their host.
System
Framework 13
AMD Ryzen 7840U
64 GB Memory
BIOS: 03.05
| | Configuration 1 | Configuration 2 |
|---|---|---|
| Host | Fedora 40 | Ubuntu 24.04 LTS (Live Environment) |
| Host Kernel | 6.8.10-300.fc40, 6.8.11-300.fc40 | 6.8.?? (not completely sure) |
| Guest | Alpine 3.19.1 | Fedora 39 |
| Crash Behavior | Screen(s) turn black forever until poweroff is forced | Full system freeze for 10 - 20 seconds. Then it normalizes, until the still running workload triggers it again |
It makes no difference whether I run on battery or with the charger. Disabling/enabling amdgpu.sg_display=0 and/or GamingMode in the BIOS makes no difference either.
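When toggling the workaround it is easy to lose track of whether the parameter is actually active on the running kernel. A small sketch, assuming the usual /proc/cmdline location on Linux; the helper names `kernel_param` and `read_cmdline` are made up for illustration:

```python
def kernel_param(cmdline, name):
    """Return the value of `name` from a kernel command line string.

    Returns None if the parameter is absent and "" if it appears
    without "=value". If it is repeated, the last occurrence is kept.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the affected machine, `kernel_param(read_cmdline(), "amdgpu.sg_display")` should give "0" with the workaround in place and None without it.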
Guide To Reproduce The Issue
Time required: ~15 Minutes
Example OS: Ubuntu 24.04 LTS (Live Environment)
1. Set up a VM with 3D acceleration
- Install gnome-boxes through the software app
- Open gnome-boxes and click the button in the top-left corner to download an OS
- Select Fedora 39 and do a full installation. Just testing in the live image itself is not enough
- Reboot Fedora 39. Gnome-boxes may not autostart the VM, but you can start it by double-clicking on the VM’s icon. Gnome-boxes sometimes crashes here, but just start it again and retry
- Gnome-boxes always boots into the live-image/installer, not the system which you just installed. So you have to interrupt GRUB and move down to Troubleshooting. There you can select the first partition. Note: You have to do this on every VM reboot to prevent starting the live-image/installer again
- Update Fedora 39 through the software center. I don’t know if this is required to reproduce the crash, but I did it anyway. Note that you don’t need a full upgrade to Fedora 40, just the basic updates are enough. Then reboot if the updater tells you to
- Shut down the VM
- Gnome-boxes has a 3d acceleration setting in the VM’s properties, but unfortunately this does not work. Therefore we will boot our freshly installed Fedora 39 image with Virt-Manager: sudo apt install virt-manager
- Start Virt-Manager. If it complains about not connecting to Qemu’s system session, ignore it. Create a new session and select “user session”
- In the menu where you can create a new VM, there is an option to import an existing VM image. Select it. The image should be located somewhere in ~/snap/gnome-boxes/...
- It asks you for an OS name. Enter Fedora 39
- Select “configure VM after creation” (or something like that)
- You now need to enable several GPU-related settings. I don’t know which order is the right one, but Virt-Manager will tell you. Settings to enable:
  - Memory
    - Enable shared memory: enabled
  - Video Virtio
    - Model: Virtio
    - 3d acceleration: enabled
  - Display Spice
    - Listen type: None
    - OpenGL: enabled
- Press the button on the top left of this window which says something like “complete installation”
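To double-check that the VM really ended up with the GPU settings listed in step 1, you can inspect its libvirt domain XML (for a user-session VM, `virsh --connect qemu:///session dumpxml <name>` should print it). The sketch below is only my assumption of how those Virt-Manager options appear in the XML (`memoryBacking/access`, `video/model/acceleration`, `graphics/listen`, `graphics/gl`); the function name `check_3d_settings` and the sample document are illustrative, not taken from the original setup.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of the relevant parts of a domain definition.
SAMPLE_XML = """
<domain type='kvm'>
  <memoryBacking>
    <source type='memfd'/>
    <access mode='shared'/>
  </memoryBacking>
  <devices>
    <graphics type='spice'>
      <listen type='none'/>
      <gl enable='yes'/>
    </graphics>
    <video>
      <model type='virtio' heads='1' primary='yes'>
        <acceleration accel3d='yes'/>
      </model>
    </video>
  </devices>
</domain>
"""


def check_3d_settings(domain_xml):
    """Report which of the four settings from the guide are present."""
    root = ET.fromstring(domain_xml)
    access = root.find("./memoryBacking/access")
    accel = root.find("./devices/video/model[@type='virtio']/acceleration")
    spice = root.find("./devices/graphics[@type='spice']")
    listen = spice.find("listen") if spice is not None else None
    gl = spice.find("gl") if spice is not None else None
    return {
        "shared_memory": access is not None and access.get("mode") == "shared",
        "virtio_3d": accel is not None and accel.get("accel3d") == "yes",
        "spice_listen_none": listen is not None and listen.get("type") == "none",
        "spice_opengl": gl is not None and gl.get("enable") == "yes",
    }
```

If any entry comes back False, the guest is probably rendering in software, in which case the benchmark may still run but is unlikely to exercise the host GPU.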
2. Trigger the crash
- Boot your VM and open Firefox
- Search for basemark webgl and it will send you to this page: https://web.basemark.com/
- Run the benchmark. The first 6 tests will pass, but test 7/20 is where the crash happens
Note: When I tested this on two different Linux distributions, there were some differences in behaviour. On the Fedora 40 host the crash happens within seconds. On the Ubuntu Live host it took around a minute.
@Alex_H @Sh_Ra Have you both reported the issue upstream to drm/amd?
I’ve run into the same page fault issue, but not with a VM workload, and I’m on BIOS 3.05.
EDIT: No graphical glitches on my side, however; I haven’t used an external monitor with a resolution higher than 1920x1080
No, and I don’t really care who and which issue trackers/chip suppliers/whatever should be informed here. I’m just a basic end-user reporting Framework issues back to Framework
My reproduction guide uses only the internal monitor (2256x1504). Just make sure to maximize the VM window and the browser inside it.
Who is responsible for this issue? It prevents me from running certain workloads on fullscreen.
I don’t expect it to get fixed today, I just want it to be seen.
Page faults like that are usually user-mode driver (i.e. Mesa) bugs. You should try to reproduce it using the latest Mesa and, if you can still reproduce it, report it there.
I’m unable to reproduce the glitch on my end following the steps…
I ran the benchmark on host and inside the vm, still nothing.
BIOS 3.05, Pop!_OS, kernel 6.8.0, with 1 external monitor
- Are you sure the VM had GPU access? Without it the benchmark still runs on my machine, but doesn’t trigger a crash
- Have you maximized the VM window and its containing browser window?
I used these settings in virt-manager,
maximized the VM window and its containing browser window, yes
This issue is no longer reproducible, no matter what I try. Guys, 4k fullscreen Apps are here now. Truly the year of the Linux desktop.
Kernel: 6.9.11-200.fc40.x86_64
Glad to hear it @Alex_H ! My partner is also on a Framework 13 AMD, running Fedora 40, and says he occasionally finds his system suddenly slows to a crawl (1 frame rendered about every 3 seconds) before hanging completely after about 20 seconds. He does game development so I’m imagining it could be a GPU thing for him too (I watched it happen today while he had a game dev program - Godot - open)
I’ll tell him to make sure his kernel is up-to-date. Just to confirm, did you have to update the BIOS or anything like that as well? If so, does that happen automatically via the Software app or does he need to do anything to trigger it? He hasn’t done anything manual to update the BIOS since he bought the machine 6 months ago.
I don’t know which of the countless updates did it or when it was fixed. I just discovered it a few days ago. But yes, I’m on the latest everything and recall upgrading my BIOS a few months ago. It is just a simple fwupdmgr command and a reboot; there are guides out there. That said, my problem could be different from the one you described. It won’t help you here, but check dmesg for logs.
I think a new post should be made for new issues. I’m still encountering issues on kernel 6.10.3 with the amdgpu driver (BIOS 3.0.5), but they are very difficult to reproduce:
[61942.825121] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32786)
[61942.825132] amdgpu 0000:c1:00.0: amdgpu: in process .kitty-wrapped pid 3017 thread kitty:cs0 pid 3018)
[61942.825137] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x0000f5ae3ed33000 from client 10
[61942.825140] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701430
[61942.825143] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[61942.825146] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x0
[61942.825148] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
[61942.825150] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[61942.825153] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[61942.825155] amdgpu 0000:c1:00.0: amdgpu: RW: 0x0
[61952.994551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
EDIT: However, this does not cause any graphical corruption. It just crashes my GPU-accelerated terminal; the desktop freezes for a bit and then unfreezes.
This is very likely a Mesa bug; it’s not the same as the previous one.
Got the same issue here. I was playing 1080p video in Firefox and then the desktop froze and crashed
There is another thread on the Framework website about this happening with 6.10 but not with 6.9. Does that match your behavior?
You should assist with bisecting if so.