[TRACKING] Graphical corruption in Fedora 39 (AMD 3.03 BIOS)

Adding some data points, looks like I might be the first one to provide a kernel crash trace? (see logs below)

system data:

  • Ryzen 7 7840U
  • 16GiB RAM
  • NixOS 23.11
  • kernel 6.7.0
  • KDE Plasma 5.27.10
  • using X11, not wayland
  • DMUB hardware initialized: version=0x08002A00
  • no workarounds applied yet:
    • not in UMA game mode
    • not applied amdgpu.sg_display=0 yet

when the bug appeared: Unfortunately I have no reliable reproducer yet; I'll keep experimenting with the workarounds disabled until I find one.
I’ve encountered the issue in 2 situations so far:

  • 2 times when resuming from a suspend-to-ram state with lid closed
  • 1 time during normal operations, when launching 2 hardware-accelerated video streams at once (in mpv)

@Matt_Hartley So amdgpu.sg_display=0 appears to be the preferred workaround so far. But according to the Phoronix article “AMD Re-Enables Scatter/Gather Support For All APUs On Linux”, AMD themselves consider scatter/gather an important feature. So if we need to keep this disabled long-term, what should we expect? What exactly does scatter/gather even do?

system log (including a kernel trace)

Due to restrictions in post length, I need to cut parts away from my system log. Here’s the part with the kernel crash, the complete log can be found here.

Jan 15 02:59:08 framenix systemd[1]: Starting Pre-Sleep Actions...
Jan 15 02:59:08 framenix systemd[1]: pre-sleep.service: Deactivated successfully.
Jan 15 02:59:08 framenix systemd[1]: Finished Pre-Sleep Actions.
Jan 15 02:59:08 framenix systemd[1]: Reached target Sleep.
Jan 15 02:59:08 framenix systemd[1]: Starting System Suspend...
Jan 15 02:59:08 framenix systemd-sleep[21564]: Entering sleep state 'suspend'...
Jan 15 02:59:08 framenix kernel: PM: suspend entry (s2idle)
Jan 15 02:59:08 framenix kernel: Filesystems sync: 0.005 seconds
Jan 16 20:40:26 framenix kernel: Freezing user space processes
Jan 16 20:40:26 framenix kernel: Freezing user space processes completed (elapsed 0.016 seconds)
Jan 16 20:40:26 framenix kernel: OOM killer disabled.
Jan 16 20:40:26 framenix kernel: Freezing remaining freezable tasks
Jan 16 20:40:26 framenix kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
Jan 16 20:40:26 framenix kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Jan 16 20:40:26 framenix kernel: queueing ieee80211 work while going to suspend
Jan 16 20:40:26 framenix kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jan 16 20:40:26 framenix kernel: ACPI: EC: interrupt blocked
Jan 16 20:40:26 framenix kernel: ACPI: EC: interrupt unblocked
Jan 16 20:40:26 framenix kernel: nvme nvme0: 16/0/0 default/read/poll queues
Jan 16 20:40:26 framenix kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jan 16 20:40:26 framenix kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jan 16 20:40:26 framenix kernel: [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
Jan 16 20:40:26 framenix kernel: ------------[ cut here ]------------
Jan 16 20:40:26 framenix kernel: WARNING: CPU: 10 PID: 21581 at drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1526 dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel: Modules linked in: usbhid sd_mod uas usb_storage scsi_mod r8153_ecm scsi_common ccm qrtr rfcomm af_packet cmac algif_hash algif_skcipher af_alg bnep mt7921e mt7921_common mt792x_lib mt76_connac_lib cdc_mbim cdc_wdm mt76 cdc_ncm cdc_ether usbnet mac80211 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_realtek snd_sof_utils snd_hda_codec_generic snd_soc_core ledtrig_audio snd_hda_codec_hdmi btusb btrtl snd_compress ac97_bus btintel hid_sensor_als snd_pcm_dmaengine hid_sensor_trigger snd_hda_intel btbcm mousedev snd_pci_ps industrialio_triggered_buffer btmtk kfifo_buf snd_rpl_pci_acp6x snd_intel_dspcfg snd_intel_sdw_acpi hid_sensor_iio_common snd_acp_pci bluetooth snd_hda_codec snd_acp_legacy_common cfg80211 industrialio snd_pci_acp6x edac_mce_amd snd_hda_core snd_pci_acp5x nls_iso8859_1 snd_hwdep intel_rapl_msr edac_core xt_conntrack snd_rn_pci_acp3x sp5100_tco nls_cp437 snd_pcm nf_conntrack intel_rapl_common
Jan 16 20:40:26 framenix kernel:  snd_acp_config ucsi_acpi ecdh_generic watchdog crc32_pclmul snd_soc_acpi typec_ucsi hid_multitouch snd_timer vfat hid_sensor_hub rfkill polyval_clmulni cros_ec_lpcs nf_defrag_ipv6 polyval_generic joydev fat hid_generic r8152 gf128mul ecc ghash_clmulni_intel crc16 mii cros_ec snd typec rapl tiny_power_button k10temp soundcore snd_pci_acp3x i2c_piix4 libarc4 battery nf_defrag_ipv4 tpm_crb thermal ac roles i2c_hid_acpi button i2c_hid tpm_tis amd_pmf hid tpm_tis_core platform_profile amd_pmc ip6t_rpfilter evdev mac_hid ipt_rpfilter serio_raw xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat nf_tables nfnetlink sch_fq_codel ctr loop tun tap macvlan bridge stp llc vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp kvm irqbypass fuse efi_pstore configfs zstd zram efivarfs dmi_sysfs ip_tables x_tables autofs4 dm_crypt aes_generic cbc encrypted_keys trusted asn1_encoder tee tpm rng_core xhci_pci xhci_pci_renesas input_leds xhci_hcd led_class nvme sha512_ssse3 sha512_generic atkbd sha256_ssse3 sha1_ssse3 libps2
Jan 16 20:40:26 framenix kernel:  vivaldi_fmap nvme_core thunderbolt usbcore aesni_intel t10_pi libaes crypto_simd crc64_rocksoft cryptd crc64 i8042 crc_t10dif crct10dif_generic usb_common crct10dif_pclmul crct10dif_common serio rtc_cmos dm_mod dax btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq amdgpu i2c_algo_bit drm_ttm_helper ttm agpgart video wmi drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper drm backlight firmware_class
Jan 16 20:40:26 framenix kernel: CPU: 10 PID: 21581 Comm: kworker/u32:26 Tainted: G           O       6.7.0 #1-NixOS
Jan 16 20:40:26 framenix kernel: Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.03 10/17/2023
Jan 16 20:40:26 framenix kernel: Workqueue: events_unbound async_run_entry_fn
Jan 16 20:40:26 framenix kernel: RIP: 0010:dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel: Code: 21 c8 48 c1 e2 38 48 09 d0 48 89 85 98 02 00 00 f6 85 c4 02 00 00 02 74 44 e8 8a ed ff ff 84 c0 75 3b 48 8b 85 d8 01 00 00 90 <0f> 0b 90 c6 85 9c 02 00 00 80 48 8b 40 10 48 8b 30 48 85 f6 74 04
Jan 16 20:40:26 framenix kernel: RSP: 0018:ffffba7e489bbbf0 EFLAGS: 00010246
Jan 16 20:40:26 framenix kernel: RAX: ffffa2a8cb457200 RBX: 00000000ffffffff RCX: 00ffffffffffffff
Jan 16 20:40:26 framenix kernel: RDX: 0000000000000000 RSI: ffffba7e489bbbf0 RDI: 0000000000000000
Jan 16 20:40:26 framenix kernel: RBP: ffffa2a8d009e800 R08: 0000000000000008 R09: 0000000000000000
Jan 16 20:40:26 framenix kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffffa2a8d00a2a00
Jan 16 20:40:26 framenix kernel: R13: ffffa2a8cbc31c70 R14: ffffa2a8d009a800 R15: 0000000000000009
Jan 16 20:40:26 framenix kernel: FS:  0000000000000000(0000) GS:ffffa2ac3df00000(0000) knlGS:0000000000000000
Jan 16 20:40:26 framenix kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 16 20:40:26 framenix kernel: CR2: 00007f7d48000b96 CR3: 0000000471a20000 CR4: 0000000000f50ef0
Jan 16 20:40:26 framenix kernel: PKRU: 55555554
Jan 16 20:40:26 framenix kernel: Call Trace:
Jan 16 20:40:26 framenix kernel:  <TASK>
Jan 16 20:40:26 framenix kernel:  ? dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? __warn+0x81/0x130
Jan 16 20:40:26 framenix kernel:  ? dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? report_bug+0x171/0x1a0
Jan 16 20:40:26 framenix kernel:  ? handle_bug+0x42/0x70
Jan 16 20:40:26 framenix kernel:  ? exc_invalid_op+0x17/0x70
Jan 16 20:40:26 framenix kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jan 16 20:40:26 framenix kernel:  ? dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  link_blank_all_dp_displays+0x56/0xd0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  dcn31_init_hw+0x1d4/0x840 [amdgpu]
Jan 16 20:40:26 framenix kernel:  dc_set_power_state+0x5e/0xa0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  dm_resume+0xfc/0x880 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 20:40:26 framenix kernel:  ? _dev_info+0x79/0xa0
Jan 16 20:40:26 framenix kernel:  amdgpu_device_ip_resume_phase2+0x4f/0xc0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  amdgpu_device_resume+0xa0/0x2c0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? __pfx_pci_pm_resume+0x10/0x10
Jan 16 20:40:26 framenix kernel:  amdgpu_pmops_resume+0x4a/0x80 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? __pfx_pci_pm_resume+0x10/0x10
Jan 16 20:40:26 framenix kernel:  dpm_run_callback+0x89/0x1b0
Jan 16 20:40:26 framenix kernel:  device_resume+0x88/0x190
Jan 16 20:40:26 framenix kernel:  async_resume+0x1e/0x60
Jan 16 20:40:26 framenix kernel:  async_run_entry_fn+0x31/0x130
Jan 16 20:40:26 framenix kernel:  process_one_work+0x173/0x340
Jan 16 20:40:26 framenix kernel:  worker_thread+0x27b/0x3a0
Jan 16 20:40:26 framenix kernel:  ? __pfx_worker_thread+0x10/0x10
Jan 16 20:40:26 framenix kernel:  kthread+0xd4/0x100
Jan 16 20:40:26 framenix kernel:  ? __pfx_kthread+0x10/0x10
Jan 16 20:40:26 framenix kernel:  ret_from_fork+0x31/0x50
Jan 16 20:40:26 framenix kernel:  ? __pfx_kthread+0x10/0x10
Jan 16 20:40:26 framenix kernel:  ret_from_fork_asm+0x1b/0x30
Jan 16 20:40:26 framenix kernel:  </TASK>
Jan 16 20:40:26 framenix kernel: ---[ end trace 0000000000000000 ]---
Jan 16 20:40:26 framenix kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
Jan 16 20:40:26 framenix kernel: [drm] ring gfx_32803.1.1 was added
Jan 16 20:40:26 framenix kernel: [drm] ring compute_32803.2.2 was added
Jan 16 20:40:26 framenix kernel: [drm] ring sdma_32803.3.3 was added
Jan 16 20:40:26 framenix kernel: [drm] ring gfx_32803.1.1 ib test pass
Jan 16 20:40:26 framenix kernel: [drm] ring compute_32803.2.2 ib test pass
Jan 16 20:40:26 framenix kernel: [drm] ring sdma_32803.3.3 ib test pass
Jan 16 20:40:26 framenix kernel: ucsi_acpi USBC000:00: GET_CONNECTOR_STATUS failed (-5)
Jan 16 20:40:26 framenix kernel: OOM killer enabled.
Jan 16 20:40:26 framenix kernel: Restarting tasks ... 
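
For anyone who wants to attach a condensed version of a log like this to an upstream report, here is a minimal sketch that pulls out the WARNING location and the reliable stack frames. The function name `summarize_trace` and both regular expressions are illustrative assumptions, written only against the journal line format shown above; other kernel versions or log formats may need adjustments.

```python
import re

# Matches "WARNING: CPU: N PID: N at <file>:<line> <symbol>+0x.../0x..." lines.
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# Matches one stack frame. A leading "? " marks a frame the unwinder
# could not confirm, so it is captured separately.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unsure>\? )?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_trace(log_lines):
    """Return (warning_location, reliable_frames) for the first trace found."""
    location = None
    frames = []
    in_trace = False
    for line in log_lines:
        warning = WARNING_RE.search(line)
        if warning and location is None:
            location = (warning["file"], int(warning["line"]), warning["symbol"])
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace and ("</TASK>" in line or "end trace" in line):
            break
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame and not frame["unsure"]:
                frames.append(frame["symbol"])
    return location, frames
```

Fed the log above, this should report the warning at link_dp_capability.c:1526 in dp_retrieve_lttpr_cap, with the reliable frames starting at link_blank_all_dp_displays, dcn31_init_hw, dc_set_power_state and dm_resume, i.e. the warning fires on the amdgpu resume path.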

On reporting this upstream: Is the kernel bugzilla or freedesktop/drm the proper place for such bug reports?

Great idea, but I lack the cycles to dedicate to it personally. Additionally, we are seeing updates and changes coming through for AMDGPU and related very rapidly. So anything I tried would likely spend most of its time out of date as we continue pushing through.

If you feel comfortable doing so, I would be happy to take a post of your creation and make it a wiki (like we have done with Debian threads and similar in the community/maintained by the community).

Remember, Linux support is a team of 2 people. So we are laser focused on two distros and making sure the duplicable behavior we see in testing is tracked and validated. Community findings and discoveries are very welcome and encouraged.

Yes, scatter/gather support is important for performance. I’m no expert, but I think it means that DMA can be done to multiple buffers that are not in contiguous memory. I think that explains why I’m only seeing some full screen buffers being corrupted. Notice how there are many suppressed IOMMU errors in your full log. Those errors go away when I disable scatter/gather.

Please do report it upstream. I have meant to do it but would be happy if someone else does the work :smile: I’m pretty sure the right place to do it is the freedesktop.org bug tracker since the same issue was closed for people who had 64 GB. That workaround clearly didn’t solve the problem, and people with less RAM are also affected.

UPDATE: I forgot to add that the kernel source code literally says to please report the issue if you have to disable scatter/gather.

I use Fedora 39 and have 32 GB RAM. I’ve made no changes, and the out of box experience is pretty bad. I can reproduce the problem in about 5-10 minutes using Firefox and a few 3D accelerated games from Steam. It only appears to dramatically affect the programs when they are in full screen, but it also makes my machine a lot more crash prone in general (maybe a memory corruption?), and makes resuming less stable. I’ve made a video showing the issue on my machine.

Maybe @Mario_Limonciello can add more information on what scatter/gather does performance wise and so on.

Please @spiollinux report your issue upstream on freedesktop AMDGPU repository, there might also be an actual issue already open.

Is setting UMA_Game_OPTIMIZED the best option right now??
How are UMA_GAME_OPTIMIZED and UMA_Auto different?

edit: also facing this issue on a random basis…

RAM is shared between CPU and GPU on this platform, dynamically that is - so whoever needs some more can allocate it. The UMA setting controls how much of that RAM is assigned to the GPU by default.

The main goal of that setting is to work around some games checking for dedicated GPU RAM and complaining that it’s not enough.

However, since the platform is very new, there still seems to be a bug (or a few) that creates problems, especially with suspend. Opting for UMA_GAME_OPTIMIZED will circumvent these issues by allocating more dedicated RAM to the GPU directly. The downside is that less RAM is available for the rest of the system.

I expect that this will be solved in the future via software (OS) Updates.

Cheers
/herodot

thanks for the detailed explanation!!

I expect that this will be solved in the future via software (OS) Updates.

I also hope that it will be fixed, maybe after applying newer kernels… since I don’t game, losing some RAM isn’t very nice lol (in my case roughly 3.8 GB loss)

Since you don’t game, I think your best option for now is turning off the scatter/gather feature. I have done it, and it makes the computer much more stable (and I can still game on it). I used grubby to set amdgpu.sg_display=0.
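
A minimal sketch for checking, after a reboot, that the parameter actually reached the kernel. It assumes the parameter was added with something like `sudo grubby --update-kernel=ALL --args="amdgpu.sg_display=0"` (the exact invocation is an assumption, the post above only says grubby was used). The helper names `kernel_param` and `sg_display_disabled` are made up for illustration, and treating the last occurrence as the effective one is also an assumption.

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name=value` on a kernel command line, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # keep scanning; assumption: the last occurrence wins
    return value


def sg_display_disabled(cmdline: str) -> bool:
    """True if the command line carries amdgpu.sg_display=0."""
    return kernel_param(cmdline, "amdgpu.sg_display") == "0"
```

On a running system the input would come from `/proc/cmdline`, for example `sg_display_disabled(open("/proc/cmdline").read())`.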

Looking at the AMD DMA (Direct Memory Access) documentation, scatter/gather appears, AFAIK, to enable a DMA ‘descriptor’ to be split across two different memory locations by putting each part into its own descriptor and having the descriptors reference each other.

The “Scatter” part refers to splitting a very large DMA Operation into several smaller descriptors linked together.

The “Gather” part refers to merging them together by looking for a special suffix on each of these operations (except the last operation which has a normal suffix).

The GPU framebuffer can only write to one location at a time; scatter/gather enables the GPU to “scatter” the larger operations among many of the DMA descriptors.

The DMA can be imagined like a large chain, where you can only add or remove one chain link at a time and the last chain link always points to the first chain link.
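
To make the idea above a bit more concrete, here is a toy model of a scatter/gather list. It is only a sketch of the general technique, not how amdgpu or any AMD DMA engine implements it: the `Descriptor` class, the `last` flag (standing in for the "special suffix" mentioned above), the 4096-byte chunk size and all function names are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE = 4096  # illustrative chunk size, not taken from any AMD documentation


@dataclass
class Descriptor:
    """One hypothetical scatter/gather entry."""
    address: int        # where this chunk starts in (simulated) memory
    length: int         # number of bytes in this chunk
    last: bool = False  # set only on the final entry of the chain


def build_sg_list(page_addresses: List[int], size: int) -> List[Descriptor]:
    """Describe a `size`-byte buffer using pages that need not be contiguous."""
    if size <= 0:
        raise ValueError("size must be positive")
    needed = -(-size // PAGE_SIZE)  # ceiling division
    if needed > len(page_addresses):
        raise MemoryError("not enough pages for the requested buffer")
    descriptors = []
    remaining = size
    for address in page_addresses[:needed]:
        length = min(PAGE_SIZE, remaining)
        descriptors.append(Descriptor(address, length))
        remaining -= length
    descriptors[-1].last = True
    return descriptors


def scatter(memory: bytearray, descriptors: List[Descriptor], data: bytes) -> None:
    """Write one logical buffer into the scattered chunks."""
    offset = 0
    for d in descriptors:
        memory[d.address:d.address + d.length] = data[offset:offset + d.length]
        offset += d.length


def gather(memory: bytearray, descriptors: List[Descriptor]) -> bytes:
    """Read the scattered chunks back as one logical buffer."""
    out = bytearray()
    for d in descriptors:
        out += memory[d.address:d.address + d.length]
        if d.last:
            break
    return bytes(out)
```

The point of the model: without scatter/gather the buffer would need one contiguous region of memory, while with it the same buffer can be assembled from whatever pages happen to be free.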

I am having graphical errors with this machine as well. The external monitor will completely turn white when Firefox or any other app is full screen. Sigh
If I unplug the monitor while it’s white, the main display immediately glitches out.

Hi there!

I experienced the same graphical glitching (in my case my monitor turned into some kind of strobe), but I noticed it was only when I had multiple things running:

  • Firefox with a bunch of tabs, including 1 Google Meet
  • Discord
  • Spotify
  • Steam
  • Terminals

As I was still setting up my laptop, I had worked on it just a little bit earlier and didn’t have a lot of programs running (or FF with 20+ tabs).

I did switch my iGPU to GAME_OPTIMIZED mode, which assigns it 4 GB vRAM, and then I started up all programs again. It hasn’t triggered (yet?), but I also wanted to check whether or not memory pressure in the iGPU can cause this. So I ran amdgpu_top (GitHub - Umio-Yasuno/amdgpu_top: Tool to display AMDGPU usage), and I think this answers it:

Assuming the ‘normal’ iGPU gets 1 GB of vRAM, my current software workload required a little more than 1 GB to run properly. Whether or not that’s desirable or a bug somewhere else, I don’t know :slight_smile:
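
For a quick check of the same numbers without amdgpu_top, here is a rough sketch that reads the memory counters the amdgpu driver exposes in sysfs (`mem_info_vram_used`, `mem_info_vram_total` and the GTT equivalents, all in bytes). The function names are made up for illustration, and the device directory is an assumption: it is often `/sys/class/drm/card0/device`, but the card index can differ between machines.

```python
from pathlib import Path


def memory_usage(device_dir):
    """Return {pool: (used_bytes, total_bytes)} for the VRAM and GTT pools."""
    device = Path(device_dir)
    usage = {}
    for pool in ("vram", "gtt"):
        used = int((device / f"mem_info_{pool}_used").read_text())
        total = int((device / f"mem_info_{pool}_total").read_text())
        usage[pool] = (used, total)
    return usage


def format_usage(usage):
    """Render the counters as human-readable MiB lines."""
    return [
        f"{pool}: {used / 2**20:.0f} / {total / 2**20:.0f} MiB"
        for pool, (used, total) in usage.items()
    ]
```

Something like `print("\n".join(format_usage(memory_usage("/sys/class/drm/card0/device"))))` would then show whether the workload is close to the dedicated VRAM limit and how much has spilled into GTT (system memory).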

UPDATE: so I still have the issue, but the occurrences have greatly diminished. The only situation in which I can still encounter this issue is resuming from hibernation (with my ‘usual’ stack of software running).
The solution so far has been very simple: just suspend (I use suspend-then-hibernate) and resume the laptop immediately.

Fedora just updated to kernel 6.7. Now I get the ‘Christmas lighting effect’ on my laptop screen

What’s the Christmas lighting effect? :nerd_face: seems a bit off-season? :smiley:

There is a bug in kernel 6.7:
Flickering coloured glitchyness on 780M Phoenix iGPU with 6.7.0 kernel on Xorg/plasma (#3097) · Issues · drm / amd · GitLab
Please try this patch linked in there:
https://gitlab.freedesktop.org/drm/amd/uploads/7961021a4cac7db04f50fb99ccaf5b14/0001-drm-buddy-Improve-alloc_range-error-handling-routine.patch

hahaha yeah. This is what I’m talking about: flickering and display corruption with linux-next-20230918 and mesa-23.1.{6,7} (#2859) · Issues · drm / amd · GitLab and [RESPONDED] Blocky artifacts on AMD Framework laptop 13 - #7 by Steven_Scheepers

thanks! Is this patch upstream yet?

Thanks for the heads-up then, I’ll wait with the update a bit (maybe not until Christmas though… :wink: )

It was just posted an hour ago to the bug. It’s not upstream, but you can test it and if it works you can report that back to the bug to help it land sooner.

This is really interesting, because I have the same problem on Arch even with the latest 6.7.3 kernel. It is worse in Wayland than with X11, and it happens on both the internal and the external monitor. But it only happens with the internal GPU; with an external one, the problem is gone. The “amdgpu.sg_display=0” option mostly fixed it for me but appears to have a substantial negative impact on performance.