[TRACKING] Graphical corruption in Fedora 39 (AMD 3.03 BIOS)

If I’m reading changes for the just-tagged linux-firmware version 20231211 right, the VP9 decoding firmware fix that was offered for testing (and does fix the issue to the extent I tested it) hasn’t been released/made it to linux-firmware yet. I assume it’d still have to be a change to the vcn_* files.

Can confirm that I experienced basically the same. F39 6.5/6.6 workstation with Sway. The issue would occur from time to time with grimshot’s area grab, returning from sleep to swaylock, and fullscreening Firefox. This was about a month ago; since then I’ve had scatter/gather disabled and I haven’t run across the issue at all. iGPU configuration is also set to UMA_GAME_OPTIMIZED.
Edit: I just went to disable UMA_GAME_OPTIMIZED (benchmarking some things and wanted to see the difference) and noticed it wasn’t enabled. I haven’t touched that in about a month so I’m not sure when it was disabled; it might have been disabled this entire time.

I have 64GB RAM (2x32GB).

I see some people in this thread report flickering on external monitors. For me that was never an issue. Only the internal screen flickers under CPU load while charging, no matter what I’m trying.

I had the same issue every other day when resuming my laptop from standby. The amdgpu.sg_display=0 fix didn’t work. Switching to UMA_GAME_OPTIMIZED did, however. I haven’t had the issue for over a week. I’m running Fedora 39 Silverblue.

Fedora 39 KDE Spin with 16GB RAM.
When I change settings such as “Screen Edges → Set something related to windows” or “Windows Management → Windows behavior”, the laptop’s own display flickers with a white screen (no external display connected).

The temporary workaround is to log out and log in.
Switching to UMA_GAME_OPTIMIZED solved the problem completely.

Just wanted to report that I am also experiencing this on my new laptop. Running Fedora 39 on 7040 AMD with 64 GB RAM. I have not yet tried UMA_GAME_OPTIMIZED.

Hmm, even with both (UMA and sg_display) set, I still have the problem from time to time
with sway and grimshot, but yeah, it has been reduced dramatically.

I too am getting this issue on Fedora 39 and its derivatives. I have disabled sg in my kargs after finding this post, however when I set UMA_GAME_OPTIMIZED, it keeps one of my TPM PCRs from unlocking, although I am unsure of which one (one of 2, 3, 4, 5, or 7), so I have it disabled for now.

FWIW I have not run into this issue on openSUSE Aeon, and considering this is a VRAM problem, it may be worth noting that openSUSE Aeon does not swap by default, while Fedora swaps to zram by default.

Edit: Forgot to mention specs: Ryzen 7840U, 16GB RAM.

I’ve been using 6.7.0-68.fc40.x86_64 and the issues still do not present. I think this will be resolved once this kernel trickles down into the main distros. :partying_face:

EDIT:
I am not using any special kernel command line arguments or BIOS settings for this; it all works out of the box.

I think the reports for amdgpu.sg_display=0 are just coincidental as that argument did not fix the issue for me on 6.6.x.

This still occurs with the 6.7 kernel (including the os-build target after release) intermittently.

Setting the reserved VRAM to 4GB in the BIOS (UMA_GAME_OPTIMIZED) reduces how often it occurs more than the amdgpu scatter/gather toggle does. It also helps to make sure you are using the most recent DCN/linux-firmware blobs from the amdgpu freedesktop git repo. Those actions combined almost completely eliminate it under light daily usage.

However, depending on what you do with your system, it’s still triggerable. I generally encounter it regardless, specifically after any sort of GPU-memory-intensive operation and/or RAM-intensive work (e.g. compiling kernels) following a resume from suspend.

Adding some data points, looks like I might be the first one to provide a kernel crash trace? (see logs below)

system data:

  • Ryzen 7 7840U
  • 16GiB RAM
  • NixOS 23.11
  • kernel 6.7.0
  • KDE Plasma 5.27.10
  • using X11, not wayland
  • DMUB hardware initialized: version=0x08002A00
  • no workarounds applied yet:
    • not in UMA game mode
    • not applied amdgpu.sg_display=0 yet

when the bug appeared: Unfortunately I have no reliable reproducer yet, I keep experimenting with workarounds disabled until I find one.
I’ve encountered the issue in 2 situations so far:

  • 2 times when resuming from a suspend-to-ram state with lid closed
  • 1 time during normal operations, when launching 2 hardware-accelerated video streams at once (in mpv)

@Matt_Hartley So amdgpu.sg_display=0 appears to be the preferred workaround so far. But according to the Phoronix article “AMD Re-Enables Scatter/Gather Support For All APUs On Linux”, AMD themselves consider scatter/gather an important feature. So if we need to keep this disabled long term, what should we expect? What exactly does scatter/gather even do?

system log (including a kernel trace)

Due to restrictions in post length, I need to cut parts away from my system log. Here’s the part with the kernel crash, the complete log can be found here.

Jan 15 02:59:08 framenix systemd[1]: Starting Pre-Sleep Actions...
Jan 15 02:59:08 framenix systemd[1]: pre-sleep.service: Deactivated successfully.
Jan 15 02:59:08 framenix systemd[1]: Finished Pre-Sleep Actions.
Jan 15 02:59:08 framenix systemd[1]: Reached target Sleep.
Jan 15 02:59:08 framenix systemd[1]: Starting System Suspend...
Jan 15 02:59:08 framenix systemd-sleep[21564]: Entering sleep state 'suspend'...
Jan 15 02:59:08 framenix kernel: PM: suspend entry (s2idle)
Jan 15 02:59:08 framenix kernel: Filesystems sync: 0.005 seconds
Jan 16 20:40:26 framenix kernel: Freezing user space processes
Jan 16 20:40:26 framenix kernel: Freezing user space processes completed (elapsed 0.016 seconds)
Jan 16 20:40:26 framenix kernel: OOM killer disabled.
Jan 16 20:40:26 framenix kernel: Freezing remaining freezable tasks
Jan 16 20:40:26 framenix kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
Jan 16 20:40:26 framenix kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Jan 16 20:40:26 framenix kernel: queueing ieee80211 work while going to suspend
Jan 16 20:40:26 framenix kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jan 16 20:40:26 framenix kernel: ACPI: EC: interrupt blocked
Jan 16 20:40:26 framenix kernel: ACPI: EC: interrupt unblocked
Jan 16 20:40:26 framenix kernel: nvme nvme0: 16/0/0 default/read/poll queues
Jan 16 20:40:26 framenix kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jan 16 20:40:26 framenix kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jan 16 20:40:26 framenix kernel: [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
Jan 16 20:40:26 framenix kernel: ------------[ cut here ]------------
Jan 16 20:40:26 framenix kernel: WARNING: CPU: 10 PID: 21581 at drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1526 dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel: Modules linked in: usbhid sd_mod uas usb_storage scsi_mod r8153_ecm scsi_common ccm qrtr rfcomm af_packet cmac algif_hash algif_skcipher af_alg bnep mt7921e mt7921_common mt792x_lib mt76_connac_lib cdc_mbim cdc_wdm mt76 cdc_ncm cdc_ether usbnet mac80211 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_realtek snd_sof_utils snd_hda_codec_generic snd_soc_core ledtrig_audio snd_hda_codec_hdmi btusb btrtl snd_compress ac97_bus btintel hid_sensor_als snd_pcm_dmaengine hid_sensor_trigger snd_hda_intel btbcm mousedev snd_pci_ps industrialio_triggered_buffer btmtk kfifo_buf snd_rpl_pci_acp6x snd_intel_dspcfg snd_intel_sdw_acpi hid_sensor_iio_common snd_acp_pci bluetooth snd_hda_codec snd_acp_legacy_common cfg80211 industrialio snd_pci_acp6x edac_mce_amd snd_hda_core snd_pci_acp5x nls_iso8859_1 snd_hwdep intel_rapl_msr edac_core xt_conntrack snd_rn_pci_acp3x sp5100_tco nls_cp437 snd_pcm nf_conntrack intel_rapl_common
Jan 16 20:40:26 framenix kernel:  snd_acp_config ucsi_acpi ecdh_generic watchdog crc32_pclmul snd_soc_acpi typec_ucsi hid_multitouch snd_timer vfat hid_sensor_hub rfkill polyval_clmulni cros_ec_lpcs nf_defrag_ipv6 polyval_generic joydev fat hid_generic r8152 gf128mul ecc ghash_clmulni_intel crc16 mii cros_ec snd typec rapl tiny_power_button k10temp soundcore snd_pci_acp3x i2c_piix4 libarc4 battery nf_defrag_ipv4 tpm_crb thermal ac roles i2c_hid_acpi button i2c_hid tpm_tis amd_pmf hid tpm_tis_core platform_profile amd_pmc ip6t_rpfilter evdev mac_hid ipt_rpfilter serio_raw xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat nf_tables nfnetlink sch_fq_codel ctr loop tun tap macvlan bridge stp llc vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp kvm irqbypass fuse efi_pstore configfs zstd zram efivarfs dmi_sysfs ip_tables x_tables autofs4 dm_crypt aes_generic cbc encrypted_keys trusted asn1_encoder tee tpm rng_core xhci_pci xhci_pci_renesas input_leds xhci_hcd led_class nvme sha512_ssse3 sha512_generic atkbd sha256_ssse3 sha1_ssse3 libps2
Jan 16 20:40:26 framenix kernel:  vivaldi_fmap nvme_core thunderbolt usbcore aesni_intel t10_pi libaes crypto_simd crc64_rocksoft cryptd crc64 i8042 crc_t10dif crct10dif_generic usb_common crct10dif_pclmul crct10dif_common serio rtc_cmos dm_mod dax btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq amdgpu i2c_algo_bit drm_ttm_helper ttm agpgart video wmi drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper drm backlight firmware_class
Jan 16 20:40:26 framenix kernel: CPU: 10 PID: 21581 Comm: kworker/u32:26 Tainted: G           O       6.7.0 #1-NixOS
Jan 16 20:40:26 framenix kernel: Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.03 10/17/2023
Jan 16 20:40:26 framenix kernel: Workqueue: events_unbound async_run_entry_fn
Jan 16 20:40:26 framenix kernel: RIP: 0010:dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel: Code: 21 c8 48 c1 e2 38 48 09 d0 48 89 85 98 02 00 00 f6 85 c4 02 00 00 02 74 44 e8 8a ed ff ff 84 c0 75 3b 48 8b 85 d8 01 00 00 90 <0f> 0b 90 c6 85 9c 02 00 00 80 48 8b 40 10 48 8b 30 48 85 f6 74 04
Jan 16 20:40:26 framenix kernel: RSP: 0018:ffffba7e489bbbf0 EFLAGS: 00010246
Jan 16 20:40:26 framenix kernel: RAX: ffffa2a8cb457200 RBX: 00000000ffffffff RCX: 00ffffffffffffff
Jan 16 20:40:26 framenix kernel: RDX: 0000000000000000 RSI: ffffba7e489bbbf0 RDI: 0000000000000000
Jan 16 20:40:26 framenix kernel: RBP: ffffa2a8d009e800 R08: 0000000000000008 R09: 0000000000000000
Jan 16 20:40:26 framenix kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffffa2a8d00a2a00
Jan 16 20:40:26 framenix kernel: R13: ffffa2a8cbc31c70 R14: ffffa2a8d009a800 R15: 0000000000000009
Jan 16 20:40:26 framenix kernel: FS:  0000000000000000(0000) GS:ffffa2ac3df00000(0000) knlGS:0000000000000000
Jan 16 20:40:26 framenix kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 16 20:40:26 framenix kernel: CR2: 00007f7d48000b96 CR3: 0000000471a20000 CR4: 0000000000f50ef0
Jan 16 20:40:26 framenix kernel: PKRU: 55555554
Jan 16 20:40:26 framenix kernel: Call Trace:
Jan 16 20:40:26 framenix kernel:  <TASK>
Jan 16 20:40:26 framenix kernel:  ? dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? __warn+0x81/0x130
Jan 16 20:40:26 framenix kernel:  ? dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? report_bug+0x171/0x1a0
Jan 16 20:40:26 framenix kernel:  ? handle_bug+0x42/0x70
Jan 16 20:40:26 framenix kernel:  ? exc_invalid_op+0x17/0x70
Jan 16 20:40:26 framenix kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jan 16 20:40:26 framenix kernel:  ? dp_retrieve_lttpr_cap+0x122/0x1e0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  link_blank_all_dp_displays+0x56/0xd0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  dcn31_init_hw+0x1d4/0x840 [amdgpu]
Jan 16 20:40:26 framenix kernel:  dc_set_power_state+0x5e/0xa0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  dm_resume+0xfc/0x880 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Jan 16 20:40:26 framenix kernel:  ? _dev_info+0x79/0xa0
Jan 16 20:40:26 framenix kernel:  amdgpu_device_ip_resume_phase2+0x4f/0xc0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  amdgpu_device_resume+0xa0/0x2c0 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? __pfx_pci_pm_resume+0x10/0x10
Jan 16 20:40:26 framenix kernel:  amdgpu_pmops_resume+0x4a/0x80 [amdgpu]
Jan 16 20:40:26 framenix kernel:  ? __pfx_pci_pm_resume+0x10/0x10
Jan 16 20:40:26 framenix kernel:  dpm_run_callback+0x89/0x1b0
Jan 16 20:40:26 framenix kernel:  device_resume+0x88/0x190
Jan 16 20:40:26 framenix kernel:  async_resume+0x1e/0x60
Jan 16 20:40:26 framenix kernel:  async_run_entry_fn+0x31/0x130
Jan 16 20:40:26 framenix kernel:  process_one_work+0x173/0x340
Jan 16 20:40:26 framenix kernel:  worker_thread+0x27b/0x3a0
Jan 16 20:40:26 framenix kernel:  ? __pfx_worker_thread+0x10/0x10
Jan 16 20:40:26 framenix kernel:  kthread+0xd4/0x100
Jan 16 20:40:26 framenix kernel:  ? __pfx_kthread+0x10/0x10
Jan 16 20:40:26 framenix kernel:  ret_from_fork+0x31/0x50
Jan 16 20:40:26 framenix kernel:  ? __pfx_kthread+0x10/0x10
Jan 16 20:40:26 framenix kernel:  ret_from_fork_asm+0x1b/0x30
Jan 16 20:40:26 framenix kernel:  </TASK>
Jan 16 20:40:26 framenix kernel: ---[ end trace 0000000000000000 ]---
Jan 16 20:40:26 framenix kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
Jan 16 20:40:26 framenix kernel: amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
Jan 16 20:40:26 framenix kernel: [drm] ring gfx_32803.1.1 was added
Jan 16 20:40:26 framenix kernel: [drm] ring compute_32803.2.2 was added
Jan 16 20:40:26 framenix kernel: [drm] ring sdma_32803.3.3 was added
Jan 16 20:40:26 framenix kernel: [drm] ring gfx_32803.1.1 ib test pass
Jan 16 20:40:26 framenix kernel: [drm] ring compute_32803.2.2 ib test pass
Jan 16 20:40:26 framenix kernel: [drm] ring sdma_32803.3.3 ib test pass
Jan 16 20:40:26 framenix kernel: ucsi_acpi USBC000:00: GET_CONNECTOR_STATUS failed (-5)
Jan 16 20:40:26 framenix kernel: OOM killer enabled.
Jan 16 20:40:26 framenix kernel: Restarting tasks ... 
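
In case it helps anyone condense a trace like this for a bug report, here is a minimal Python sketch that pulls the frames out of the `Call Trace:` section. The function name `reliable_frames` and the regular expression are my own and only assume the journal line format shown above; it skips the frames prefixed with `?`, which the kernel marks as not reliably part of the stack.

```python
import re

# Matches e.g. "kernel:  ? dm_resume+0xfc/0x880 [amdgpu]" in a journal line.
# Group 1: optional "? " marker, group 2: symbol name, group 3: module (if any).
FRAME = re.compile(
    r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs between 'Call Trace:' and '</TASK>',
    skipping frames that carry the '?' marker."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if "</TASK>" in line:
            in_trace = False
            continue
        if not in_trace:
            continue
        match = FRAME.search(line)
        if match and not match.group(1):
            frames.append((match.group(2), match.group(3)))
    return frames
```

Running it on the log above would list the resume path (`link_blank_all_dp_displays`, `dcn31_init_hw`, `dc_set_power_state`, `dm_resume`, and so on) without the uncertain entries.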

On reporting this upstream: Is the kernel bugzilla or freedesktop/drm the proper place for such bug reports?

Great idea, but I lack the cycles to dedicate to it personally. Additionally, we are seeing updates and changes coming through for AMDGPU and related very rapidly. So anything I tried would likely spend most of its time out of date as we continue pushing through.

If you feel comfortable doing so, I would be happy to take a post of your creation and make it a wiki (like we have done with Debian threads and similar in the community/maintained by the community).

Remember, Linux support is a team of 2 people. So we are laser focused on two distros and making sure the duplicable behavior we see in testing is tracked and validated. Community findings and discoveries are very welcome and encouraged.

Yes, scatter/gather support is important for performance. I’m no expert, but I think it means that the DMA can be done to multiple buffers that are not in contiguous memory. I think that explains why I’m only seeing some full screen buffers being corrupted. Notice how there are many suppressed IOMMU errors in your full log. Those errors go away when I disable scatter/gather.

Please do report it upstream. I have meant to do it but would be happy if someone else does the work :smile: I’m pretty sure the right place to do it is the freedesktop.org bug tracker since the same issue was closed for people who had 64 GB. That workaround clearly didn’t solve the problem, and people with less RAM are also affected.

UPDATE: I forgot to add that the kernel source code literally says to please report the issue if you have to disable scatter/gather.

I use Fedora 39 and have 32 GB RAM. I’ve made no changes, and the out of box experience is pretty bad. I can reproduce the problem in about 5-10 minutes using Firefox and a few 3D accelerated games from Steam. It only appears to dramatically affect the programs when they are in full screen, but it also makes my machine a lot more crash prone in general (maybe a memory corruption?), and makes resuming less stable. I’ve made a video showing the issue on my machine.

Maybe @Mario_Limonciello can add more information on what scatter/gather does performance wise and so on.

Please @spiollinux report your issue upstream on the freedesktop AMDGPU repository; there might also be an actual issue already open.

Is setting UMA_GAME_OPTIMIZED the best option right now?
How are UMA_GAME_OPTIMIZED and UMA_Auto different?

edit: also facing this issue on a random basis…

RAM is shared between CPU and GPU on this platform, dynamically that is - so whoever needs some more can allocate it. The UMA setting controls how much of that RAM is assigned to the GPU by default.

The main goal of that setting is to work around some games checking for dedicated GPU RAM and complaining that it’s not enough.

However, since the platform is very new, there still seems to be a bug (or a few) that creates problems, especially with suspend. Opting for UMA_GAME_OPTIMIZED works around these issues by allocating more dedicated RAM to the GPU directly. The downside is that less RAM is available for the rest of the system.
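
To see how much memory is actually reserved on a given machine, here is a minimal Python sketch that reads the totals the amdgpu driver exposes in sysfs. The helper name `read_mem_info_gib` is made up for illustration, and the default path assumes the iGPU is `card0`, which may not hold on every system (it can be `card1`).

```python
from pathlib import Path
from typing import Optional


def read_mem_info_gib(device_dir: str, name: str) -> Optional[float]:
    """Read an amdgpu mem_info_* file (a byte count) and return it in GiB.

    Returns None if the file is not present, e.g. on a different card index
    or a non-amdgpu device.
    """
    path = Path(device_dir) / name
    if not path.is_file():
        return None
    return int(path.read_text().strip()) / 2**30


if __name__ == "__main__":
    # Assumption: the iGPU is card0; adjust the path if it is not.
    device = "/sys/class/drm/card0/device"
    for name in ("mem_info_vram_total", "mem_info_gtt_total"):
        print(name, read_mem_info_gib(device, name))
```

With UMA_GAME_OPTIMIZED set, `mem_info_vram_total` should come out at roughly the 4GB mentioned earlier in the thread; the GTT figure is the system RAM the GPU can additionally map on demand.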

I expect that this will be solved in the future via software (OS) Updates.

Cheers
/herodot

thanks for the detailed explanation!!

I expect that this will be solved in the future via software (OS) Updates.

I also hope that it will be fixed, maybe after applying newer kernels… since I don’t game, losing some RAM isn’t very nice lol (in my case roughly 3.8 GB loss)

Since you don’t game, I think your best option for now is turning off the scatter/gather feature. I have done it, and it makes the computer much more stable (and I can still game on it). I used grubby to set amdgpu.sg_display=0.
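
On Fedora the grubby invocation is typically something like `sudo grubby --update-kernel=ALL --args="amdgpu.sg_display=0"`, followed by a reboot. To confirm the argument actually made it onto the running kernel's command line, here is a minimal Python sketch; the function name `sg_display_disabled` is my own, and it assumes that when the parameter appears more than once, the last occurrence is the one that takes effect.

```python
from pathlib import Path


def sg_display_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line sets amdgpu.sg_display=0."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "amdgpu.sg_display":
            # Assumption: a later occurrence overrides an earlier one.
            value = val
    return value == "0"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("sg_display disabled:", sg_display_disabled(proc_cmdline.read_text()))
```

This only checks what was passed at boot; it says nothing about whether the workaround helps on a particular machine, which clearly varies in this thread.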

Looking at the AMD DMA (Direct Memory Access) documentation, scatter/gather appears, AFAIK, to let a DMA ‘descriptor’ be split across two different memory locations by putting each part into its own descriptor and linking the descriptors to each other.

The “Scatter” part refers to splitting a very large DMA Operation into several smaller descriptors linked together.

The “Gather” part refers to merging them together by looking for a special suffix on each of these operations (except the last operation which has a normal suffix).

The GPU framebuffer can only write to one location at a time; this enables the GPU to “scatter” the larger operations among many of the DMA descriptors.

The DMA can be imagined like a large chain, where you can only add or remove one chain link at a time and the last chain link always points to the first chain link.
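
To make the idea concrete, here is a toy Python sketch of a scatter/gather descriptor chain. It is only a simulation under my own assumptions, not how amdgpu or the hardware implements it: the `Descriptor` class, the `scatter` and `gather` functions, the 4-byte "page" size, and the simple linear chain (rather than a ring where the last link points back to the first) are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SIZE = 4  # deliberately tiny "page" so the example stays readable


@dataclass
class Descriptor:
    address: int                          # start of this chunk in simulated memory
    length: int                           # number of valid bytes in the chunk
    next: Optional["Descriptor"] = None   # link to the following descriptor


def scatter(data: bytes, memory: bytearray, free_pages: List[int]) -> Optional[Descriptor]:
    """Split one large transfer into page-sized chunks placed in whatever
    (non-contiguous) pages are free, linked together as a descriptor chain."""
    head = tail = None
    for offset in range(0, len(data), PAGE_SIZE):
        chunk = data[offset:offset + PAGE_SIZE]
        address = free_pages.pop(0) * PAGE_SIZE
        memory[address:address + len(chunk)] = chunk
        desc = Descriptor(address, len(chunk))
        if head is None:
            head = desc
        else:
            tail.next = desc
        tail = desc
    return head


def gather(head: Optional[Descriptor], memory: bytearray) -> bytes:
    """Walk the chain and reassemble the chunks in their original order."""
    out = bytearray()
    desc = head
    while desc is not None:
        out += memory[desc.address:desc.address + desc.length]
        desc = desc.next
    return bytes(out)
```

For example, scattering `b"framebuffer"` over free pages `[6, 1, 4]` of a 32-byte memory puts the three chunks at addresses 24, 4 and 16, and `gather` on the returned head reassembles the original bytes. Without scatter/gather, the same buffer would need one contiguous region, which is presumably why the larger reserved VRAM carve-out matters once the feature is turned off.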

I am having graphical errors with this machine as well. The external monitor will completely turn white when Firefox or any other app is full screen. Sigh
If I unplug the monitor while it’s white, the main display immediately glitches out.