[TRACKING] Graphical corruption in Fedora 39 (AMD 3.03 BIOS)

I’m still having the issues if I turn off UMA_GAME_OPTIMIZED in the BIOS.
I hope the 6.7.x kernel will fix it :frowning:

I still get them with kernel 6.7.3 on Fedora 39.

A little update: I just updated the kernel to 6.7.4 and it has made things worse. The external screen turns white more frequently. I would just leave the UMA_GAME_OPTIMIZED option switched on permanently.
I don’t know if it’s a hardware or a driver issue, but it’s really frustrating.

Is it safe to say those of us with AMD GPUs should stick with the 6.6.x kernels?

No need to avoid updating. You can just set the UMA_GAME_OPTIMIZED option and it will work fine :slight_smile: . It’s good that AMD is looking into it.

Awesome, thank you for this. Has this been filed with Fedora yet? It looks brand new, so I assume it has not been; I’m asking to avoid duplication.

Not filed.
This is the upstream submission:

[PATCH 1/2] drm/buddy: Fix alloc_range() error handling code (kernel.org)

Not sure if this is related, but external displays have started blowing up my CPU usage for some reason. Setup: 7640U, Fedora 39 KDE, kernel 6.7.4-200.

@Mario_Limonciello I’ve applied the patch you linked to 6.7.4 and I’m still seeing issues when unplugging my USB-C display. It coincides with the IOMMU reporting lots of errors:

Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: Using 44-bit DMA addresses
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc00000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc01000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc02000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc03000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc04000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc05000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc06000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc07000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc08000 flags=0x0000]
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc84000 flags=0x0000]
...
Feb 14 15:15:06 avalon kernel: amd_iommu_report_page_fault: 80096 callbacks suppressed
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c0000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c1000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c2000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c3000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c4000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c5000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c6000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c7000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3e0000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3cc000 flags=0x0000]

The addresses also look fishy. I only have 32 GB of RAM, yet the addresses sit near the top of the 44-bit range, close to 16 TiB?
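
For anyone who wants to sanity-check the addresses in the log above, here is a rough Python sketch. It assumes the 44-bit width from the "Using 44-bit DMA addresses" line and the 32 GB of RAM mentioned in this post; the function names `fault_addresses` and `describe` are made up for illustration. Note that with the IOMMU translating, these are likely device-side (I/O virtual) addresses rather than physical RAM addresses, so being far above installed RAM is not necessarily wrong by itself.

```python
import re

# Matches the AMD-Vi fault lines as they appear in the journal excerpt above.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT domain=(0x[0-9a-fA-F]+) address=(0x[0-9a-fA-F]+)"
)


def fault_addresses(log_text):
    """Return every faulting address found in the log text, as integers."""
    return [int(m.group(2), 16) for m in FAULT_RE.finditer(log_text)]


def describe(addr, dma_bits=44, ram_gib=32):
    """Relate one address to the DMA address width and the installed RAM.

    dma_bits and ram_gib are assumptions taken from this thread.
    """
    top = 1 << dma_bits
    return {
        "tib": addr / 2**40,                    # position in TiB
        "below_top_mib": (top - addr) / 2**20,  # distance from the top of the range
        "fits_dma_width": addr < top,
        "above_ram": addr >= ram_gib * 2**30,
    }


# Two lines copied from the log above (first fault of each burst).
SAMPLE = """\
Feb 14 15:15:01 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xfffffc00000 flags=0x0000]
Feb 14 15:15:06 avalon kernel: amdgpu 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffffe3c0000 flags=0x0000]
"""

for addr in fault_addresses(SAMPLE):
    print(hex(addr), describe(addr))
```

With these two samples, the first address is 4 MiB below the top of the 44-bit space and the second is 28.25 MiB below it, so both are just under 16 TiB rather than in the 4 TB range.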

The patch only helps with the graphical corruption, not with the white screen + IOMMU issue that several people have reported. Yes, the IOMMU is just the messenger here. It’s still not clear where the bug causing this actually is.

Ah, ok. Is there a known workaround besides disabling the IOMMU? Or an upstream bug report to follow?

Besides losing some capabilities in virtualizing PCIe devices, is there any downside to having the IOMMU disabled? I use this laptop mainly as a workstation and I do run some VMs, but not with PCIe devices assigned to them.

Apart from that, so far the issue still only happens after hibernate/suspend on Fedora 39 with kernel 6.7.3-200.fc39.x86_64, so no change yet :slight_smile:

I had issues with the sd-card controller on my old laptop without the IOMMU. iommu=soft worked around that.

The problem with disabling the IOMMU is that a buggy (or malicious) driver or device may then have an easier time corrupting memory in your system. That may or may not be an issue for you.

As a side note, I’m not sure if this is actually an IOMMU problem or a case of the GPU scribbling over random physical memory (on the driver’s mistaken directions) and mostly getting away with it unless the IOMMU catches it red-handed. Whether you’re fine allowing this to run and hoping for the best is for you to decide, of course.

Right, thanks, I’ll leave it enabled then (I only ever used it for virtualizing PCIe devices, but I never knew it also handled stuff like this ;-).

There’s an upstream bug report that was opened today.

So the graphical corruption isn’t specific to Framework, but to the AMD CPU?

Thus far it’s only been reported by users on Framework laptops. I have an educated but unsubstantiated suspicion it’s related to a BIOS interaction.

After enabling UMA_GAME_OPTIMIZED in the BIOS, the issues became less frequent but still appeared. I have also added amdgpu.sg_display=0 to my kernel boot parameters, and so far the issues have disappeared. So the workaround posted by @Mario_Limonciello in the upstream bug seems to work just fine :slight_smile:
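
In case it helps others confirm the parameter really made it onto the running kernel’s command line, here is a small Python sketch. The helper names (`parse_cmdline`, `has_sg_display_workaround`, `read_running_cmdline`) and the example command line are made up for illustration; only `amdgpu.sg_display=0` and `iommu=soft` come from this thread. The parser is deliberately naive and does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'ro' or 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def has_sg_display_workaround(cmdline):
    """True if amdgpu.sg_display=0 is present on the given command line."""
    return parse_cmdline(cmdline).get("amdgpu.sg_display") == "0"


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical example; on a real system pass read_running_cmdline() instead.
EXAMPLE = "ro rhgb quiet iommu=soft amdgpu.sg_display=0"
print(has_sg_display_workaround(EXAMPLE))
```

On Fedora the parameter is usually added with grubby (for example `grubby --update-kernel=ALL --args="amdgpu.sg_display=0"`) followed by a reboot, though your setup may differ.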

It will be interesting to see whether the same issue is reproducible on the Zen 4 APU SKUs.

I haven’t seen it on the Zen 3 APUs I’ve tested with (5700G), but they are based on the much older Vega IP block.