[SOLVED] Amdgpu crashes and artifacts with Mesa 25, kernel 6.13

And now it just happened on 6.12.10 which is from mid-January. Specifically, after resuming from sleep. I also have two external monitors attached via TB4, so my setup pushes the limits of the chipset but that just means I trip on these things more often than people using only the laptop panel.

It clearly happens more often on 6.13.x versions but I’m not sure how far to go back to find a kernel version that works reliably. I’m sure this wasn’t happening in the fall.

Reading this and several other threads about this, I believe the issue is with Mesa-25, not the Kernel at all. I’m running Fedora 41, not Tumbleweed; however after downgrading to Mesa-24 I haven’t seen issues. But I also haven’t stressed the system enough to call it fixed.

As best I know there is no simple reproduction reported to the Mesa project at this time.

Thanks, Christopher. I disabled the various updates-testing repos and then sudo dnf distro-sync --allowerasing to get back to Mesa 24. I’ll let you know how it goes.

Personal advice

Skip 6.12 and 6.13 if possible; they are very buggy.

Screen issues, VM issues, ROCm doesn’t work.

Use 6.10, 6.11, or some 6.14 kernels. 6.14-rc5 works well for me.

I am yet another Tumbleweeder using Gnome+Wayland, also with an AMD Ryzen™ 5 7640U laptop. Thanks for the tip about kernel-longterm; I just installed it.

FYI, here are a couple of openSUSE bugzilla reports that I’ve been following as (perhaps) being related to our problem:

I notice that BIOS 3.07 has recently been released. Has anyone seen if it resolves any of this?

Thanks for the recap :folded_hands:, I’ll keep an eye on those.

I installed 3.07 a few days ago and it didn’t change a thing. kernel-longterm is the only fix for me, at least it seems to be that way so far (zero freezes).

Kernel-longterm has also resolved the issue for me, at least for the past 24 hours. Also, the openSUSE bugzilla reports I mentioned above have been resolved without fixing our issue, so I’ve submitted a new one, 1239657 - Garbling screen artifacts after suspend/resume, specifically for what we’re seeing. Please add any additional information you may have. (It’ll show that more than one person is having the problem.)

I think I’ll hold off on 3.07 for now. No reason to roll too many dice at the same time. :wink:

Using Arch Linux. Framework 16 on Linux 6.13.7

I downgraded packages that contained mesa or radeon in the name and were at version 25.x (specifically 25.0.1) to v24.

In my commandline I have amdgpu.dcdebugmask=0x10 amdgpu.gpu_recovery=1

So far I don’t see the amdgpu page fault citing firefox.
We’ll see. I won’t consider it a valid workaround yet, but I’m hopeful for now.

UPDATE 1 week later: Yup, it seems Mesa was the issue. 25.0.2 is out, but the latest post says it doesn’t help, so I’m staying on v24.

I’m yet another FW13 AMD running Fedora 41.
After some testing, I’m sure the problem is caused by Mesa 25, and not necessarily correlated with Firefox.
I’ve upgraded and downgraded between Mesa 24.x and Mesa 25 a number of times, and so far every time I downgrade, all the on-screen artifacts disappear. I’ve used dnf versionlock to pin mesa-dri-drivers to 24.2.4-1.fc41 for now, and that seems to have fixed the problem.
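
For anyone scripting a check before an update, here is a minimal Python sketch that pulls a Mesa version out of a package string and flags the releases this thread reports as problematic. The function names are my own, and the "bad" range (25.0.x only) is an assumption based purely on the reports here for 25.0.1 and 25.0.2, not on anything from the Mesa project.

```python
import re
from typing import Tuple


def parse_mesa_version(text: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from a string such as '24.2.4-1.fc41'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_reported_bad(text: str) -> bool:
    """True for Mesa 25.0.x, the series reported as affected in this thread.

    Assumption: later series (25.1 and up) are not flagged, simply because
    nobody here has reported on them.
    """
    version = parse_mesa_version(text)
    return (25, 0, 0) <= version < (25, 1, 0)
```

The input can be any string containing the version, for example the output of `rpm -q mesa-dri-drivers` on Fedora.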

Thanks folks for pointing this in the right direction. By any chance, is any of you aware of a Mesa bug report, or have you filed one?

Thanks

EDIT: looks like a report exists

I’m getting graphics lock-ups that halt the whole system till it auto resets.

[11371.628554] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[11371.630858] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[11371.640927] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=2396193, emitted seq=2396195
[11371.640932] amdgpu 0000:c1:00.0: amdgpu: Process information: process Diablo IV.exe pid 25118 thread vkd3d_queue pid 25189
[11371.640934] amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[11373.644731] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[11373.644738] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[11373.644838] amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failure
[11373.644840] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[11375.703519] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[11375.703533] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[11375.974835] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[11375.976534] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[11376.009082] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[11376.009762] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[11376.009859] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[11376.011934] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[11376.019220] [drm] DMUB hardware initialized: version=0x08004800
[11376.344978] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[11376.344987] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[11376.344990] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[11376.344993] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[11376.344995] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[11376.344997] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[11376.344999] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[11376.345002] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[11376.345004] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[11376.345006] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[11376.345009] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[11376.345011] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[11376.345014] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[11376.347522] amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) succeeded!
[11384.960509] usb 1-4: reset full-speed USB device number 3 using xhci_hcd
[11385.232465] usb 1-4: reset full-speed USB device number 3 using xhci_hcd
[11407.601853] amdgpu 0000:c1:00.0: amdgpu: VM memory stats for proc Diablo IV.exe(25189) task vkd3d_queue(25118) is non-zero when fini
[12706.265489] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[14498.283304] perf: interrupt took too long (3134 > 3128), lowering kernel.perf_event_max_sample_rate to 63750
[15648.482345] ucsi_acpi USBC000:00: unknown error 256
[15648.482352] ucsi_acpi USBC000:00: GET_CABLE_PROPERTY failed (-5)
[15780.063892] usb 1-4: reset full-speed USB device number 3 using xhci_hcd
[15780.351902] usb 1-4: reset full-speed USB device number 3 using xhci_hcd
[18752.454834] perf: interrupt took too long (3926 > 3917), lowering kernel.perf_event_max_sample_rate to 50750
[22363.849334] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[22363.852200] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[22363.862251] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=13023333, emitted seq=13023335
[22363.862256] amdgpu 0000:c1:00.0: amdgpu: Process information: process Diablo IV.exe pid 107761 thread vkd3d_queue pid 107873
[22363.862259] amdgpu 0000:c1:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[22365.866087] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[22365.866093] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[22365.866199] amdgpu 0000:c1:00.0: amdgpu: Ring gfx_0.0.0 reset failure
[22365.866202] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[22367.925667] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[22367.925683] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[22368.196607] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[22368.198294] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[22368.231906] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[22368.232606] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[22368.232716] [drm] VRAM is lost due to GPU reset!
[22368.232723] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[22368.234736] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[22368.242021] [drm] DMUB hardware initialized: version=0x08004800
[22368.567696] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[22368.567703] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[22368.567706] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[22368.567708] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[22368.567709] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[22368.567711] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[22368.567712] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[22368.567714] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[22368.567716] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[22368.567717] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[22368.567719] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[22368.567721] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[22368.567723] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[22368.569927] amdgpu 0000:c1:00.0: amdgpu: GPU reset(4) succeeded!

This has happened a number of times in the past while just browsing. I’ve had it pretty consistently happen while playing Diablo IV. It seems able to recover when playing a game, but is never recoverable when browsing with Firefox.
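
To make logs like the one above easier to compare across reports, here is a rough Python sketch that extracts each ring timeout together with the process the driver blames. The regular expressions are written against the exact line format shown in this dmesg output; other kernel versions may word these messages differently, so treat the patterns and the `RingTimeout` / `find_ring_timeouts` names as assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches e.g. "[11371.640927] ... ring gfx_0.0.0 timeout, signaled seq=..., emitted seq=..."
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Matches e.g. "Process information: process Diablo IV.exe pid 25118 thread ..."
PROCESS_RE = re.compile(r"Process information: process (?P<proc>.+?) pid \d+ thread")


@dataclass
class RingTimeout:
    timestamp: float                 # seconds since boot, from the dmesg prefix
    ring: str                        # e.g. "gfx_0.0.0"
    pending: int                     # emitted seq minus signaled seq
    process: Optional[str] = None    # filled in from the following log line


def find_ring_timeouts(log_text: str) -> List[RingTimeout]:
    """Return one entry per 'ring ... timeout' line, in log order."""
    events: List[RingTimeout] = []
    for line in log_text.splitlines():
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            pending = int(timeout["emitted"]) - int(timeout["signaled"])
            events.append(RingTimeout(float(timeout["ts"]), timeout["ring"], pending))
            continue
        process = PROCESS_RE.search(line)
        # Attach the process to the most recent timeout that lacks one.
        if process and events and events[-1].process is None:
            events[-1].process = process["proc"]
    return events
```

Feeding it the text of `dmesg` (or a saved journal excerpt) gives a short list that is easier to paste into a bug report than the full dump.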

Not running Flatpaks, as mentioned early on in the thread.
OS: Debian Sid, XFCE (X11)
Running Linux 6.14.0-rc7
With Framework Laptop 13 - AMD Ryzen 7 7840U

Full kernel params: amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 usbcore.autosuspend=-1 loglevel=4 i8042.unlock=1
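
Since several posts here compare kernel command lines, a small Python sketch for listing just the parameters of one module may help. It only splits a command-line string on whitespace and `module.key=value` tokens; the `module_params` name is made up for this example, and quoted values containing spaces are not handled.

```python
from typing import Dict


def module_params(cmdline: str, module: str = "amdgpu") -> Dict[str, str]:
    """Collect 'module.key=value' tokens from a kernel command line."""
    prefix = module + "."
    params: Dict[str, str] = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            # Split only on the first '=' so values may contain '='.
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params
```

On a running system the input would be the contents of /proc/cmdline, which shows what the kernel was actually booted with rather than what is in the bootloader config.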

I’m running Fedora Silverblue 41 on a Framework Laptop 13 - AMD Ryzen 7 7840U with the 2.8K display and was experiencing these graphical artifacts and GPU crashes. Most of the time I would just get kicked back to GDM when it crashed.

I downgraded from Mesa 25.0.1 to Mesa 24.2.4 and that cleared up all the problems. I noticed Mesa 25.0.2 came out yesterday so I hope these issues are cleared up when it lands in Fedora.

Seems Mesa 25.0.2 doesn’t fix all of these issues.

No, 25.0.2 hasn’t fixed the GPU crashing. Unfortunately I can’t downgrade on Debian Sid because of dependency hell.

I don’t even have that kernel parameter enabled and I still face the issue.

If you read previous comments, it looks more likely to be caused by Mesa 25.

In a previous comment I pasted the link to the issue on the Mesa bug tracker.

Thread renamed, tags added to better describe the issue

I’ve been having issues with amdgpu acting up for the better part of the last year, since around kernel 6.9. The solution I’ve been successfully using since February has been putting amdgpu.dcdebugmask=0x12 in the kernel command line. Right now I’m running 6.13.7 with mesa 25 without any issues. Good to hear though that 6.14 might be coming with an actual solution.
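
The 0x10 and 0x12 values mentioned in this thread are bitmasks, so here is a hedged Python sketch that decodes them. The bit names are what I recall from the kernel's DC_DEBUG_MASK enum in amd_shared.h and cover only the low five bits; please check them against your own kernel source before relying on this. The `decode_dcdebugmask` helper is hypothetical.

```python
from typing import List

# Assumed subset of DC_DEBUG_MASK (from memory of amd_shared.h; verify locally).
DC_DEBUG_BITS = {
    0x1: "DC_DISABLE_PIPE_SPLIT",
    0x2: "DC_DISABLE_STUTTER",
    0x4: "DC_DISABLE_DSC",
    0x8: "DC_DISABLE_CLOCK_GATING",
    0x10: "DC_DISABLE_PSR",
}


def decode_dcdebugmask(value: str) -> List[str]:
    """Translate a value such as '0x12' into the flag names it sets."""
    mask = int(value, 0)  # base 0 accepts both '0x12' and '18'
    names: List[str] = []
    for bit, name in sorted(DC_DEBUG_BITS.items()):
        if mask & bit:
            names.append(name)
            mask &= ~bit
    if mask:
        # Bits outside the table above are reported rather than guessed at.
        names.append(f"unknown bits 0x{mask:x}")
    return names
```

Under those assumptions, 0x10 disables only PSR, while 0x12 additionally disables stutter mode.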

I just got Linux 6.14 on Arch Linux
I’ll do some experiments

UPDATE: I rebooted with Mesa 25.0.2 and within 1-2 hours the problem mentioned in this thread appeared. So Mesa 25 is still a no-go even with Linux 6.14…
Specifically, this time it appeared when I opened a video with mpv. Everything froze except the cursor for 7 seconds, and sure enough, when I checked dmesg I saw amdgpu page faults in process mpv spammed.

My hunch is that this bug appears when playing video. Both Firefox and mpv end up playing video.

The good news is that people seem to have located the commit in mesa that causes the issue

openSUSE seems to be adding a patch to its builds to revert it.

Is that the same issue? They don’t mention page faults there, only the artifacts

UPDATE 2: Seems like it isn’t necessarily video-specific, since on the thread one person got it on kwin_wayland. And yup, he reports the page faults like we see here, so it is the same issue. Arch could also adopt the revert patch tbh, but according to its philosophy it might wait to see it upstreamed first.

What remains to be seen is whether I’ll get artifacts and freezes from PSR-SU (I have removed dcdebugmask from the cmdline). I think this is a separate issue from the one mentioned in this thread, but still under the amdgpu umbrella.

UPDATE 3: 3 days in, it seems like dcdebugmask might not be needed anymore in 6.14. Will continue testing.

Great to hear about the Mesa progress. Regarding kernel 6.14, a patch has landed temporarily disabling PSR for eDP displays (essentially what the debug flag does), so no longer needing the kernel argument is sadly not indicative of a fix. Fortunately this still means the bug is getting attention, so we might be getting an actual fix soon.

Not seeing any issues under 6.14 (via mainline) and Ubuntu 24.10, so that’s Mesa 24.2.8

Ubuntu has the beta of 25.04 out right now, which uses the 6.14 kernel and probably Mesa 25, but I’ve not had a chance to check it with a live USB yet.