AMDGPU crashing in games with ring gfx_0.0.0 timeout, Framework Desktop

I’ve posted this topic on Reddit before, but maybe I can get help here:

Hi, I’ve been suffering from this issue on my Framework Desktop for a little while now, though not right from the beginning, and I can’t for the life of me figure out what causes (caused?) it. I use EndeavourOS with Hyprland. My Framework Desktop is mainboard-only, and I use a Corsair 450 W SFX PSU. It delivers 37 A on the 12 V rail, against the 32 A Framework recommends. I also use a pretty chonky fan (Silent Wings 4 Pro).

Through a lot of observation, I found out in which exact scenario it happens, and it’s as follows:


When it happens:

SPECIFICALLY, when I play a game that has some sort of scene with a lot of very quick FPS dips due to CPU processing, it may crash on one of these dips! I’ve observed this in two games (others never crash). The first is Blue Protocol: when crafting, every time an item gets made there is an FPS dip that I can’t see on screen, but Steam’s FPS graph shows an instantaneous dip.

Here’s a video showing the dips: https://files.catbox.moe/nf0304.mp4 and here’s an image showing them in detail: https://files.catbox.moe/om3ctt.png. In the video above, I’ve capped the game at 30 fps, which seems to alleviate the issue somewhat: the dips aren’t so wild, so it seems much less likely to die.

The second game is Zenless Zone Zero, but there it only really happens in moments of loading: on the agent select screen (scrolling between agents can cause it) or when starting a mission. Never during a mission, really. Only in moments of slight stutter.

I want to underline that it only ever happens in games that stutter frequently within a short amount of time. Yesterday I played a game that maxes out the GPU all the time (100+ watts sustained load) for literally 8 hours in a row (Final Fantasy VII Rebirth), with no trouble AT ALL.

It’s only these two games I mentioned atm, and both of them are Unity engine as well…


Thesis

What this makes me think of is power management, and looking at other threads, yes, people keep suggesting changing power management values. But this hasn’t really worked for me yet. I will attach dmesg and such from when it happens at the end of this writeup. What I think happens is that during this small window where no frames are being rendered by the GPU (because the CPU is stalling), the GPU drops into a lower power state and then goes right back up. I don’t know WHY this would break, but in conjunction with the current driver implementation, it doesn’t seem to like it. It also doesn’t sound good, to be fair.
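One way to test this hypothesis is to log the GPU clock level at a high rate and compare the timestamps with the moment of the crash. The following is only a minimal sketch: it assumes the GPU is `card0` and that the amdgpu sysfs file `pp_dpm_sclk` lists the clock levels with an asterisk on the active one. The function names are made up for illustration, and a 10 ms polling interval may well miss very short transitions.

```python
import re
import time
from pathlib import Path

# Assumption: the GPU is card0; adjust the index if your system differs.
SCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_sclk")


def active_sclk_mhz(text):
    """Return the clock in MHz of the level marked with '*', or None."""
    for line in text.splitlines():
        match = re.match(r"\s*\d+:\s*(\d+)\s*mhz\s*\*", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def log_transitions(path=SCLK_PATH, interval=0.01):
    """Print a wall-clock timestamped line whenever the active level changes.

    Runs until interrupted (Ctrl+C).
    """
    last = None
    while True:
        current = active_sclk_mhz(path.read_text())
        if current != last:
            print(f"{time.time():.3f} sclk={current} MHz", flush=True)
            last = current
        time.sleep(interval)
```

Calling `log_transitions()` from an SSH session while the game runs gives a list of clock changes that can be lined up against the kernel log timestamps. If the crashes do not line up with drops to the lowest level, that would speak against the power state theory.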


What I’ve tried

So here are, in no particular order, things I’ve changed on the system to try to get rid of the issue. I’ve also tried combinations of them (albeit not every combination… probably).

  • Apply kernel values
    • amdgpu.ppfeaturemask=0xffffbfff (and other variants I’ve found, I haven’t documented them all)
    • amdgpu.mes=0 or 1 (didn’t change anything. maybe the “new” MES scheduler is forced on this newer GPU? I definitely know it always announces itself in the dmesg when it breaks, whether or not it’s on or off)
    • amdgpu.pm=0 (it causes the kernel to complain literally every second that it can’t switch power states, also seems to be stuck in a low asf power state then)
    • cwsr_enable=0 - didn’t cause any changes from what I can see
  • Different kernels:
    • linux-drm-next-git, linux-drm-tip-git
    • Since what complains is the kernel, I thought the one with the newest bleeding edge graphics implementations would work, but it didn’t change anything
  • MESA
    • replacing mesa lib32-mesa vulkan-mesa-layers lib32-vulkan-mesa-layers vulkan-radeon lib32-vulkan-radeon vulkan-mesa-implicit-layers lib32-vulkan-mesa-implicit-layers with this family of packages
    • downgrading mesa as far as it goes (but at some point, my libc is too new, so it doesn’t go that far…)
  • linux-firmware-git
    • having the newest firmwares, maybe?
  • proton versions
    • like tkg, em, etc…
  • Considered that maybe the PSU is at fault (during power state switching, the PSU isn’t fast enough, causing a brownout that makes the GPU trip?). I’ve tried two other PSUs, but the FW mobo didn’t like them and wouldn’t turn on. To rule the PSU out completely, I planned to try the 800 W PSU from my big tower, which drives a 200 W CPU and a 300 W GPU without issues. Update: I tried the big PSU and it didn’t change anything. So much for that!
  • Memtest, came out okay.
  • Reinstalling (maybe I missed a config that’s ruining it). This time to bare Arch Linux. I cherry-picked everything on there now, but nope…
  • Playing with LACT to keep the frequency at the highest on the GPU. This helps somewhat, but it just delays the issue. It also causes me to not have to reboot entirely, and even the DE recovers completely, usually.
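With this many kernel options in play, it may be worth confirming which ones actually reach the driver. As far as I know, a module option given on the kernel command line needs the `amdgpu.` prefix, so a bare `cwsr_enable=0` there would not be applied to amdgpu. The sketch below is only an illustration with made-up helper names: it pulls the `amdgpu.*` options out of a command line string and lists which bits a `ppfeaturemask` value clears. What each bit stands for is not encoded here; that mapping is defined in the kernel’s `amd_shared.h` and should be checked against your kernel version.

```python
def amdgpu_params(cmdline):
    """Extract amdgpu.* options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


def cleared_bits(mask, width=32):
    """List the bit positions that are 0 in a feature mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# Example string built from the values in the list above; on a real system,
# read the contents of /proc/cmdline instead.
example = "quiet amdgpu.ppfeaturemask=0xffffbfff amdgpu.mes=0 cwsr_enable=0"
params = amdgpu_params(example)
mask = int(params["ppfeaturemask"], 16)

print(params)              # the bare cwsr_enable token is not picked up
print(cleared_bits(mask))  # 0xffffbfff clears only bit 14
```

The values the loaded module ended up with can also be read back from `/sys/module/amdgpu/parameters/`, which is the more direct check.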

The logs

This one is with some of the newer, bleeding-edge packages, and it’s how most of these went down. The game froze, then held that way (while music and audio continued to play as normal). Then a black screen, and I was unable to switch TTY while this went on. I recorded this through SSH: “[ 0.000000] Linux version 6.17.9-arch1-1 (linux@archlinux) (gcc (GCC) 15.2.1” (Pastebin.com). Sometimes it would get unstable, so even if I do get it to recover, the mouse stutters every few seconds. So it’s usually a reboot via SysRq (REISUB), which it DID listen to!

Then the most recent one: “[ 0.000000] Linux version 6.17.9-arch1-1 (linux@archlinux) (gcc (GCC) 15.2.1” (Pastebin.com). In this one, it actually recovered by itself, which just killed the game while the rest kept on living.
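Since the full logs are long, here is a small sketch for pulling out just the part around each timeout. It assumes the kernel log is available as lines of text (for example piped from `dmesg --follow` over SSH) and uses the “ring gfx_0.0.0 timeout” string from the thread title as the marker. The function name and the default of 20 context lines are arbitrary choices for illustration.

```python
from collections import deque


def with_context(lines, marker="ring gfx_0.0.0 timeout", before=20):
    """Yield each line containing the marker, preceded by up to `before`
    lines of context, as one list per hit."""
    recent = deque(maxlen=before)
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            yield list(recent) + [line]
            recent.clear()
        else:
            recent.append(line)
```

Feeding `sys.stdin` into `with_context` and printing each returned group gives one compact block per crash, which makes it easier to compare what the kernel logged right before each timeout across different attempts.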


I’m sorry for the long-ass post, I just wanted to show you that I’ve tried everything I could so far. Another thing I’m gonna try is to just switch to Plasma, which is the most developed Wayland compositor and desktop, and also the one Valve has their eyes on. I also considered using a gamescope-steam session to get the compositor completely out of the picture.

This might be related to an issue with the latest AMD firmware in Arch Linux.

The Arch Linux firmware “linux-firmware-amdgpu 20251125-1” contains the firmware file(s) that might cause the issue. Rolling back to “linux-firmware-amdgpu 20251111-1” might fix it.

I can also see this issue on my desktop board 128 GB custom build if I use ROCm. I don’t see this issue when I use llama.cpp with Vulkan. I also don’t see any issue during normal non-GPU workloads (e.g. internet browsing). So I didn’t roll back the “linux-firmware-amdgpu” package on my system.

I also don’t use any of your kernel options.

Does this setting help at all with the crashing?

This one doesn’t help, I actually tried it as well, but forgot to edit it in!

Okay, so I actually started out on 20251111-1 and had the same issue. But this is a good idea, I might actually go even farther back on that firmware to see if it does anything. Maybe that one’s the reason it worked in the very beginning when I just got the desktop. I switched to linux-firmware-amdgpu-20251021 to see.

EDIT: The older firmware still crashes

In the meantime, I found a third game that crashes: Final Fantasy VII Rebirth. But not during regular gameplay segments, only specifically during the minigame “Queen’s Blood”, since there it exhibits the same sort of “frequent, regular FPS dropping” that the other two games have. And of course, within seconds, it freezes too, with the GPU driver crashing the exact same way.

Otherwise, I can play the game for 8 hours without any crashes. But starting Queen’s Blood, it happens literally in seconds.

So now I have 3 games (across different engines) exhibiting the same pattern, triggered by some erroneous behaviour of the game itself. But since 3 games are doing it, I don’t think this pattern can really be blamed on the games anymore…