FW16 system freeze due to GPU issue?

As per title. I am using the onboard GPU. My system decided to freeze. This started happening a couple of weeks ago with little things, like my text editor freezing, or short browser freezes. I didn’t make much of it as I am, by now, used to seeing almost everything fail… anyway. Today my FW16 froze completely for a few seconds with a black screen, then the display came back but dimmed (I have auto dimming off, as it is just completely useless and overactive). According to dmesg, it seems that the GPU has issues?

[949897.015157] usb 1-4.1: reset full-speed USB device number 7 using xhci_hcd
[949906.620079] usb 1-4.1: reset full-speed USB device number 7 using xhci_hcd
[949910.388247] usb 1-4.1: reset full-speed USB device number 7 using xhci_hcd
[953387.075248] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State
[953387.077961] amdgpu 0000:c1:00.0: amdgpu: Dumping IP State Completed
[953387.088028] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=100822092, emitted seq=100822094
[953387.088033] amdgpu 0000:c1:00.0: amdgpu: Process information: process gnome-shell pid 1610 thread gnome-shel:cs0 pid 1623
[953389.091867] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=RESET
[953389.091874] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] ERROR failed to reset legacy queue
[953389.092158] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[953391.129736] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[953391.129744] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] ERROR failed to unmap legacy queue
[953391.342934] [drm:gfx_v11_0_cp_gfx_enable.isra.0 [amdgpu]] ERROR failed to halt cp gfx
[953391.344583] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[953391.384251] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[953391.384992] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[953391.385049] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming…
[953391.386998] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[953391.394433] [drm] DMUB hardware initialized: version=0x08004D00
[953392.246251] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[953392.246264] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[953392.246268] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[953392.246270] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[953392.246273] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[953392.246275] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[953392.246277] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[953392.246279] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[953392.246282] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[953392.246285] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[953392.246288] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[953392.246290] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[953392.246293] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[953392.249033] amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) succeeded!
[953392.283827] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Failed to initialize parser -125!
[953417.221745] usb 1-4.1: reset full-speed USB device number 7 using xhci_hcd
[953417.395540] usb 1-4.1: reset full-speed USB device number 7 using xhci_hcd
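
For anyone comparing logs, here is a minimal Python sketch that picks the ring timeout and the offending process out of output like the above. It assumes the default dmesg format with bracketed seconds-since-boot timestamps; the regular expressions are written against the lines in this post only, and the function name `find_ring_timeouts` is just illustrative.

```python
import re

# Written against lines like the ones in this post; other kernel versions
# may word these messages differently (assumption).
RING_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+(?P<pci>\S+):\s+amdgpu:\s+"
    r"ring\s+(?P<ring>\S+)\s+timeout,\s+signaled seq=(?P<signaled>\d+),\s+"
    r"emitted seq=(?P<emitted>\d+)"
)
PROCESS_INFO = re.compile(
    r"amdgpu: Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def find_ring_timeouts(lines):
    """Return one dict per 'ring ... timeout' line, plus the process if logged."""
    events = []
    for line in lines:
        m = RING_TIMEOUT.search(line)
        if m:
            events.append({
                "timestamp": float(m["ts"]),
                "pci": m["pci"],
                "ring": m["ring"],
                "signaled": int(m["signaled"]),
                "emitted": int(m["emitted"]),
                "process": None,
            })
            continue
        p = PROCESS_INFO.search(line)
        # Attach the process to the most recent timeout that has none yet.
        if p and events and events[-1]["process"] is None:
            events[-1]["process"] = (p["process"], int(p["pid"]))
    return events


# Three lines copied from the log above.
SAMPLE = [
    "[953387.088028] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 timeout, "
    "signaled seq=100822092, emitted seq=100822094",
    "[953387.088033] amdgpu 0000:c1:00.0: amdgpu: Process information: "
    "process gnome-shell pid 1610 thread gnome-shel:cs0 pid 1623",
    "[953389.092158] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!",
]

events = find_ring_timeouts(SAMPLE)
for e in events:
    print(e["ring"], "timed out;", e["emitted"] - e["signaled"],
          "submissions pending; process:", e["process"])
```

Feeding it the full output of `dmesg` (one line per list entry) would list every timeout since boot, which makes it easier to see whether the same process is involved each time.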

I am on Pop!_OS.

Hi,

These sorts of problems are dependent on three things:

  1. GPU firmware. Make sure you have the latest linux-firmware installed.
  2. GPU driver. Make sure you have the most up-to-date Linux kernel. You don’t mention which version you currently have. 6.15.2 might be available.
  3. Mesa. Get the most up-to-date version you can.

The only answer I can give you is: 6.12.10-76061203-generic on Pop!_OS.
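
To check whether a release string like that one is older than a given kernel, something along these lines should do. It is only a sketch: the helper names are made up for illustration, and it compares the numeric version only, ignoring distro suffixes such as `-76061203-generic`.

```python
import platform
import re


def kernel_version_tuple(release):
    """Extract (major, minor, patch) from a release string such as
    '6.12.10-76061203-generic'. Missing minor or patch levels count as 0."""
    m = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part or 0) for part in m.groups())


def is_older_than(release, target):
    """True if `release` is numerically older than `target`."""
    return kernel_version_tuple(release) < kernel_version_tuple(target)


# Version strings taken from this thread.
print(is_older_than("6.12.10-76061203-generic", "6.15.2"))

# The running kernel (the same string that `uname -r` prints on Linux).
current = platform.release()
print(current, "older than 6.15.2:", is_older_than(current, "6.15.2"))
```

Comparing tuples of integers rather than raw strings matters here: as plain text "6.8" sorts after "6.16", although 6.8 is the older kernel.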

I’ve tried these kernel parameters and they helped me with some similar GPU-related freezes:

    "amdgpu.dc=1"           # Enable display core
    "amdgpu.gpu_recovery=1" # Enable GPU recovery on hangs
    "amd_pstate=passive"    # Sometimes active mode causes issues
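
To confirm the parameters actually took effect after a reboot, the running kernel's command line can be read from `/proc/cmdline`. A small sketch, assuming Linux; the function names and the example command line are hypothetical, and only the three parameter strings come from the post above.

```python
from pathlib import Path

# The three parameters suggested above.
WANTED = ["amdgpu.dc=1", "amdgpu.gpu_recovery=1", "amd_pstate=passive"]


def missing_parameters(cmdline, wanted=WANTED):
    """Return the wanted parameters that do not appear as whole
    whitespace-separated tokens in the given kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text()


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet amdgpu.gpu_recovery=1"
print(missing_parameters(example))
# prints ['amdgpu.dc=1', 'amd_pstate=passive']
```

On the laptop itself, `missing_parameters(read_cmdline())` would list whichever of the three did not make it onto the boot command line.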

I reinstalled the OS. I don’t want to go into fixing these things anymore.

Completely understandable. I love this new laptop but getting it to work reliably has been a real adventure; there are 3 separate big issues which all need some kind of workaround and it took a pretty significant length of time to sort it all out.

If reinstalling doesn’t solve it completely, I also found this kernel which seems to solve the GPU issues completely:

Hope this is useful and sorry for your trouble.

To be clear, it wasn’t unreliable until a few weeks ago, when it started freezing seemingly at random. I am sorry to say, but all “technology” is failing quicker than a train leaves a station.
1: My 5y old LG TV is stuck on webOS 4, can’t install recent apps.
2: My HomePod suffers from loud pops.
3: My 1y old car needs constant updates to keep it driveable.
4: And now my 1y old Framework laptop becomes unreliable.
It’s all crap, just crap and nothing else.

Anyway, with that out of my system, I reinstalled it with Ubuntu 25.04. It ran fine for a week or so, then a complete lockup today; I made a video of it. I had to turn it off and on again. This is from dmesg. The laptop was not connected to a charger when it froze, and the battery was at about 75%:

[   12.687074] tee tee0: Direct firmware load for /amdtee/f29bb3d9-bd66-5441-afb88acc2b2b60d6.bin failed with error -2
[   12.687084] failed to load firmware /amdtee/f29bb3d9-bd66-5441-afb88acc2b2b60d6.bin
[   12.687090] failed to copy TA binary
[   12.687095] Failed to open TEE session err:0x0, rc:-12
[   12.687103] amd-pmf AMDI0102:00: Failed to open TA session (-12)
[   12.687119] amd-pmf AMDI0102:00: registered PMF device successfully
[   12.688811] cros_ec_lpcs cros_ec_lpcs.0: Chrome EC device registered
[   12.691376] [drm] Initialized amdxdna_accel_driver 0.0.0 for 0000:c2:00.1 on minor 0
[   12.717643] usbcore: registered new interface driver btusb
[   12.723595] snd_hda_intel 0000:c1:00.1: enabling device (0000 -> 0002)
[   12.723693] snd_hda_intel 0000:c1:00.1: Handle vga_switcheroo audio client
[   12.723791] snd_hda_intel 0000:c1:00.6: enabling device (0000 -> 0002)
[   12.728050] cros-charge-control cros-charge-control.5.auto: Framework charge control detected, preventing load
[   12.728748] mt7921e 0000:01:00.0: enabling device (0000 -> 0002)
[   12.728802] kvm_amd: TSC scaling supported
[   12.728805] kvm_amd: Nested Virtualization enabled
[   12.728806] kvm_amd: Nested Paging enabled
[   12.728808] kvm_amd: LBR virtualization supported
[   12.728816] kvm_amd: Virtual GIF supported
[   12.728817] kvm_amd: Virtual NMI enabled
[   12.736389] mt7921e 0000:01:00.0: ASIC revision: 79220010
[   12.736840] cros-usbpd-charger cros-usbpd-charger.6.auto: No USB PD charging ports found
[   12.738276] cros-usbpd-charger cros-usbpd-charger.6.auto: Unexpected number of charge port count
[   12.738281] cros-usbpd-charger cros-usbpd-charger.6.auto: Failing probe (err:0xffffffb9)
[   12.738285] cros-usbpd-charger cros-usbpd-charger.6.auto: probe with driver cros-usbpd-charger failed with error -71

Completely agree. I’ve now had the GPU lockup again, on the AMD staging kernel. Also, the wireless flakiness I was experiencing has come back, even though I disabled power saving (which I thought was the solution).

IDK man. I filed bugs about both; we’ll see how it goes. Framework support doesn’t seem real confidence-inspiring in terms of helping to solve the issues either. For the wireless thing, I think I may just wind up purchasing a non-Mediatek card; I’ve heard from people who’ve done that and it actually addressed the issue.

If you do find a solution to the GPU thing, let me know, I’d love to try it out.

I am digging deeper, in syslog I found this:

gnome-shell[1958]: Connection to xwayland lost
gnome-shell[1958]: Xwayland terminated, exiting since it was mandatory
gnome-shell[1958]: JS ERROR: Gio.IOErrorEnum: Xwayland exited unexpectedly

It happened again, now on Ubuntu 25.04, but maybe for different reasons? It’s not very reliable. It happened right after connecting my BT headset and playing a YT video; after less than 30s the laptop froze solid. I made a video of the frozen state, but that’s not helpful, I guess.

Bluetooth: hci0: SCO packet for unknown connection handle 3584
Bluetooth: hci0: ACL packet for unknown connection handle 3837

Yeah. FW support recommended that I enable “gaming mode” in the BIOS, which allocates 2 GB of memory to the GPU at all times, and that seems to make it happen way less often. With that, it becomes extremely intermittent (basically once every 1-2 weeks it’ll crash which is sure not great, but tolerable I guess), maybe that can help you.

It keeps crashing, and the crashes seem to involve Bluetooth or video playback in the browser. It crashes/freezes every 2 weeks or so (I made some videos of it freezing). I was running the Pop!_OS kernel 6.16, which seems to be too cutting edge? I have now reverted to 6.8. I hope it runs stably with this kernel.

I’ve been running for two weeks on kernel 6.6.101, and I haven’t seen the GPU freeze in all that time. I may start a massively time-consuming process of git bisect to try to figure out if there is a specific place between 6.6 and 6.16 where a GPU regression was introduced. Of course, it may be fooling me: I may see a freeze tomorrow on this kernel, which is what has happened in the past right after I reported that something or other had solved it.
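
For a sense of what the bisect would involve, here is a toy Python sketch of the same binary search, run over kernel release series instead of individual commits (real `git bisect` works on commits, so the actual step count would be higher). The "first bad" kernel in it is purely hypothetical and chosen only so the example runs; nobody in this thread knows where, or whether, a regression exists.

```python
def first_bad(candidates, is_bad):
    """Binary search for the first bad entry in an ordered list.

    Assumes candidates[0] is good, candidates[-1] is bad, and that once a
    candidate is bad every later one is bad too.
    Returns (first bad candidate, list of candidates that had to be tested).
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(candidates[mid])
        if is_bad(candidates[mid]):
            hi = mid
        else:
            lo = mid
    return candidates[hi], tested


# Release series between the two kernels mentioned above: "6.6" ... "6.16".
KERNELS = [f"6.{minor}" for minor in range(6, 17)]

# HYPOTHETICAL: pretend the freeze first appears in 6.12. This is made up
# for illustration and is not a finding from this thread.
HYPOTHETICAL_FIRST_BAD = "6.12"


def freezes(kernel):
    """Stand-in for 'boot this kernel and wait to see whether it freezes'."""
    return KERNELS.index(kernel) >= KERNELS.index(HYPOTHETICAL_FIRST_BAD)


culprit, tested = first_bad(KERNELS, freezes)
print("first bad:", culprit, "after testing", tested)
```

Only three kernels need testing to narrow down eleven, but with a freeze that shows up once every week or two, each "good" verdict means running that kernel for a couple of weeks, and even then it is not certain.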

I’m still getting wireless issues now with the non-Mediatek card. They’re not super painful but the driver is having internal errors and resetting itself which causes brief network dropouts. I’ve also seen weird chatter in the log about the disk subsystem. Framework support is saying that it may just be a mainboard issue which they’re willing to RMA if I can reproduce the wireless issues on Ubuntu, which is pretty kind of them. I’m planning to check over the weekend and see how the AX210 network card does when running under Ubuntu. It does kind of make sense that Mediatek issues + AX210 issues + GPU issues might be rooted back to just a very slightly broken mainboard. IDK. I’ll keep you posted.

Okay, just to keep you posted as promised:

Framework shipped me a replacement mainboard, which seems to have resolved both the GPU issues and the network issues (at least with the AX210, I don’t really want to fool with the Mediatek anymore, but the AX210 is rock-solid now so far). The new mainboard has some cooling issues, which is a separate thing, but it’s at least stable as far as I can tell so long as I don’t stress the CPU.

So, confusingly enough, it seems like there are three separate things, any one of which fixes the GPU freeze:

  • Running a 6.6 kernel
  • Updating BIOS to 03.07
  • Getting a new mainboard

The new mainboard has 03.03 BIOS FWIW. I have no idea. Anyway, that all is what I observed, and now I don’t get the crash anymore so it seems unlikely that I’ll ever be able to learn more (if there was in fact any explanation worth learning that’s more complex than “broken mainboard”).