[SOLVED] Amdgpu driver crash during high loads

Hi!

I just received my Framework laptop (16) but I’m experiencing issues with what seems to be the amdgpu driver.

It happens randomly when the system load increases (it happened on Google Maps, for example; almost always with Firefox, never with Chromium so far).

To reproduce it, I can just open “Lots-O-Images” on http://webglsamples.org/, check all the checkboxes, and wait 3 to 5 seconds. It will either:

  • Freeze the screen, flash to black but recover (not on this website though, the load is too high)
  • Freeze the screen, flash to black and crash Xorg, so I’ll land on the login view
  • Freeze the screen, flash to black and crash the whole OS, leading to a reboot (a kernel panic, I guess)

See at the end for an example of crash log (dmesg).

System info:

  • Debian stable (12) with the latest kernel (6.10.11+bpo-amd64) and amdgpu driver (2.4.123-1~bpo12+1) from backports.
  • Framework 16
  • Currently used with a USB-C hub on port #2 (4 USB, 1 RJ45, 1 HDMI), power on port #4, and an audio jack on #5 (but the bug also happens when nothing is plugged in, so that should not matter).

I tried adding amdgpu.ppfeaturemask=0xfffd3fff or amdgpu.sg_display=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (and then running update-grub), but it did not help.
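When a kernel parameter seems to have no effect, it can be worth confirming after the reboot that it actually reached the running kernel, which exposes its command line in /proc/cmdline. Here is a minimal Python sketch of that check. The helper names `parse_cmdline` and `missing_params` are made up for illustration, and the parsing assumes no parameter value contains quoted spaces.

```python
from typing import Optional


def parse_cmdline(cmdline: str) -> dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Flags without '=' (e.g. 'quiet') map to None.
    Simplification: values containing quoted spaces are not handled.
    """
    params: dict[str, Optional[str]] = {}
    for token in cmdline.split():
        # Split on the first '=' only, so 'root=UUID=abc' keeps 'UUID=abc' as value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: dict[str, str]) -> list[str]:
    """Return the wanted 'name=value' pairs that are absent from the command line."""
    active = parse_cmdline(cmdline)
    return [f"{name}={value}" for name, value in wanted.items()
            if active.get(name) != value]
```

On the affected machine, this could be called as `missing_params(open("/proc/cmdline").read(), {"amdgpu.sg_display": "0"})`. An empty list would mean the parameter is active, so the crash is not simply due to GRUB ignoring the change.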

I was planning to use this for work but if I can’t make it work it’s just wasted money…

Thanks in advance for your help! :slight_smile:

Crash log:

[   78.581580] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=45147, emitted seq=45149
[   78.581751] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox-esr pid 2838 thread firefox-es:cs0 pid 2906
[   78.581890] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[   82.609877] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[   82.609886] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   86.458455] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[   86.458464] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   90.321521] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[   90.321527] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   94.186092] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[   94.186101] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   98.050104] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[   98.050113] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   99.841252] rcu: INFO: rcu_preempt self-detected stall on CPU
[   99.841260] rcu: 	10-....: (5249 ticks this GP) idle=be84/1/0x4000000000000000 softirq=12434/12434 fqs=2621
[   99.841266] rcu: 	(t=5250 jiffies g=12653 q=2400 ncpus=16)
[   99.841270] CPU: 10 PID: 115 Comm: kworker/u64:2 Not tainted 6.10.11+bpo-amd64 #1  Debian 6.10.11-1~bpo12+1
[   99.841273] Hardware name: Framework Laptop 16 (AMD Ryzen 7040 Series)/FRANMZCP07, BIOS 03.03 03/27/2024
[   99.841275] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[   99.841296] RIP: 0010:delay_halt_mwaitx+0x3c/0x50
[   99.841305] Code: 31 d2 48 89 d1 48 05 00 60 00 00 0f 01 fa b8 ff ff ff ff b9 02 00 00 00 48 39 c6 48 0f 46 c6 48 89 c3 b8 f0 00 00 00 0f 01 fb <5b> e9 09 5f 2a 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90
[   99.841307] RSP: 0018:ffffba9ac0573960 EFLAGS: 00000297
[   99.841309] RAX: 00000000000000f0 RBX: 0000000000001d65 RCX: 0000000000000002
[   99.841310] RDX: 0000000000000000 RSI: 0000000000001d65 RDI: 00000063ab30a110
[   99.841312] RBP: 0000000000001d65 R08: 0000000000000100 R09: 0000000000000003
[   99.841314] R10: ffffba9ac0573a68 R11: ffffffff9ecca408 R12: 0000000000000040
[   99.841315] R13: 00000000002dc6c0 R14: ffffa0a324c44290 R15: 0000000000000000
[   99.841317] FS:  0000000000000000(0000) GS:ffffa0aa5e700000(0000) knlGS:0000000000000000
[   99.841319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.841320] CR2: 00007fa0018f1000 CR3: 0000000849c20000 CR4: 0000000000750ef0
[   99.841322] PKRU: 55555554
[   99.841323] Call Trace:
[   99.841326]  <IRQ>
[   99.841330]  ? rcu_dump_cpu_stacks+0xcb/0x110
[   99.841338]  ? rcu_sched_clock_irq+0x347/0x1100
[   99.841346]  ? srso_alias_return_thunk+0x5/0xfbef5
[   99.841351]  ? notifier_call_chain+0x5a/0xd0
[   99.841357]  ? srso_alias_return_thunk+0x5/0xfbef5
[   99.841358]  ? timekeeping_update+0xdd/0x130
[   99.841368]  ? srso_alias_return_thunk+0x5/0xfbef5
[   99.841369]  ? timekeeping_advance+0x377/0x590
[   99.841371]  ? srso_alias_return_thunk+0x5/0xfbef5
[   99.841372]  ? tmigr_requires_handle_remote+0x8d/0x100
[   99.841382]  ? update_process_times+0x6d/0xc0
[   99.841385]  ? tick_nohz_handler+0x8f/0x140
[   99.841394]  ? __pfx_tick_nohz_handler+0x10/0x10
[   99.841397]  ? __hrtimer_run_queues+0x10f/0x2a0
[   99.841400]  ? hrtimer_interrupt+0xfa/0x230
[   99.841403]  ? __sysvec_apic_timer_interrupt+0x55/0x150
[   99.841410]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[   99.841414]  </IRQ>
[   99.841415]  <TASK>
[   99.841416]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   99.841431]  ? delay_halt_mwaitx+0x3c/0x50
[   99.841433]  delay_halt+0x3c/0x70
[   99.841438]  amdgpu_fence_wait_polling+0x36/0x60 [amdgpu]
[   99.841777]  mes_v11_0_submit_pkt_and_poll_completion.constprop.0+0x2cc/0x3f0 [amdgpu]
[   99.841935]  mes_v11_0_unmap_legacy_queue+0x7f/0xd0 [amdgpu]
[   99.842086]  amdgpu_mes_unmap_legacy_queue+0x91/0xd0 [amdgpu]
[   99.842231]  amdgpu_gfx_disable_kcq+0xcf/0x190 [amdgpu]
[   99.842375]  gfx_v11_0_hw_fini+0x4d/0xf0 [amdgpu]
[   99.842518]  amdgpu_device_ip_suspend_phase2+0x102/0x1a0 [amdgpu]
[   99.842633]  ? amdgpu_device_ip_suspend_phase1+0x6c/0xe0 [amdgpu]
[   99.842753]  amdgpu_device_ip_suspend+0x40/0x70 [amdgpu]
[   99.842872]  amdgpu_device_pre_asic_reset+0xd0/0x2a0 [amdgpu]
[   99.842992]  amdgpu_device_gpu_recover+0x347/0xdc0 [amdgpu]
[   99.843113]  ? ___drm_dbg+0x90/0xd0 [drm]
[   99.843134]  amdgpu_job_timedout+0x13d/0x1f0 [amdgpu]
[   99.843296]  drm_sched_job_timedout+0x73/0x100 [gpu_sched]
[   99.843300]  process_one_work+0x179/0x390
[   99.843304]  worker_thread+0x265/0x380
[   99.843307]  ? __pfx_worker_thread+0x10/0x10
[   99.843308]  kthread+0xcf/0x100
[   99.843311]  ? __pfx_kthread+0x10/0x10
[   99.843313]  ret_from_fork+0x31/0x50
[   99.843317]  ? __pfx_kthread+0x10/0x10
[   99.843319]  ret_from_fork_asm+0x1a/0x30
[   99.843324]  </TASK>
[  101.915761] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  101.915767] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  105.776549] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  105.776560] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  109.636910] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  109.636919] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  113.499030] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  113.499037] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  113.918974] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[  113.920553] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[  113.930700] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[  113.931240] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[  113.931467] [drm] VRAM is lost due to GPU reset!
[  113.931474] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[  113.933286] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[  113.934867] [drm] DMUB hardware initialized: version=0x08000500
[  114.749854] pcieport 0000:00:08.1: PME: Spurious native interrupt!
[  114.755339] [drm] kiq ring mec 3 pipe 1 q 0
[  114.757425] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[  114.758323] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  114.758328] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  114.758332] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  114.758334] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[  114.758337] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[  114.758339] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[  114.758341] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[  114.758344] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[  114.758347] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[  114.758349] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  114.758352] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[  114.758354] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[  114.758357] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[  114.770587] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow start
[  114.770589] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow done
[  114.770600] amdgpu 0000:c1:00.0: amdgpu: GPU reset(2) succeeded!
[  114.772660] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
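For anyone comparing a similar log, the key events (ring timeout, unanswered MES messages, GPU reset) can be counted with a short script. This is a rough Python sketch, not an official tool. The regular expressions are taken from the messages in the log above and may not match other kernel versions, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns copied from the amdgpu messages in the log above (kernel 6.10.11);
# the wording may differ on other kernel versions.
PATTERNS = {
    "ring_timeout": re.compile(r"\*ERROR\* ring (\S+) timeout"),
    "mes_no_response": re.compile(r"MES failed to respond to msg=(\w+)"),
    "reset_begin": re.compile(r"GPU reset begin!"),
    "reset_succeeded": re.compile(r"GPU reset\(\d+\) succeeded!"),
    "vram_lost": re.compile(r"VRAM is lost due to GPU reset!"),
}
PROCESS = re.compile(r"Process information: process (\S+) pid (\d+)")


def summarize(lines):
    """Count the GPU-hang events and collect the processes blamed for them."""
    counts = Counter()
    processes = set()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
        match = PROCESS.search(line)
        if match:
            processes.add(match.group(1))
    return counts, processes
```

Given the lines of a saved dmesg dump, `summarize` returns the event counts and the set of process names reported by the timeout handler (firefox-esr in the log above).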

Maybe try setting the graphics memory to “gaming” in the BIOS. I ran those samples on my setup with no crashing. Good luck, I hope that you are able to get it resolved.

Thanks, I just tried, but it still crashed :-/

Also, I forgot to mention that upgrading the kernel did help; it was way more unstable before that.

Sorry to hear that. The only other thing that I can think of is to boot from a live USB and see if you can reproduce it. Try Fedora, say, in case you need to open a ticket with Support, as they will likely want you to test it on a supported distro. I am on kernel 6.11.5-arch1-1 on Arch, if it matters.

I tested it with Fedora (the Fedora Xfce release) and it does not crash. However, I experienced something weird: the interface was laggy. I could type on the keyboard or click, but the screen would refresh only when I moved the mouse or when enough activity happened. When I moved the window with the WebGL experiment to the HDMI screen, it suddenly ran smoothly, with no jerky screen refresh.

I’ll ask support to see what they think, thanks! :slight_smile:

Edit: The kernel on Fedora was 6.11.4-301-fc41.x86_64


I contacted support, and after quite a lot of questions and tests (testing the RAM, resetting the motherboard settings, etc.) we couldn’t find any way to make it work.

I tested Xubuntu from a live CD, and it seemed to work completely fine out of the box. I then decided to reinstall my system, which was quite painful: Xubuntu does not support cryptsetup alone during the install, so I had to install it on a separate hard drive, copy the files, update crypttab/fstab and rebuild /boot. I wasn’t so happy about it initially, because I left Ubuntu a while ago over long-term stability issues, but I hope it has improved.

And finally it works! :slight_smile: No crashes, no slowdowns, no freezes!! :partying_face:

FYI, here are the versions I have now:

  • Linux kernel 6.8.0-48-generic
  • libdrm-amdgpu1 2.4.120-2build1

Even if the issue itself isn’t solved, I guess I’ll close this thread.

Thanks again! :smiley:

Edit: I’ll close it… If I find how to do that :sweat_smile:

Hi @Dagrut,

Glad to hear you got it resolved. There are so many different Linux variants that it would be impossible to test them all for compatibility. Glad you had the knowledge and experience to basically hand-install the distro to your liking!

I will mark the thread solved for you! Congrats on your Framework Laptop 16 and welcome to the community!
:grin:
Tagging @Matt_Hartley in case he has not come across this, as he is one of the Linux masters for Framework! :white_check_mark:
