[RESPONDED] Crashing amdgpu on AMD Ryzen 7040 13-inch (Ubuntu 22.04)

Hi there,

I received my system two weeks ago and have hit this error under Windows 11 and 10, and now under Linux. I tested with 64 GB of RAM and with my preferred configuration (96 GB).

When I use VMs (VMware or VirtualBox, with or without 3D support enabled), the amdgpu driver crashes consistently once more than two VMs are running, after a couple of minutes of using them.

I already experimented with different kernel settings (like restricting the GPU VRAM to 2048 MB while at the same time activating Gaming mode in the BIOS to ensure 4096 MB is initially allocated), but nothing seems to make a difference. The crashes keep happening. After the driver crash the system is still reachable via ssh. I can interact with it, but powering down fully doesn’t work. It stays “on” but unresponsive. I need to long-press the power button to shut it down and boot it up again.

My GRUB command line (at the moment) is:

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.power_dpm_state=performance amdgpu.power_dpm_force_performance_level=high amdgpu.gpu_recovery=1 amd_pstate=active rtc_cmos.use_acpi_alarm=1 pcie_aspm=off"

The kernel is:

Linux systemName 6.5.0-1011-oem #12-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan  3 20:17:42 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

The latest crash was:

Jan  9 14:43:59 systemName kernel: [ 1064.894981] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.03 10/17/2023

Jan 13 11:02:24 systemName kernel: [ 1490.004412] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=108303, emitted seq=108305
Jan 13 11:02:24 systemName kernel: [ 1490.005060] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2685 thread Xorg:cs0 pid 2765
Jan 13 11:02:24 systemName kernel: [ 1490.005256] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
Jan 13 11:02:24 systemName kernel: [ 1490.010556] amdgpu_cs_ioctl: 40 callbacks suppressed
Jan 13 11:02:24 systemName kernel: [ 1490.010559] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 13 11:02:29 systemName kernel: [ 1494.832181] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 13 11:02:29 systemName kernel: [ 1494.832191] amdgpu 0000:c1:00.0: amdgpu: Failed to disable gfxoff!
Jan 13 11:02:31 systemName kernel: [ 1497.101217] ------------[ cut here ]------------
Jan 13 11:02:31 systemName kernel: [ 1497.101223] WARNING: CPU: 12 PID: 4374 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn314/dcn314_smu.c:159 dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.101557] Modules linked in: xt_MASQUERADE xt_tcpudp xt_mark nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ccm rfcomm cmac algif_hash algif_skcipher af_alg vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nf_tables nfnetlink overlay bnep snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp intel_rapl_msr snd_sof_pci intel_rapl_common snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic snd_compress binfmt_misc btusb ledtrig_audio ac97_bus snd_hda_codec_hdmi mt7921e mt7921_common btrtl snd_pcm_dmaengine edac_mce_amd snd_hda_intel mt76_connac_lib btbcm snd_pci_ps snd_intel_dspcfg kvm_amd mt76 btintel snd_rpl_pci_acp6x snd_intel_sdw_acpi snd_hda_codec snd_acp_pci btmtk hid_sensor_als nls_iso8859_1 snd_pci_acp6x hid_sensor_trigger kvm snd_hda_core mac80211 bluetooth snd_pci_acp5x industrialio_triggered_buffer snd_hwdep snd_rn_pci_acp3x kfifo_buf snd_pcm input_leds irqbypass snd_acp_config ecdh_generic hid_sensor_iio_common cfg80211 ecc rapl snd_timer
Jan 13 11:02:31 systemName kernel: [ 1497.101627]  serio_raw snd_soc_acpi joydev industrialio hid_multitouch k10temp snd ccp snd_pci_acp3x libarc4 soundcore mac_hid amd_pmf amd_pmc platform_profile sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua parport_pc ppdev lp parport efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu amdxcp iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm crct10dif_pclmul drm_display_helper crc32_pclmul cec polyval_clmulni rc_core polyval_generic hid_sensor_hub ghash_clmulni_intel drm_kms_helper hid_generic aesni_intel drm nvme cros_ec_lpcs ucsi_acpi crypto_simd i2c_hid_acpi xhci_pci nvme_core cros_ec typec_ucsi cryptd video thunderbolt i2c_piix4 i2c_hid xhci_pci_renesas nvme_common typec wmi hid
Jan 13 11:02:31 systemName kernel: [ 1497.101708] CPU: 12 PID: 4374 Comm: kworker/u32:2 Tainted: G           O       6.5.0-1011-oem #12-Ubuntu
Jan 13 11:02:31 systemName kernel: [ 1497.101714] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jan 13 11:02:31 systemName kernel: [ 1497.101724] RIP: 0010:dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.101947] Code: 41 5e 5d 31 d2 31 c9 31 f6 31 ff e9 cd 59 7b cc 89 da 48 c7 c6 78 8f 58 c1 48 c7 c7 b0 bc 17 c1 e8 f8 c9 f2 cb e9 37 ff ff ff <0f> 0b 49 8b 3c 24 b9 80 84 1e 00 44 89 f2 44 89 ee e8 ad 12 df ff
Jan 13 11:02:31 systemName kernel: [ 1497.101950] RSP: 0018:ffff9d964372b8b0 EFLAGS: 00010246
Jan 13 11:02:31 systemName kernel: [ 1497.101953] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101955] RBP: ffff9d964372b8d0 R08: 0000000000000000 R09: 0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101956] R10: 0000000000000000 R11: 0000000000000000 R12: ffff922681097800
Jan 13 11:02:31 systemName kernel: [ 1497.101957] R13: 0000000000000015 R14: 0000000000000500 R15: ffff9226a11a0000
Jan 13 11:02:31 systemName kernel: [ 1497.101959] FS:  0000000000000000(0000) GS:ffff923cc2100000(0000) knlGS:0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 13 11:02:31 systemName kernel: [ 1497.101962] CR2: 0000563662411548 CR3: 0000000195e8a000 CR4: 0000000000750ee0
Jan 13 11:02:31 systemName kernel: [ 1497.101964] PKRU: 55555554
Jan 13 11:02:31 systemName kernel: [ 1497.101965] Call Trace:
Jan 13 11:02:31 systemName kernel: [ 1497.101967]  <TASK>
Jan 13 11:02:31 systemName kernel: [ 1497.101971]  ? show_regs+0x6d/0x80
Jan 13 11:02:31 systemName kernel: [ 1497.101978]  ? __warn+0x89/0x160
Jan 13 11:02:31 systemName kernel: [ 1497.101983]  ? dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102201]  ? report_bug+0x17e/0x1b0
Jan 13 11:02:31 systemName kernel: [ 1497.102209]  ? handle_bug+0x46/0x90
Jan 13 11:02:31 systemName kernel: [ 1497.102214]  ? exc_invalid_op+0x18/0x80
Jan 13 11:02:31 systemName kernel: [ 1497.102217]  ? asm_exc_invalid_op+0x1b/0x20
Jan 13 11:02:31 systemName kernel: [ 1497.102224]  ? dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102450]  dcn314_smu_set_zstate_support+0x42/0x60 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102662]  dcn314_update_clocks+0x473/0x550 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102869]  ? srso_alias_return_thunk+0x5/0x7f
Jan 13 11:02:31 systemName kernel: [ 1497.102874]  ? dm_read_reg_func+0x60/0xf0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103090]  dcn20_optimize_bandwidth+0x13e/0x290 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103305]  dc_commit_state_no_check+0x91d/0xd30 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103509]  dc_commit_streams+0x311/0x6c0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103711]  dm_suspend+0x202/0x260 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103923]  amdgpu_device_ip_suspend_phase1+0xb2/0x1c0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104066]  amdgpu_device_ip_suspend+0x20/0x80 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104208]  amdgpu_device_pre_asic_reset+0xd4/0x490 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104351]  amdgpu_device_gpu_recover+0x4ad/0xa70 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104494]  amdgpu_job_timedout+0x182/0x270 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104689]  drm_sched_job_timedout+0x6d/0x120 [gpu_sched]
Jan 13 11:02:31 systemName kernel: [ 1497.104696]  process_one_work+0x23d/0x450
Jan 13 11:02:31 systemName kernel: [ 1497.104702]  worker_thread+0x50/0x3f0
Jan 13 11:02:31 systemName kernel: [ 1497.104704]  ? srso_alias_return_thunk+0x5/0x7f
Jan 13 11:02:31 systemName kernel: [ 1497.104706]  ? __pfx_worker_thread+0x10/0x10
Jan 13 11:02:31 systemName kernel: [ 1497.104708]  kthread+0xef/0x120
Jan 13 11:02:31 systemName kernel: [ 1497.104712]  ? __pfx_kthread+0x10/0x10
Jan 13 11:02:31 systemName kernel: [ 1497.104715]  ret_from_fork+0x44/0x70
Jan 13 11:02:31 systemName kernel: [ 1497.104720]  ? __pfx_kthread+0x10/0x10
Jan 13 11:02:31 systemName kernel: [ 1497.104722]  ret_from_fork_asm+0x1b/0x30
Jan 13 11:02:31 systemName kernel: [ 1497.104728]  </TASK>
Jan 13 11:02:31 systemName kernel: [ 1497.104729] ---[ end trace 0000000000000000 ]---
Jan 13 11:02:31 systemName kernel: [ 1497.200178] VirtualBoxVM[3941]: segfault at 0 ip 00007f53472f14e0 sp 00007fff5bd8a370 error 4
Jan 13 11:02:31 systemName kernel: [ 1497.200189] VirtualBoxVM[4216]: segfault at 0 ip 00007f1478af14e0 sp 00007ffec11e1750 error 4
Jan 13 11:02:31 systemName kernel: [ 1497.200190]  in libQt5Gui.so.5.15.3[7f53472e2000+4df000] likely on CPU 0 (core 0, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200195]  in libQt5Gui.so.5.15.3[7f1478ae2000+4df000]
Jan 13 11:02:31 systemName kernel: [ 1497.200196] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.200198]  likely on CPU 4 (core 2, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200200] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.200357] VirtualBoxVM[4013]: segfault at 0 ip 00007f69414f14e0 sp 00007fffeda92480 error 4 in libQt5Gui.so.5.15.3[7f69414e2000+4df000] likely on CPU 5 (core 2, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200368] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.200459] VirtualBoxVM[4078]: segfault at 0 ip 00007f0b572f14e0 sp 00007ffd24617510 error 4 in libQt5Gui.so.5.15.3[7f0b572e2000+4df000] likely on CPU 4 (core 2, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200467] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.206206] VirtualBoxVM[4146]: segfault at 0 ip 00007fda09aeeb08 sp 00007ffedf2d67b8 error 4 in libQt5Gui.so.5.15.3[7fda09ae2000+4df000] likely on CPU 15 (core 7, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.206222] Code: 4d 00 48 89 44 24 18 31 c0 e8 f4 a8 ff ff 48 8b 44 24 28 64 48 2b 04 25 28 00 00 00 74 05 e8 ef 88 ff ff 31 c0 48 83 c4 38 c3 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 05 37 36 4d 00 66 0f ef c0 48
Jan 13 11:02:36 systemName kernel: [ 1501.829121] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1501.829396] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1501.957658] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1501.957783] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.086064] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.086183] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.214428] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.214546] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.342772] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.342993] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.471236] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.471355] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.599593] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.599709] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.727935] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.728053] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:37 systemName kernel: [ 1502.856283] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:37 systemName kernel: [ 1502.856399] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:37 systemName kernel: [ 1502.857891] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
Jan 13 11:02:41 systemName kernel: [ 1507.749339] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 13 11:02:41 systemName kernel: [ 1507.749349] amdgpu 0000:c1:00.0: amdgpu: Mode2 reset failed!
Jan 13 11:02:41 systemName kernel: [ 1507.749353] amdgpu 0000:c1:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:c1:00.0
Jan 13 11:02:41 systemName kernel: [ 1507.749417] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 13 11:02:41 systemName kernel: [ 1507.749930] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
Jan 13 11:02:41 systemName kernel: [ 1507.750038] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Jan 13 11:02:46 systemName kernel: [ 1512.673746] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 13 11:02:46 systemName kernel: [ 1512.673753] amdgpu 0000:c1:00.0: amdgpu: Failed to SetDriverDramAddr!
Jan 13 11:02:46 systemName kernel: [ 1512.673755] amdgpu 0000:c1:00.0: amdgpu: Failed to setup smc hw!
Jan 13 11:02:46 systemName kernel: [ 1512.673758] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
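
For anyone comparing their own logs against the one above, here is a minimal Python sketch that pulls out the ring timeout lines and names the negative error codes. The function name `summarize` and the regular expressions are illustrative only and not part of any tool; the errno names are the standard Linux values for 62, 95 and 125.

```python
import re

# Linux errno values for the negative codes that appear in these amdgpu logs.
LINUX_ERRNO = {62: "ETIME", 95: "EOPNOTSUPP", 125: "ECANCELED"}

# Matches e.g. "ring gfx_0.0.0 timeout, signaled seq=108303, emitted seq=108305".
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

# A negative integer that is not glued to a word or a version string.
NEG_INT = re.compile(r"(?<![\w.])-(\d+)\b")


def summarize(lines):
    """Return (ring timeouts, errno names) found in the given log lines."""
    timeouts, errors = [], []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            timeouts.append(
                (match["ring"], int(match["signaled"]), int(match["emitted"]))
            )
        for code in NEG_INT.findall(line):
            name = LINUX_ERRNO.get(int(code))
            if name:
                errors.append(name)
    return timeouts, errors
```

Feeding it the output of `journalctl -k` line by line shows which ring timed out first and how many submitted jobs were still outstanding (emitted minus signaled).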

As I said, it is (mostly) triggerable with VMware Workstation 17 and VirtualBox (not installed at the same time, before you ask ;-)). But other GPU-related applications can trigger it as well.

To rule out faulty RAM, I bought two 64 GB bundles and two 96 GB bundles and ran Memtest86+ multiple times. Do you have any suggestions for what I should do?

Thank you very much!
marpie

From what you’ve shared this looks more like a userspace bug (mesa or software). The GPU transaction timed out so the GPU tried to reset to recover.

The reset failing is more likely a kernel bug, but determining that really depends on the whole set of circumstances.

To better narrow it down:

  1. Can you reproduce it with Wayland as well?
  2. Can you reproduce using OEM 6.1 kernel?
  3. Are you sure VirtualBox triggers it? Don’t use VMware at all on that boot.
  4. Can you still reproduce it after dropping the amdgpu module parameters you’ve added for dpm?

BTW
amd_pstate=active is unnecessary; this is the kernel default in kernel 6.5 or later.
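
As a rough illustration of that suggestion, the Python sketch below strips the dpm-related amdgpu parameters and `amd_pstate=active` from a command line string. Which tokens count as "for dpm" is an assumption based on the command line posted above, and `strip_params` is a made-up helper name.

```python
# Tokens to drop: the amdgpu power_dpm_* entries and the redundant amd_pstate.
# This selection is an assumption based on the posted command line.
DROP_PREFIXES = ("amdgpu.power_dpm_",)
DROP_EXACT = ("amd_pstate=active",)


def strip_params(cmdline):
    """Return the command line without the dpm-related tokens."""
    kept = [
        token
        for token in cmdline.split()
        if not token.startswith(DROP_PREFIXES) and token not in DROP_EXACT
    ]
    return " ".join(kept)


posted = (
    "amdgpu.power_dpm_state=performance "
    "amdgpu.power_dpm_force_performance_level=high "
    "amdgpu.gpu_recovery=1 amd_pstate=active "
    "rtc_cmos.use_acpi_alarm=1 pcie_aspm=off"
)
print(strip_params(posted))
```

The result would go back into GRUB_CMDLINE_LINUX_DEFAULT, followed by regenerating the GRUB configuration and rebooting.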

  1. Yes, I used Fedora 39 first and it was the same.
  2. First I used the default Ubuntu kernel, then I saw the 6.1-oem kernel mentioned here in the forum and tried it. This reduced the crashes somewhat. Switching to 6.5-oem (subjectively) reduced the crash frequency further.
  3. No, I’m not sure. But I can use the system for several hours with VS Code, Firefox and many terminals, whereas running two to three VMs in parallel quickly triggers the crash.

The really crazy thing is that the same crashes were happening on Windows 11 (and 10) as soon as I used gpt4all (with a GPU model).

Oh and yes, VMware is completely uninstalled and I manually removed the kernel modules and powered down the system several times.

Yes, I started without the parameters first when switching kernels to see if it was all fixed already, but unfortunately it still occurs.

OK - I think it’s best raised with Mesa. To get the most help, I think you should come up with a set of reproduction steps that are as easy as possible for someone to follow with the latest stack. IOW I’d suggest:

  1. Install Fedora 39; do all the updates (it tracks newer kernel and newer mesa).
  2. Install Virtualbox and set up a few VMs (free software that anyone can access).
  3. Set up sshd.
  4. Set up umr. This is in the Fedora repos not in Ubuntu’s though.
  5. Find another box you can ssh in from at the time of the failure.
  6. Turn off gpu recovery so the GPU doesn’t try to recover when this issue happens.
  7. Familiarize yourself with how to dump the GC registers with UMR.
  8. Reproduce issue.
  9. When you reproduce it, ssh in from your other box and capture the following.
    a. Kernel log (dmesg or journalctl -k)
    b. All the GC registers using umr.

Hopefully that should get you a good bug report that a mesa developer can analyze.
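
Step 9a can be scripted from the second box. This is a minimal Python sketch, assuming key-based ssh access and a user on the laptop that is allowed to read the kernel journal; the host name and the output file naming are placeholders. The umr register dump from step 9b is deliberately left out, since the exact umr invocation should be taken from its own documentation.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def journal_command(host):
    """Build the ssh command that prints the remote kernel log of the current boot."""
    return ["ssh", host, "journalctl", "-k", "-b", "--no-pager"]


def capture_kernel_log(host, out_dir="."):
    """Save the remote kernel log to a timestamped file and return its path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_path = Path(out_dir) / f"kernel-{host}-{stamp}.log"
    result = subprocess.run(
        journal_command(host), capture_output=True, text=True, check=True
    )
    out_path.write_text(result.stdout)
    return out_path
```

For step 6, the command line posted earlier sets amdgpu.gpu_recovery=1, so setting that parameter to 0 is presumably what turning recovery off amounts to.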

Not saying it’s not the same, but what makes you think it’s the same? Does the GPU also fail to recover in Windows? It’s really hard to tell whether crashes are the same in Windows and Linux, since they use the hardware differently. But if you’re right that it is the same, this points at a lower-level platform firmware problem, not a driver problem.

No, I’m not really sure it is the same bug. The only thing that makes me feel like it is the same bug is the behavior: a GPU-intensive application is started, a few seconds pass, and the screen turns black and does not recover.

After resetting the system using the Power-button the event log shows a generic GPU error.

I will try with Fedora 39 the next day. Thank you very much for your help.

Please do test this against Fedora 39 and see if the issue occurs there as well.

I’ve had this issue (and others). It was solved by updating the amdgpu firmware blobs to the latest from the firmware repository (I will not explain this here, it can be googled), moving to kernel 6.5, and adding amdgpu.sg_display=0 to the kernel command line. The first two solved the crashes; the latter solved the “display looks like a Christmas tree lights spectacle” and the white flashing after resume.

Hi consp, I already replaced the bin files as part of the upgrade to 6.5. I added sg_display back and will try to repro it. I’m in contact with support, and after several hours of testing over the last couple of days I have a good repro using sysbench and glmark2 that triggers the crash consistently once temperatures get higher.

So I will run the tests and report back.
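
Since the repro seems to depend on temperature, logging the sensors while the load runs may help correlate the two. The Python sketch below reads the standard hwmon sysfs files, which report millidegrees Celsius; the function names are made up, and which chips show up (k10temp and amdgpu are both in the module list above) depends on the system.

```python
import time
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<chip>/<sensor file>": degrees Celsius} for readable sensors."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for sensor in sorted(chip_dir.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read right now
            temps[f"{chip}/{sensor.name}"] = millidegrees / 1000.0
    return temps


def log_temps(samples=10, interval=5.0):
    """Print one timestamped reading per interval, for a fixed number of samples."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_hwmon_temps(), flush=True)
        time.sleep(interval)
```

Running `log_temps()` over ssh from a second machine keeps the last readings visible even if the local display goes black.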

So the system keeps crashing after some time. As soon as the system is “warmed up” it keeps crashing even in normal software like Firefox or vi on the command line. So this did (unfortunately) not fix the issue for me. I hope support can help me with replacement hardware.

Just want to confirm that I’m seeing exactly the same behaviour even with the suggested workarounds. I’m running the latest Fedora 39 (with 64 GB memory and game optimised mode), and the system grinds to a complete halt (though it can sometimes be safely shut down via SSH) with similar messages. It happens much more frequently when using Firefox and doing light browsing; video decoding is a completely different (and known) issue that causes other hangs.
Very infrequently (maybe 1/3rd of the time?) the GPU manages to recover and reset, but generally not. This happens several times a day unfortunately.

Jan 26 13:50:30 kronk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_unified_0 timeout, signaled seq=2668, emitted seq=2670
Jan 26 13:50:30 kronk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RDD Process pid 4246 thread firefox-bi:cs0 pid 24983
Jan 26 13:50:30 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
Jan 26 13:50:32 kronk kernel: [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002n
Jan 26 13:50:32 kronk kernel: [drm] Register(0) [regUVD_RB_RPTR] failed to reach value 0x000001c0 != 0x00000140n
Jan 26 13:50:35 kronk kernel: [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002n
Jan 26 13:50:35 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 26 13:50:35 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Failed to disable gfxoff!
Jan 26 13:50:40 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 26 13:50:40 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Failed to power gate VCN!
Jan 26 13:50:40 kronk kernel: [drm:vcn_v4_0_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
Jan 26 13:50:43 kronk kernel: ------------[ cut here ]------------
Jan 26 13:50:43 kronk kernel: WARNING: CPU: 6 PID: 22763 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn314/dcn314_smu.c:159 dcn314_smu_send_msg_with_param+0x108/0x180 [amdgpu]
Jan 26 13:50:43 kronk kernel: Modules linked in: exfat cdc_mbim cdc_wdm cdc_ncm uas cdc_ether usb_storage usbnet mii overlay uinput hid_logitech_hidpp uhid rfcomm snd_seq_dummy snd_hrtimer ip6table_nat tun nf_conntrack_ne>
Jan 26 13:50:43 kronk kernel:  irqbypass hid_sensor_als hid_sensor_trigger snd_seq mac80211 snd_seq_device hid_sensor_iio_common snd_pci_acp6x rapl industrialio_triggered_buffer snd_pci_acp5x snd_pcm kfifo_buf libarc4 snd>
Jan 26 13:50:43 kronk kernel: CPU: 6 PID: 22763 Comm: kworker/u32:30 Tainted: P           O       6.6.12-200.fc39.x86_64 #1
Jan 26 13:50:43 kronk kernel: Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.03 10/17/2023
Jan 26 13:50:43 kronk kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jan 26 13:50:43 kronk kernel: RIP: 0010:dcn314_smu_send_msg_with_param+0x108/0x180 [amdgpu]
Jan 26 13:50:43 kronk kernel: Code: be 93 62 01 00 5d 41 5c 41 5d e9 13 21 f0 ff 44 89 ea 48 c7 c6 30 cb fd c0 48 c7 c7 a0 1f b8 c0 e8 9d 80 f0 cb e9 48 ff ff ff <0f> 0b 48 8b 3b b9 80 84 1e 00 44 89 e2 89 ee e8 04 b3 f0 >
Jan 26 13:50:43 kronk kernel: RSP: 0018:ffffc9000b68f988 EFLAGS: 00010246
Jan 26 13:50:43 kronk kernel: RAX: 0000185dca233cde RBX: ffff888100e2d400 RCX: 0000000000000006
Jan 26 13:50:43 kronk kernel: RDX: 0000000000008907 RSI: 00000000000080a9 RDI: 0000185dca22b3d7
Jan 26 13:50:43 kronk kernel: RBP: 0000000000000012 R08: ffffc9000b68f99c R09: 0000000000000009
Jan 26 13:50:43 kronk kernel: R10: 0000000000000000 R11: 00000000000001d3 R12: 0000000000000007
Jan 26 13:50:43 kronk kernel: R13: 0000000000000000 R14: ffffc9000b68f9b8 R15: 0000000000000009
Jan 26 13:50:43 kronk kernel: FS:  0000000000000000(0000) GS:ffff888f61f80000(0000) knlGS:0000000000000000
Jan 26 13:50:43 kronk kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 26 13:50:43 kronk kernel: CR2: 00007f2d6bad5c50 CR3: 0000000861222000 CR4: 0000000000f50ee0
Jan 26 13:50:43 kronk kernel: PKRU: 55555554
Jan 26 13:50:43 kronk kernel: Call Trace:
Jan 26 13:50:43 kronk kernel:  <TASK>
Jan 26 13:50:43 kronk kernel:  ? dcn314_smu_send_msg_with_param+0x108/0x180 [amdgpu]
Jan 26 13:50:43 kronk kernel:  ? __warn+0x81/0x130
Jan 26 13:50:43 kronk kernel:  ? dcn314_smu_send_msg_with_param+0x108/0x180 [amdgpu]
Jan 26 13:50:43 kronk kernel:  ? report_bug+0x171/0x1a0
Jan 26 13:50:43 kronk kernel:  ? handle_bug+0x3c/0x80
Jan 26 13:50:43 kronk kernel:  ? exc_invalid_op+0x17/0x70
Jan 26 13:50:43 kronk kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jan 26 13:50:43 kronk kernel:  ? dcn314_smu_send_msg_with_param+0x108/0x180 [amdgpu]
Jan 26 13:50:43 kronk kernel:  ? dcn314_smu_send_msg_with_param+0xae/0x180 [amdgpu]
Jan 26 13:50:43 kronk kernel:  dcn314_update_clocks+0x3db/0x480 [amdgpu]
Jan 26 13:50:43 kronk kernel:  dcn20_optimize_bandwidth+0xff/0x1e0 [amdgpu]
Jan 26 13:50:43 kronk kernel:  dc_commit_state_no_check+0xb77/0xe20 [amdgpu]
Jan 26 13:50:43 kronk kernel:  dc_commit_streams+0x29b/0x400 [amdgpu]
Jan 26 13:50:43 kronk kernel:  dm_suspend+0x1b8/0x1d0 [amdgpu]
Jan 26 13:50:43 kronk kernel:  amdgpu_device_ip_suspend_phase1+0x6e/0xe0 [amdgpu]
Jan 26 13:50:43 kronk kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jan 26 13:50:43 kronk kernel:  amdgpu_device_ip_suspend+0x1f/0x70 [amdgpu]
Jan 26 13:50:43 kronk kernel:  amdgpu_device_pre_asic_reset+0xd3/0x2a0 [amdgpu]
Jan 26 13:50:43 kronk kernel:  amdgpu_device_gpu_recover+0x4c6/0xd80 [amdgpu]
Jan 26 13:50:43 kronk kernel:  amdgpu_job_timedout+0x186/0x270 [amdgpu]
Jan 26 13:50:43 kronk kernel:  ? finish_task_switch.isra.0+0x94/0x2f0
Jan 26 13:50:43 kronk kernel:  drm_sched_job_timedout+0x77/0x110 [gpu_sched]
Jan 26 13:50:43 kronk kernel:  process_one_work+0x171/0x340
Jan 26 13:50:43 kronk kernel:  worker_thread+0x27b/0x3a0
Jan 26 13:50:43 kronk kernel:  ? __pfx_worker_thread+0x10/0x10
Jan 26 13:50:43 kronk kernel:  kthread+0xe5/0x120
Jan 26 13:50:43 kronk kernel:  ? __pfx_kthread+0x10/0x10
Jan 26 13:50:43 kronk kernel:  ret_from_fork+0x31/0x50
Jan 26 13:50:43 kronk kernel:  ? __pfx_kthread+0x10/0x10
Jan 26 13:50:43 kronk kernel:  ret_from_fork_asm+0x1b/0x30
Jan 26 13:50:43 kronk kernel:  </TASK>
Jan 26 13:50:43 kronk kernel: ---[ end trace 0000000000000000 ]---
Jan 26 13:50:45 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:45 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:45 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:45 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:45 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:45 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:46 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:46 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:46 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:46 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:46 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 26 13:50:46 kronk kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 26 13:50:46 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
Jan 26 13:50:51 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 26 13:50:51 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Mode2 reset failed!
Jan 26 13:50:51 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:c1:00.0
Jan 26 13:50:51 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 26 13:50:51 kronk kernel: [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
Jan 26 13:50:51 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Jan 26 13:50:56 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 26 13:50:56 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Failed to SetDriverDramAddr!
Jan 26 13:50:56 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Failed to setup smc hw!
Jan 26 13:50:56 kronk kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
Jan 26 13:50:56 kronk kernel: [drm] Skip scheduling IBs!
Jan 26 13:50:56 kronk kernel: [drm] Skip scheduling IBs!
Jan 26 13:50:56 kronk kernel: [drm] Skip scheduling IBs!
Jan 26 13:50:56 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: (-95) failed to switch to video power profile mode
Jan 26 13:50:56 kronk kernel: [drm] Skip scheduling IBs!
Jan 26 13:50:57 kronk kernel: show_signal_msg: 166768 callbacks suppressed
Jan 26 13:50:57 kronk kernel: firefox-bi:cs0[24983]: segfault at 0 ip 000055e7fe34c62a sp 00007f903d9fea00 error 6 in firefox-bin[55e7fe2cc000+bd000] likely on CPU 5 (core 2, socket 0)
Jan 26 13:50:57 kronk kernel: Code: 41 56 53 50 48 89 fb 4c 8b 35 42 e5 03 00 49 8b 36 e8 5a c0 03 00 49 8b 36 bf 0a 00 00 00 e8 3d c1 03 00 48 89 1d 6e 17 04 00 <c7> 04 25 00 00 00 00 23 00 00 00 e8 e6 49 fc ff cc cc cc >
Jan 26 13:50:58 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: (-95) failed to disable video power profile mode

Out of the box, fully updated, should be good to go.

Tainted: P O 6.6.12-200.fc39.x86_64 #1

This feels like something out of the ordinary was changed outside of a standard install.
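
For readers unfamiliar with the taint field, here is a small Python sketch that decodes the letters. The table covers only a subset of the flags described in the kernel's tainted-kernels documentation, and `decode_taint` is an illustrative name. Which modules set P and O on that system cannot be told from the truncated module list above; in the first log, the vbox modules are the ones marked (O).

```python
# Subset of the kernel taint letters; see the kernel's tainted-kernels docs
# for the full table.
TAINT_FLAGS = {
    "G": "all loaded modules have a GPL-compatible license",
    "P": "a proprietary module was loaded",
    "F": "a module was force loaded",
    "W": "the kernel issued a warning",
    "O": "an externally built (out-of-tree) module was loaded",
    "E": "an unsigned module was loaded",
}


def decode_taint(field):
    """Return (letter, meaning) pairs for a taint string such as 'P O'."""
    return [
        (letter, TAINT_FLAGS.get(letter, "not in this table"))
        for letter in field
        if not letter.isspace()
    ]


for letter, meaning in decode_taint("P O"):
    print(f"{letter}: {meaning}")
```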