Hi there,
I received my system two weeks ago and have had this error under Windows 11 and Windows 10, and now under Linux. I tested with 64 GB of RAM and with my preferred configuration, 96 GB.
When using VMs (VMware or VirtualBox, with or without 3D support enabled), the amdgpu driver crashes consistently once more than two VMs are running, after a couple of minutes of using them.
I already experimented with different kernel settings (like restricting the GPU VRAM to 2048 MB while at the same time activating Gaming mode in the BIOS to ensure 4096 MB is initially allocated), but nothing seems to make a difference. The crashes keep happening. After the driver crash the system is still reachable via SSH and I can interact with it, but powering down fully doesn’t work: it stays “on” but unresponsive. I need to long-press the power button to shut it down and boot it up again.
My GRUB command line (at the moment) is:
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.power_dpm_state=performance amdgpu.power_dpm_force_performance_level=high amdgpu.gpu_recovery=1 amd_pstate=active rtc_cmos.use_acpi_alarm=1 pcie_aspm=off"
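To confirm that these parameters actually reach the running kernel, the GRUB line can be compared against the contents of /proc/cmdline. The following is only a minimal sketch in Python under that assumption. The helper names (`grub_value`, `parse_params`, `not_active`) are made up for illustration and are not part of any existing tool.

```python
import shlex

# The GRUB line exactly as quoted above.
GRUB_LINE = (
    'GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.power_dpm_state=performance '
    'amdgpu.power_dpm_force_performance_level=high amdgpu.gpu_recovery=1 '
    'amd_pstate=active rtc_cmos.use_acpi_alarm=1 pcie_aspm=off"'
)


def grub_value(line):
    """Return the quoted value of a KEY="..." line from /etc/default/grub."""
    _, _, value = line.partition("=")
    return shlex.split(value)[0]


def parse_params(cmdline):
    """Split a kernel command line into {name: value}.

    Flags without '=' (for example 'quiet') map to an empty string.
    """
    params = {}
    for token in shlex.split(cmdline):
        name, _, value = token.partition("=")
        params[name] = value
    return params


def not_active(expected, running_cmdline):
    """Return the expected parameters that are missing or different
    in the given running command line string."""
    running = parse_params(running_cmdline)
    return {k: v for k, v in expected.items() if running.get(k) != v}


expected = parse_params(grub_value(GRUB_LINE))
for name, value in expected.items():
    print(f"{name} = {value}")
```

On the affected machine, the check would be `not_active(expected, open("/proc/cmdline").read())`. An empty result means every parameter from the GRUB line is present on the booted kernel's command line; it does not show whether the driver accepted or acted on each one.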
The kernel is:
Linux systemName 6.5.0-1011-oem #12-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 3 20:17:42 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
The latest crash was:
Jan 9 14:43:59 systemName kernel: [ 1064.894981] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.03 10/17/2023
Jan 13 11:02:24 systemName kernel: [ 1490.004412] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=108303, emitted seq=108305
Jan 13 11:02:24 systemName kernel: [ 1490.005060] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2685 thread Xorg:cs0 pid 2765
Jan 13 11:02:24 systemName kernel: [ 1490.005256] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
Jan 13 11:02:24 systemName kernel: [ 1490.010556] amdgpu_cs_ioctl: 40 callbacks suppressed
Jan 13 11:02:24 systemName kernel: [ 1490.010559] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 13 11:02:29 systemName kernel: [ 1494.832181] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 13 11:02:29 systemName kernel: [ 1494.832191] amdgpu 0000:c1:00.0: amdgpu: Failed to disable gfxoff!
Jan 13 11:02:31 systemName kernel: [ 1497.101217] ------------[ cut here ]------------
Jan 13 11:02:31 systemName kernel: [ 1497.101223] WARNING: CPU: 12 PID: 4374 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn314/dcn314_smu.c:159 dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.101557] Modules linked in: xt_MASQUERADE xt_tcpudp xt_mark nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ccm rfcomm cmac algif_hash algif_skcipher af_alg vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nf_tables nfnetlink overlay bnep snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp intel_rapl_msr snd_sof_pci intel_rapl_common snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic snd_compress binfmt_misc btusb ledtrig_audio ac97_bus snd_hda_codec_hdmi mt7921e mt7921_common btrtl snd_pcm_dmaengine edac_mce_amd snd_hda_intel mt76_connac_lib btbcm snd_pci_ps snd_intel_dspcfg kvm_amd mt76 btintel snd_rpl_pci_acp6x snd_intel_sdw_acpi snd_hda_codec snd_acp_pci btmtk hid_sensor_als nls_iso8859_1 snd_pci_acp6x hid_sensor_trigger kvm snd_hda_core mac80211 bluetooth snd_pci_acp5x industrialio_triggered_buffer snd_hwdep snd_rn_pci_acp3x kfifo_buf snd_pcm input_leds irqbypass snd_acp_config ecdh_generic hid_sensor_iio_common cfg80211 ecc rapl snd_timer
Jan 13 11:02:31 systemName kernel: [ 1497.101627] serio_raw snd_soc_acpi joydev industrialio hid_multitouch k10temp snd ccp snd_pci_acp3x libarc4 soundcore mac_hid amd_pmf amd_pmc platform_profile sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua parport_pc ppdev lp parport efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu amdxcp iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm crct10dif_pclmul drm_display_helper crc32_pclmul cec polyval_clmulni rc_core polyval_generic hid_sensor_hub ghash_clmulni_intel drm_kms_helper hid_generic aesni_intel drm nvme cros_ec_lpcs ucsi_acpi crypto_simd i2c_hid_acpi xhci_pci nvme_core cros_ec typec_ucsi cryptd video thunderbolt i2c_piix4 i2c_hid xhci_pci_renesas nvme_common typec wmi hid
Jan 13 11:02:31 systemName kernel: [ 1497.101708] CPU: 12 PID: 4374 Comm: kworker/u32:2 Tainted: G O 6.5.0-1011-oem #12-Ubuntu
Jan 13 11:02:31 systemName kernel: [ 1497.101714] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jan 13 11:02:31 systemName kernel: [ 1497.101724] RIP: 0010:dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.101947] Code: 41 5e 5d 31 d2 31 c9 31 f6 31 ff e9 cd 59 7b cc 89 da 48 c7 c6 78 8f 58 c1 48 c7 c7 b0 bc 17 c1 e8 f8 c9 f2 cb e9 37 ff ff ff <0f> 0b 49 8b 3c 24 b9 80 84 1e 00 44 89 f2 44 89 ee e8 ad 12 df ff
Jan 13 11:02:31 systemName kernel: [ 1497.101950] RSP: 0018:ffff9d964372b8b0 EFLAGS: 00010246
Jan 13 11:02:31 systemName kernel: [ 1497.101953] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101955] RBP: ffff9d964372b8d0 R08: 0000000000000000 R09: 0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101956] R10: 0000000000000000 R11: 0000000000000000 R12: ffff922681097800
Jan 13 11:02:31 systemName kernel: [ 1497.101957] R13: 0000000000000015 R14: 0000000000000500 R15: ffff9226a11a0000
Jan 13 11:02:31 systemName kernel: [ 1497.101959] FS: 0000000000000000(0000) GS:ffff923cc2100000(0000) knlGS:0000000000000000
Jan 13 11:02:31 systemName kernel: [ 1497.101961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 13 11:02:31 systemName kernel: [ 1497.101962] CR2: 0000563662411548 CR3: 0000000195e8a000 CR4: 0000000000750ee0
Jan 13 11:02:31 systemName kernel: [ 1497.101964] PKRU: 55555554
Jan 13 11:02:31 systemName kernel: [ 1497.101965] Call Trace:
Jan 13 11:02:31 systemName kernel: [ 1497.101967] <TASK>
Jan 13 11:02:31 systemName kernel: [ 1497.101971] ? show_regs+0x6d/0x80
Jan 13 11:02:31 systemName kernel: [ 1497.101978] ? __warn+0x89/0x160
Jan 13 11:02:31 systemName kernel: [ 1497.101983] ? dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102201] ? report_bug+0x17e/0x1b0
Jan 13 11:02:31 systemName kernel: [ 1497.102209] ? handle_bug+0x46/0x90
Jan 13 11:02:31 systemName kernel: [ 1497.102214] ? exc_invalid_op+0x18/0x80
Jan 13 11:02:31 systemName kernel: [ 1497.102217] ? asm_exc_invalid_op+0x1b/0x20
Jan 13 11:02:31 systemName kernel: [ 1497.102224] ? dcn314_smu_send_msg_with_param+0x11d/0x1a0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102450] dcn314_smu_set_zstate_support+0x42/0x60 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102662] dcn314_update_clocks+0x473/0x550 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.102869] ? srso_alias_return_thunk+0x5/0x7f
Jan 13 11:02:31 systemName kernel: [ 1497.102874] ? dm_read_reg_func+0x60/0xf0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103090] dcn20_optimize_bandwidth+0x13e/0x290 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103305] dc_commit_state_no_check+0x91d/0xd30 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103509] dc_commit_streams+0x311/0x6c0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103711] dm_suspend+0x202/0x260 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.103923] amdgpu_device_ip_suspend_phase1+0xb2/0x1c0 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104066] amdgpu_device_ip_suspend+0x20/0x80 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104208] amdgpu_device_pre_asic_reset+0xd4/0x490 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104351] amdgpu_device_gpu_recover+0x4ad/0xa70 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104494] amdgpu_job_timedout+0x182/0x270 [amdgpu]
Jan 13 11:02:31 systemName kernel: [ 1497.104689] drm_sched_job_timedout+0x6d/0x120 [gpu_sched]
Jan 13 11:02:31 systemName kernel: [ 1497.104696] process_one_work+0x23d/0x450
Jan 13 11:02:31 systemName kernel: [ 1497.104702] worker_thread+0x50/0x3f0
Jan 13 11:02:31 systemName kernel: [ 1497.104704] ? srso_alias_return_thunk+0x5/0x7f
Jan 13 11:02:31 systemName kernel: [ 1497.104706] ? __pfx_worker_thread+0x10/0x10
Jan 13 11:02:31 systemName kernel: [ 1497.104708] kthread+0xef/0x120
Jan 13 11:02:31 systemName kernel: [ 1497.104712] ? __pfx_kthread+0x10/0x10
Jan 13 11:02:31 systemName kernel: [ 1497.104715] ret_from_fork+0x44/0x70
Jan 13 11:02:31 systemName kernel: [ 1497.104720] ? __pfx_kthread+0x10/0x10
Jan 13 11:02:31 systemName kernel: [ 1497.104722] ret_from_fork_asm+0x1b/0x30
Jan 13 11:02:31 systemName kernel: [ 1497.104728] </TASK>
Jan 13 11:02:31 systemName kernel: [ 1497.104729] ---[ end trace 0000000000000000 ]---
Jan 13 11:02:31 systemName kernel: [ 1497.200178] VirtualBoxVM[3941]: segfault at 0 ip 00007f53472f14e0 sp 00007fff5bd8a370 error 4
Jan 13 11:02:31 systemName kernel: [ 1497.200189] VirtualBoxVM[4216]: segfault at 0 ip 00007f1478af14e0 sp 00007ffec11e1750 error 4
Jan 13 11:02:31 systemName kernel: [ 1497.200190] in libQt5Gui.so.5.15.3[7f53472e2000+4df000] likely on CPU 0 (core 0, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200195] in libQt5Gui.so.5.15.3[7f1478ae2000+4df000]
Jan 13 11:02:31 systemName kernel: [ 1497.200196] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.200198] likely on CPU 4 (core 2, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200200] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.200357] VirtualBoxVM[4013]: segfault at 0 ip 00007f69414f14e0 sp 00007fffeda92480 error 4 in libQt5Gui.so.5.15.3[7f69414e2000+4df000] likely on CPU 5 (core 2, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200368] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.200459] VirtualBoxVM[4078]: segfault at 0 ip 00007f0b572f14e0 sp 00007ffd24617510 error 4 in libQt5Gui.so.5.15.3[7f0b572e2000+4df000] likely on CPU 4 (core 2, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.200467] Code: 89 e7 48 8d 35 b9 cb 4d 00 0f 11 44 24 08 48 89 04 24 48 8d 05 fc 0d 4d 00 48 89 44 24 18 31 c0 e8 05 7f ff ff e9 60 70 35 00 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 04 25 00 00 00 00 0f 0b 48 8b
Jan 13 11:02:31 systemName kernel: [ 1497.206206] VirtualBoxVM[4146]: segfault at 0 ip 00007fda09aeeb08 sp 00007ffedf2d67b8 error 4 in libQt5Gui.so.5.15.3[7fda09ae2000+4df000] likely on CPU 15 (core 7, socket 0)
Jan 13 11:02:31 systemName kernel: [ 1497.206222] Code: 4d 00 48 89 44 24 18 31 c0 e8 f4 a8 ff ff 48 8b 44 24 28 64 48 2b 04 25 28 00 00 00 74 05 e8 ef 88 ff ff 31 c0 48 83 c4 38 c3 <48> 8b 04 25 00 00 00 00 0f 0b 48 8b 05 37 36 4d 00 66 0f ef c0 48
Jan 13 11:02:36 systemName kernel: [ 1501.829121] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1501.829396] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1501.957658] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1501.957783] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.086064] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.086183] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.214428] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.214546] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.342772] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.342993] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.471236] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.471355] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.599593] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.599709] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1502.727935] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1502.728053] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:37 systemName kernel: [ 1502.856283] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:37 systemName kernel: [ 1502.856399] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:37 systemName kernel: [ 1502.857891] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
Jan 13 11:02:41 systemName kernel: [ 1507.749339] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 13 11:02:41 systemName kernel: [ 1507.749349] amdgpu 0000:c1:00.0: amdgpu: Mode2 reset failed!
Jan 13 11:02:41 systemName kernel: [ 1507.749353] amdgpu 0000:c1:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:c1:00.0
Jan 13 11:02:41 systemName kernel: [ 1507.749417] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 13 11:02:41 systemName kernel: [ 1507.749930] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
Jan 13 11:02:41 systemName kernel: [ 1507.750038] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
Jan 13 11:02:46 systemName kernel: [ 1512.673746] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Jan 13 11:02:46 systemName kernel: [ 1512.673753] amdgpu 0000:c1:00.0: amdgpu: Failed to SetDriverDramAddr!
Jan 13 11:02:46 systemName kernel: [ 1512.673755] amdgpu 0000:c1:00.0: amdgpu: Failed to setup smc hw!
Jan 13 11:02:46 systemName kernel: [ 1512.673758] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
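Because the log is long and repetitive, it can help to condense the `[drm:...] *ERROR*` lines into a short summary before comparing crashes. This is a rough sketch in Python, assuming the syslog line format shown above. The sample input is copied from this log, and the `summarize` helper is a hypothetical name, not an existing tool.

```python
import re
from collections import Counter

# A few lines copied verbatim from the log above, used as sample input.
SAMPLE = """\
Jan 13 11:02:24 systemName kernel: [ 1490.004412] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=108303, emitted seq=108305
Jan 13 11:02:36 systemName kernel: [ 1501.829121] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:36 systemName kernel: [ 1501.829396] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Jan 13 11:02:36 systemName kernel: [ 1501.957658] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Jan 13 11:02:41 systemName kernel: [ 1507.749349] amdgpu 0000:c1:00.0: amdgpu: Mode2 reset failed!
"""

# Matches "[drm:<function> [amdgpu]] *ERROR* <message>".
DRM_ERROR = re.compile(r"\[drm:(?P<func>\S+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)$")
# Matches the kernel uptime stamp, e.g. "[ 1490.004412]".
UPTIME = re.compile(r"\[\s*(?P<secs>\d+\.\d+)\]")


def summarize(log_text):
    """Count drm *ERROR* lines per (function, message) and record the
    uptime in seconds at which each one was first seen."""
    counts = Counter()
    first_seen = {}
    for line in log_text.splitlines():
        match = DRM_ERROR.search(line)
        if not match:
            continue  # not a drm *ERROR* line
        key = (match["func"], match["msg"])
        counts[key] += 1
        stamp = UPTIME.search(line)
        if stamp and key not in first_seen:
            first_seen[key] = float(stamp["secs"])
    return counts, first_seen


counts, first_seen = summarize(SAMPLE)
for (func, msg), n in counts.items():
    print(f"{first_seen[(func, msg)]:>12.6f}  x{n}  {func}: {msg}")
```

To run it on a full log, the text of a saved kernel log can be passed to `summarize` instead of `SAMPLE`. Lines that do not carry the `*ERROR*` tag, such as "Mode2 reset failed!", are deliberately skipped by this pattern.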
As I said, it is (mostly) triggerable with VMware Workstation 17 and VirtualBox (not installed at the same time, before you ask ;-)). But other GPU-related applications can trigger it as well.
To verify, I bought two 64 GB bundles and two 96 GB bundles, and ran Memtest86+ multiple times to ensure the RAM is not faulty. Do you have any suggestions on what I should do?
Thank you very much!
marpie