Suspend/resume crashing amdgpu and taking ~20s to recover

Starting a few weeks ago, when I resume from suspend, the screen powers on and shows the desktop (not the lock screen), then everything freezes for 10-30 seconds. After that, the lock screen appears and works as expected.

System: Framework Laptop 16 / firmware 0.0.3.4
Fedora 41 / Linux Kernel 6.13.7 (same behaviour on 6.14)

journalctl after resume first shows this:

Apr 01 14:27:44 joshua kernel: amdgpu 0000:03:00.0: amdgpu: MES FW versoin must be larger than 0x63 to support limit single process feature.
Apr 01 14:27:44 joshua kernel: amdgpu 0000:03:00.0: amdgpu: failed to change_config.
Apr 01 14:27:44 joshua kernel: amdgpu 0000:03:00.0: amdgpu: resume of IP block <mes_v11_0> failed -22
Apr 01 14:27:44 joshua kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
Apr 01 14:27:44 joshua kernel: amdgpu 0000:03:00.0: PM: dpm_run_callback(): pci_pm_resume returns -22
Apr 01 14:27:44 joshua kernel: amdgpu 0000:03:00.0: PM: failed to resume async: error -22
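
For anyone reading along: the -22 in these lines is a negated Linux errno value. Here is a minimal Python sketch that pulls the code out of the lines above and looks up its symbolic name. The regular expression and the `decode_errno` helper are my own illustration, not part of any amdgpu or journalctl tooling.

```python
import errno
import os
import re

# Lines copied from the journal excerpt above (timestamp and hostname trimmed).
LOG_LINES = [
    "amdgpu 0000:03:00.0: amdgpu: resume of IP block <mes_v11_0> failed -22",
    "amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).",
    "amdgpu 0000:03:00.0: PM: dpm_run_callback(): pci_pm_resume returns -22",
    "amdgpu 0000:03:00.0: PM: failed to resume async: error -22",
]

# Hypothetical pattern: a negative integer right after "failed", "returns" or "error".
CODE_RE = re.compile(r"(?:failed|returns|error)\s*\(?-(\d+)\)?")


def decode_errno(line):
    """Return (symbolic name, description) for the first error code in line, or None."""
    match = CODE_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


for line in LOG_LINES:
    print(decode_errno(line), "<-", line)
```

On Linux, 22 is EINVAL ("invalid argument"), so all four lines report the same error being passed up from the mes_v11_0 resume to the PM core.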

A few seconds later, this follows:

Apr 01 14:27:51 joshua kernel: ------------[ cut here ]------------
Apr 01 14:27:51 joshua kernel: WARNING: CPU: 13 PID: 91770 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:3073 dm_suspend+0x274/0x2e0 [amdgpu]
Apr 01 14:27:51 joshua kernel: Modules linked in: uinput uhid rfcomm snd_seq_dummy snd_hrtimer nls_utf8 cifs cifs_arc4 nls_ucs2_utils cifs_md4 dns_resolver netfs nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep sunrpc binfmt_misc snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi vfat fat squashfs snd_sof_amd_acp70 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_pci_ps leds_cros_ec cros_usbpd_charger snd_soc_acpi_amd_match led_class_multicolor cros_charge_control gpio_cros_ec cros_ec_hwmon cros_usbpd_logger cros_ec_sysfs cros_ec_chardev cros_usbpd_notify snd_amd_sdw_acpi soundwire_amd soundwire_generic_allocation cros_ec_dev soundwire_bus mt7921e intel_rapl_msr amd_atl snd_soc_sdca mt7921_common
Apr 01 14:27:51 joshua kernel: intel_rapl_common snd_soc_core mt792x_lib snd_usb_audio btusb edac_mce_amd snd_hda_intel btrtl mt76_connac_lib snd_compress snd_intel_dspcfg btintel ac97_bus snd_intel_sdw_acpi snd_pcm_dmaengine mt76 btbcm snd_rpl_pci_acp6x snd_usbmidi_lib snd_hda_codec snd_acp_pci btmtk snd_ump cros_ec_lpcs kvm_amd snd_acp_legacy_common spd5118 cros_ec bluetooth snd_rawmidi snd_hda_core snd_pci_acp6x kvm mac80211 mc snd_hwdep hid_sensor_als hid_sensor_trigger snd_pci_acp5x hid_sensor_iio_common industrialio_triggered_buffer kfifo_buf snd_seq snd_rn_pci_acp3x rapl libarc4 snd_acp_config industrialio snd_seq_device wmi_bmof pcspkr snd_soc_acpi i2c_piix4 thunderbolt snd_pcm cfg80211 k10temp snd_pci_acp3x i2c_smbus snd_timer amd_pmf snd amdtee rfkill soundcore amd_sfh tee platform_profile joydev amd_pmc loop nfnetlink zram lz4hc_compress lz4_compress dm_crypt typec_displayport amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper nvme drm_panel_backlight_quirks drm_buddy crct10dif_pclmul
Apr 01 14:27:51 joshua kernel: nvme_core crc32_pclmul drm_display_helper crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel video ucsi_acpi hid_sensor_hub sha512_ssse3 hid_multitouch sha256_ssse3 sha1_ssse3 typec_ucsi cec typec sp5100_tco nvme_auth wmi i2c_hid_acpi i2c_hid fuse i2c_dev
Apr 01 14:27:51 joshua kernel: CPU: 13 UID: 0 PID: 91770 Comm: kworker/13:1 Tainted: G W 6.13.7-200.fc41.x86_64 #1
Apr 01 14:27:51 joshua kernel: Tainted: [W]=WARN
Apr 01 14:27:51 joshua kernel: Hardware name: Framework Laptop 16 (AMD Ryzen 7040 Series)/FRANMZCP07, BIOS 03.04 07/09/2024
Apr 01 14:27:51 joshua kernel: Workqueue: pm pm_runtime_work
Apr 01 14:27:51 joshua kernel: RIP: 0010:dm_suspend+0x274/0x2e0 [amdgpu]
Apr 01 14:27:51 joshua kernel: Code: 08 31 04 00 e9 66 fe ff ff 41 0f b6 84 24 a0 02 00 00 48 8d 74 24 10 4c 89 ef 4c 89 64 24 10 88 44 24 18 e8 ee 3a 4c 00 eb a2 <0f> 0b e9 d7 fd ff ff 48 c7 c7 20 e8 f7 c0 e8 39 86 0f f2 e9 77 ff
Apr 01 14:27:51 joshua kernel: RSP: 0018:ffffa640cace3c48 EFLAGS: 00010286
Apr 01 14:27:51 joshua kernel: RAX: 0000000000000000 RBX: ffff931f9c000000 RCX: 000000000000000d
Apr 01 14:27:51 joshua kernel: RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff931f9c000000
Apr 01 14:27:51 joshua kernel: RBP: ffff931f9c0455b8 R08: 0000000000000000 R09: 0000000000000001
Apr 01 14:27:51 joshua kernel: R10: 0000000000380002 R11: 00000000ffffffff R12: 0000000000000005
Apr 01 14:27:51 joshua kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff931fec533040
Apr 01 14:27:51 joshua kernel: FS: 0000000000000000(0000) GS:ffff9326de880000(0000) knlGS:0000000000000000
Apr 01 14:27:51 joshua kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 01 14:27:51 joshua kernel: CR2: 00007f0441962008 CR3: 00000001bb82c000 CR4: 0000000000f50ef0
Apr 01 14:27:51 joshua kernel: PKRU: 55555554
Apr 01 14:27:51 joshua kernel: Call Trace:
Apr 01 14:27:51 joshua kernel:
Apr 01 14:27:51 joshua kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Apr 01 14:27:51 joshua kernel: ? show_trace_log_lvl+0x255/0x2f0
Apr 01 14:27:51 joshua kernel: ? show_trace_log_lvl+0x255/0x2f0
Apr 01 14:27:51 joshua kernel: ? amdgpu_ip_block_suspend+0x24/0x40 [amdgpu]
Apr 01 14:27:51 joshua kernel: ? dm_suspend+0x274/0x2e0 [amdgpu]
Apr 01 14:27:51 joshua kernel: ? __warn.cold+0x93/0xfa
Apr 01 14:27:51 joshua kernel: ? dm_suspend+0x274/0x2e0 [amdgpu]
Apr 01 14:27:51 joshua kernel: ? report_bug+0xff/0x140
Apr 01 14:27:51 joshua kernel: ? handle_bug+0x58/0x90
Apr 01 14:27:51 joshua kernel: ? exc_invalid_op+0x17/0x70
Apr 01 14:27:51 joshua kernel: ? asm_exc_invalid_op+0x1a/0x20
Apr 01 14:27:51 joshua kernel: ? dm_suspend+0x274/0x2e0 [amdgpu]
Apr 01 14:27:51 joshua kernel: ? dm_suspend+0x3c/0x2e0 [amdgpu]
Apr 01 14:27:51 joshua kernel: ? smu_cmn_send_smc_msg_with_param+0x1ec/0x500 [amdgpu]
Apr 01 14:27:51 joshua kernel: amdgpu_ip_block_suspend+0x24/0x40 [amdgpu]
Apr 01 14:27:51 joshua kernel: amdgpu_device_ip_suspend_phase1+0x89/0xe0 [amdgpu]
Apr 01 14:27:51 joshua kernel: amdgpu_device_suspend+0x74/0x170 [amdgpu]
Apr 01 14:27:51 joshua kernel: amdgpu_pmops_runtime_suspend+0xb9/0x1a0 [amdgpu]
Apr 01 14:27:51 joshua kernel: pci_pm_runtime_suspend+0x67/0x1a0
Apr 01 14:27:51 joshua kernel: ? __pfx_pci_pm_runtime_suspend+0x10/0x10
Apr 01 14:27:51 joshua kernel: __rpm_callback+0x41/0x170
Apr 01 14:27:51 joshua kernel: ? __pfx_pci_pm_runtime_suspend+0x10/0x10
Apr 01 14:27:51 joshua kernel: rpm_callback+0x55/0x60
Apr 01 14:27:51 joshua kernel: ? __pfx_pci_pm_runtime_suspend+0x10/0x10
Apr 01 14:27:51 joshua kernel: rpm_suspend+0xe6/0x5f0
Apr 01 14:27:51 joshua kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Apr 01 14:27:51 joshua kernel: ? finish_task_switch.isra.0+0x99/0x2c0
Apr 01 14:27:51 joshua kernel: pm_runtime_work+0x98/0xb0
Apr 01 14:27:51 joshua kernel: process_one_work+0x176/0x330
Apr 01 14:27:51 joshua kernel: worker_thread+0x252/0x390
Apr 01 14:27:51 joshua kernel: ? __pfx_worker_thread+0x10/0x10
Apr 01 14:27:51 joshua kernel: kthread+0xcf/0x100
Apr 01 14:27:51 joshua kernel: ? __pfx_kthread+0x10/0x10
Apr 01 14:27:51 joshua kernel: ret_from_fork+0x31/0x50
Apr 01 14:27:51 joshua kernel: ? __pfx_kthread+0x10/0x10
Apr 01 14:27:51 joshua kernel: ret_from_fork_asm+0x1a/0x30
Apr 01 14:27:51 joshua kernel:
Apr 01 14:27:51 joshua kernel: ---[ end trace 0000000000000000 ]---
Apr 01 14:27:51 joshua kernel: ------------[ cut here ]------------

Several additional kernel traces follow, then this:

Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Apr 01 14:28:15 joshua kernel: [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x81fc000000 for PSP TMR
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x00000035, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x00525c00 (82.92.0)
Apr 01 14:28:15 joshua kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Apr 01 14:28:16 joshua kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!

The system then became responsive again at 14:28:30.
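
To put the timing in one place, here is a small Python sketch that turns the timestamps from the excerpts into offsets from the first error. The event labels are my own shorthand, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Timestamps taken from the journal excerpts and the sentence above (same day assumed).
EVENTS = [
    ("14:27:44", "resume of mes_v11_0 fails with -22"),
    ("14:27:51", "WARNING in dm_suspend"),
    ("14:28:15", "PSP and SMU start resuming"),
    ("14:28:16", "SMU resumed successfully"),
    ("14:28:30", "system responsive again"),
]


def seconds_since_first(events):
    """Return (offset in seconds from the first event, label) for each event."""
    start = datetime.strptime(events[0][0], "%H:%M:%S")
    return [
        (int((datetime.strptime(stamp, "%H:%M:%S") - start).total_seconds()), label)
        for stamp, label in events
    ]


for offset, label in seconds_since_first(EVENTS):
    print(f"+{offset:2d}s  {label}")
```

The offsets come out as 0, 7, 31, 32 and 46 seconds, so in this particular resume about 46 seconds passed between the first amdgpu error and the system responding again, with a further 14 seconds of waiting after the SMU reported a successful resume.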

I first suspected the 6.14 kernels (which I’ve been trying since rc1) as the culprit, but the same problem occurs with 6.13.7. Firmware is up to date from the linux-firmware git repo.

@Mario_Limonciello I thought you might have ideas on how to best investigate this issue? Should I report it somewhere else? Thanks for any help!

This could be related to a dGPU suspend issue introduced in 6.13.6, discussed in this thread: "Graphics card not available" (Framework Laptop 16 / Linux, Framework Community).

If you need the device working now, you can roll back to 6.13.5 which multiple users (myself included) have reported as being the last working version.
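
If it helps, here is a rough Python sketch for checking whether a given kernel release string is newer than 6.13.5. The "last good" version is only what users in this thread reported, and the helper names are made up for illustration. For the running kernel you could pass in `platform.release()` from the standard library.

```python
import re

# Per the reports in this thread, not an official statement from AMD or Framework.
LAST_REPORTED_GOOD = (6, 13, 5)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string such as '6.13.7-200.fc41.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def newer_than_last_good(release):
    """True if the release is newer than the last version reported as working."""
    return parse_kernel_release(release) > LAST_REPORTED_GOOD


# The second string is the release shown in the kernel log earlier in the thread.
for release in ("6.13.5", "6.13.7-200.fc41.x86_64"):
    print(release, "-> newer than 6.13.5:", newer_than_last_good(release))
```

Tuples compare element by element, which avoids the usual string-comparison trap where "6.13.10" would sort before "6.13.5".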

Thanks a lot. I read the forum regularly but not every post, so I had not noticed that people in that other thread were getting the same errors.
You are correct that it might have the same cause. Only the symptoms here are different: the dGPU does not actually disappear (though something is clearly wrong for a while after suspend/resume).

When I first encountered the issue I didn’t find the thread either, despite being active and having searched for it, and I created a duplicate thread (which has since been merged into it), so there’s no shade being thrown from my direction. I’ve noticed that the issue can be a bit inconsistent: sometimes the dGPU will disappear and sometimes it won’t. My situation didn’t line up 100% with what was happening in that thread either, but their fix did fix my issue as well, so I’d say it’s probably worth giving 6.13.5 a shot on your device.

I have downgraded to 6.13.5. There is no more crash in the system log, but the freeze still happens.

Try this series

And if it helps you can leave comments here: https://gitlab.freedesktop.org/drm/amd/-/issues/4083

Unfortunately, my problem is not fixed by this patch :(

There is no more crash in amdgpu, but the delay is still there. So that crash was probably unrelated to the freezing problem. I’m playing around more trying to find the actual problem …

I’ve seen similar issues when trying to resume from suspend. To be entirely honest, I never found a fix and just stopped using sleep. I only recently started giving it another shot now that I’m on Fedora and it supposedly “works out of the box”. If you narrow the issue down any further I’d be interested.

I’ll do a clean reinstall when Fedora 42 is released. Then I can see if the problem is also present with a fresh installation and file a better bug report.

It might also be caused by some customisation I did for my local system - I’m using a few special applications so maybe something I did while trying to get those to run is the culprit …