[Arch/amdgpu] Screen freeze on zen kernel / new warnings on vanilla kernel - 6.12.5

Edit (disclaimer): it seems the complete freeze happens only on the -zen family of kernels, but part of this post (the other warning printed by the kernel) still applies, so I'm leaving it up for better visibility, especially given that -zen kernels are quite popular.

Which Linux distro are you using?
Arch

If rolling release, last date updated?
Just a moment ago.

Which kernel are you using?
6.12.5-zen

Which BIOS version are you using?
3.05

Which Framework Laptop 16 model are you using? (AMD Ryzen™ 7040 Series)
Ryzen 7 7840HS

I know that issues with Panel Replay / PSR have been ongoing for some time already, but this one seems more problematic, and I want to give everyone a heads-up before upgrading to the newest kernel. Instead of occasional refresh issues, this time the screen freezes as soon as Plasma loads (with Adaptive Sync enabled in Plasma settings; I guess it happens when Plasma triggers it).

The problem goes away if I set amdgpu.dcdebugmask=0x610 (which effectively disables PSR and Panel Replay; I guess 0x400 alone would be sufficient, though).
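For reference, the mask is a bitfield, so a value can be decoded bit by bit. A minimal Python sketch, assuming the bit values from the DC_DEBUG_MASK enum in the kernel's amd_shared.h (0x10 = DC_DISABLE_PSR, 0x200 = DC_DISABLE_PSR_SU, 0x400 = DC_DISABLE_REPLAY). These values were not posted in this thread, so check them against your kernel version:

```python
"""Decode an amdgpu.dcdebugmask value into named flags (sketch)."""
from __future__ import annotations

# ASSUMED bit values, taken from enum DC_DEBUG_MASK in the kernel's
# drivers/gpu/drm/amd/include/amd_shared.h. Only the PSR / Panel Replay
# related bits are listed; verify them against your kernel version.
PSR_REPLAY_FLAGS = {
    0x10: "DC_DISABLE_PSR",
    0x200: "DC_DISABLE_PSR_SU",
    0x400: "DC_DISABLE_REPLAY",
}


def decode_dcdebugmask(mask: int) -> tuple[list[str], int]:
    """Return (names of listed flags set in mask, leftover bits not in the table)."""
    names = [name for bit, name in sorted(PSR_REPLAY_FLAGS.items()) if mask & bit]
    known = 0
    for bit in PSR_REPLAY_FLAGS:
        known |= bit
    return names, mask & ~known


if __name__ == "__main__":
    for value in (0x610, 0x400, 0x10):
        names, rest = decode_dcdebugmask(value)
        print(f"{value:#x}: {names} (other bits: {rest:#x})")
```

Under those assumptions, 0x610 decodes to PSR, PSR-SU and Panel Replay all disabled, while 0x400 disables only Panel Replay, which is why 0x400 alone may be enough if Panel Replay is the culprit.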

The following kernel warnings show up (the first appears even with PR disabled; the second is probably related to the screen hang and is printed only when PR is enabled):

  • One that always shows up (on both vanilla and zen kernels, but it doesn't seem to cause any negative effects when PR is disabled):
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 12 at drivers/gpu/drm/amd/amdgpu/../display/dc/dpp/dcn30/dcn30_dpp.c:534 dpp3_deferred_update+0x101/0x330 [amdgpu]
kernel: Modules linked in: snd_seq_dummy rfcomm snd_hrtimer snd_seq snd_seq_device typec_displayport cmac algif_hash algif_skcipher af_alg bnep vfat fat ch341 snd_sof_amd_acp70 snd_sof_amd_acp63 snd_soc_acpi
kernel:  hid_sensor_als snd_pcm_dmaengine hid_sensor_trigger snd_hda_codec snd_rpl_pci_acp6x industrialio_triggered_buffer iwlwifi snd_acp_pci kvm_amd kfifo_buf snd_hda_core snd_acp_legacy_common hid_sensor_
kernel:  aesni_intel drm_buddy nvme gf128mul drm_display_helper crypto_simd cryptd nvme_core cec ccp video crc16 nvme_auth wmi
kernel: CPU: 2 UID: 0 PID: 12 Comm: kworker/u64:1 Not tainted 6.12.5-zen1-1-zen #1 21e0c68887ee4451405af53cfb8d1434c391cada
kernel: Hardware name: Framework Laptop 16 (AMD Ryzen 7040 Series)/FRANMZCP07, BIOS 03.05 11/13/2024
kernel: Workqueue: events_unbound commit_work
kernel: RIP: 0010:dpp3_deferred_update+0x101/0x330 [amdgpu]
kernel: Code: 83 78 e1 00 00 0f b6 90 a8 02 00 00 48 8b 83 70 e1 00 00 8b b0 78 04 00 00 e8 db ad 12 00 8b 74 24 04 85 f6 0f 84 5d 01 00 00 <0f> 0b 0f b6 83 48 96 00 00 83 e0 f7 88 83 48 96 00 00 a8 01 0f 84
kernel: RSP: 0018:ffffbd4300187a40 EFLAGS: 00010202
kernel: RAX: 0000000000000066 RBX: ffff9d9bd21e0000 RCX: 0000000000000004
kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff9d9bd2080000
kernel: RBP: ffff9d9be3f00000 R08: ffffbd4300187a44 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: ffffbd43001879ec R12: 0000000000000000
kernel: R13: ffff9d9be3f00308 R14: ffff9d9be3f05dc8 R15: ffff9d9bcd866000
kernel: FS:  0000000000000000(0000) GS:ffff9da925b00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000076d2d5ec2bb0 CR3: 0000000dbf822000 CR4: 0000000000f50ef0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? dpp3_deferred_update+0x101/0x330 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? __warn.cold+0x93/0xed
kernel:  ? dpp3_deferred_update+0x101/0x330 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? report_bug+0xe7/0x210
kernel:  ? handle_bug+0x58/0x90
kernel:  ? exc_invalid_op+0x19/0xc0
kernel:  ? asm_exc_invalid_op+0x1a/0x20
kernel:  ? dpp3_deferred_update+0x101/0x330 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  dc_post_update_surfaces_to_stream+0x24f/0x470 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  amdgpu_dm_commit_planes+0x1379/0x1fe0 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  amdgpu_dm_atomic_commit_tail+0x1312/0x3170 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? dma_fence_default_wait+0x8b/0x240
kernel:  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? kvfree_call_rcu+0x26e/0x350
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? wait_for_completion_timeout+0x130/0x180
kernel:  commit_tail+0x91/0x130
kernel:  process_one_work+0x18f/0x350
kernel:  worker_thread+0x24c/0x380
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xcf/0x100
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x31/0x50
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1a/0x30
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---

  • One that shows up after the above one when PR is enabled on the zen kernel:
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 10 PID: 215 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_replay.c:89 dmub_replay_enable+0xfe/0x170 [amdgpu]
kernel: Modules linked in: ccm snd_seq_dummy rfcomm snd_hrtimer snd_seq snd_seq_device typec_displayport cmac algif_hash algif_skcipher af_alg bnep vfat fat ch341 snd_sof_amd_acp70 snd_sof_amd_acp63 snd_soc_
kernel:  cros_ec_chardev kvm_amd cros_usbpd_notify cros_ec_debugfs snd_hda_core snd_acp_legacy_common mousedev cros_ec_sysfs cros_usbpd_logger gpio_cros_ec led_class_multicolor btmtk hid_sensor_iio_common bl
kernel:  aesni_intel drm_buddy nvme gf128mul drm_display_helper crypto_simd nvme_core cryptd cec ccp video crc16 nvme_auth wmi
kernel: CPU: 10 UID: 0 PID: 215 Comm: kworker/u64:3 Tainted: G        W          6.12.5-zen1-1-zen #1 21e0c68887ee4451405af53cfb8d1434c391cada
kernel: Tainted: [W]=WARN
kernel: Hardware name: Framework Laptop 16 (AMD Ryzen 7040 Series)/FRANMZCP07, BIOS 03.05 11/13/2024
kernel: Workqueue: events_unbound commit_work
kernel: RIP: 0010:dmub_replay_enable+0xfe/0x170 [amdgpu]
kernel: Code: 00 00 00 3d ff 00 00 00 74 c9 45 84 ff 74 69 85 c0 75 5a bf ac c4 20 00 41 83 c6 01 e8 ab a8 2c e8 41 81 fe e9 03 00 00 75 a7 <0f> 0b 48 8b 44 24 48 65 48 2b 04 25 28 00 00 00 75 57 48 83 c4 50
kernel: RSP: 0018:ffffb92800937990 EFLAGS: 00010246
kernel: RAX: 00000021e9c5343e RBX: 0000000000000002 RCX: 000000000000000a
kernel: RDX: 00000000000cb01c RSI: 00000000000ca4d9 RDI: 00000021e9b88422
kernel: RBP: 0000000000000000 R08: 0000000000000002 R09: ffff8db81255a880
kernel: R10: 000000000000000d R11: 0000000000000001 R12: ffff8db8127f2b20
kernel: R13: ffffb92800937994 R14: 00000000000003e9 R15: 0000000000000001
kernel: FS:  0000000000000000(0000) GS:ffff8dc565f00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000062de55fdce38 CR3: 00000007ef022000 CR4: 0000000000f50ef0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? dmub_replay_enable+0xfe/0x170 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? __warn.cold+0x93/0xed
kernel:  ? dmub_replay_enable+0xfe/0x170 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? report_bug+0xe7/0x210
kernel:  ? handle_bug+0x58/0x90
kernel:  ? exc_invalid_op+0x19/0xc0
kernel:  ? asm_exc_invalid_op+0x1a/0x20
kernel:  ? dmub_replay_enable+0xfe/0x170 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  edp_set_replay_allow_active+0x149/0x1a0 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  amdgpu_dm_replay_enable+0xc1/0xf0 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  amdgpu_dm_commit_planes+0x1fd5/0x1fe0 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? __entry_text_end+0x101e46/0x101e49
kernel:  amdgpu_dm_atomic_commit_tail+0x1312/0x3170 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? __pfx_amdgpu_crtc_get_scanout_position+0x10/0x10 [amdgpu 33bcc48fbf6361168e552fb71591e6476021068b]
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x100/0x3b0
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? dma_fence_default_wait+0x8b/0x240
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? wait_for_completion_timeout+0x130/0x180
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  commit_tail+0x91/0x130
kernel:  process_one_work+0x18f/0x350
kernel:  worker_thread+0x24c/0x380
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xcf/0x100
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x31/0x50
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1a/0x30
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---
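To tell which of these two warnings a given boot produced without reading the full traces, the WARNING header lines can be matched by function name. A minimal Python sketch; the regex is written against the header format shown in the logs in this thread, and the labels simply restate what is described above:

```python
"""Classify amdgpu WARNING lines from a kernel log (sketch)."""
from __future__ import annotations

import re
import sys

# Matches header lines like:
# kernel: WARNING: CPU: 2 PID: 12 at drivers/.../dcn30_dpp.c:534 dpp3_deferred_update+0x101/0x330 [amdgpu]
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) (?P<func>[\w.]+)\+"
)

# The two warnings discussed in this thread, keyed by function name.
KNOWN = {
    "dpp3_deferred_update": "DPP deferred update warning (shows up with or without PR)",
    "dmub_replay_enable": "DMUB Panel Replay enable warning (shows up with PR enabled)",
}


def find_warnings(log: str) -> list[tuple[str, str, int]]:
    """Return (function, source file, line number) for every WARNING header in log."""
    return [
        (m["func"], m["file"], int(m["line"])) for m in WARNING_RE.finditer(log)
    ]


if __name__ == "__main__":
    # Usage: journalctl -k -b > kernel.log; python3 this_script.py kernel.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            text = fh.read()
        for func, path, line in find_warnings(text):
            label = KNOWN.get(func, "other warning")
            print(f"{func} ({path.rsplit('/', 1)[-1]}:{line}): {label}")
```

The same pattern also matches the dmesg-style lines with a [timestamp] prefix instead of "kernel:".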

Edit: forgot to mention: it is recoverable with cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover, but I haven't used the laptop afterwards to assess how stable it is.
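The dri/1 index in that path may differ between machines (for example when more than one GPU is present), so it can be worth checking which debugfs directory belongs to which device first. A minimal Python sketch, assuming each /sys/kernel/debug/dri/N/ directory has a "name" file whose first word is the driver name (e.g. "amdgpu dev=..."). It needs root to read debugfs and only lists candidate paths; it does not trigger a reset:

```python
"""List amdgpu devices under DRM debugfs (sketch; needs root to read debugfs)."""
from __future__ import annotations

from pathlib import Path

DEBUGFS_DRI = Path("/sys/kernel/debug/dri")


def amdgpu_recover_paths(base: Path = DEBUGFS_DRI) -> list[tuple[Path, str]]:
    """Return (amdgpu_gpu_recover path, contents of the "name" file) per amdgpu entry.

    ASSUMPTION: each debugfs dri/<N>/ directory has a "name" file whose first
    word is the driver name. Nothing here triggers a GPU reset.
    """
    results: list[tuple[Path, str]] = []
    if not base.is_dir():
        return results
    for entry in sorted(base.iterdir(), key=lambda p: p.name):
        name_file = entry / "name"
        recover = entry / "amdgpu_gpu_recover"
        if not name_file.is_file() or not recover.exists():
            continue
        name = name_file.read_text(errors="replace").strip()
        if name.split(" ", 1)[0] == "amdgpu":
            results.append((recover, name))
    return results


if __name__ == "__main__":
    try:
        found = amdgpu_recover_paths()
    except PermissionError:
        found = []
        print("debugfs is not readable; run as root")
    for path, name in found:
        print(f"{path}  <- {name}")
```

The PCI address in the "name" line should make it possible to match each entry against lspci output before picking one.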

Edit2: I moved from the 6.12.4 kernel, which had no major issues apart from occasional panel refresh hangs; disabling Adaptive Sync in Plasma settings was enough to make it work, so I never went down that rabbit hole. More packages were updated in this batch, including xorg-server. I am in fact using the -zen kernel; I'll check in a moment whether vanilla is affected as well and drop another edit at the bottom of the post.


I believe this may be related to https://gitlab.freedesktop.org/drm/amd/-/issues/3796. You could try reverting the bisected commit there as well.

I can reproduce this and similar lockups somewhat reliably on Fedora 41 on kernels from 6.11.4 onward (the earliest one available in GRUB). I just disabled Adaptive Sync and hopefully that resolves it; I'll find out and report back.

Same problem here on a Framework 13 AMD.

Linux kolme 6.11.11-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu, 05 Dec 2024 16:26:44 +0000 x86_64 GNU/Linux

It has been appearing sporadically since around the end of October 2024.

After boot, I see the dmesg output below. The computer works fine until it suddenly becomes unresponsive. At that point I usually see kworker events_unbound threads using 100% CPU, and only a reboot gets the machine back to a usable state.

[   60.543139] ------------[ cut here ]------------
[   60.543144] WARNING: CPU: 12 PID: 439 at drivers/gpu/drm/amd/amdgpu/../display/dc/dpp/dcn30/dcn30_dpp.c:534 dpp3_deferred_update+0x101/0x330 [amdgpu]
[   60.543599] Modules linked in: snd_seq_dummy rfcomm snd_hrtimer snd_seq snd_seq_device qrtr cmac algif_hash algif_skcipher af_alg bnep vfat fat snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp amd_atl intel_rapl_msr snd_sof_pci intel_rapl_common snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_pci_ps snd_hda_codec_realtek snd_amd_sdw_acpi soundwire_amd snd_hda_codec_generic mt7921e soundwire_generic_allocation snd_hda_scodec_component mt7921_common soundwire_bus mt792x_lib snd_hda_codec_hdmi mt76_connac_lib snd_soc_core snd_hda_intel mt76 snd_intel_dspcfg snd_compress uvcvideo cros_usbpd_charger leds_cros_ec ac97_bus snd_intel_sdw_acpi gpio_cros_ec cros_kbd_led_backlight led_class_multicolor cros_ec_sysfs cros_ec_chardev snd_pcm_dmaengine cros_usbpd_logger hid_sensor_als cros_charge_control mac80211 cros_usbpd_notify cros_ec_hwmon mousedev cros_ec_debugfs snd_hda_codec videobuf2_vmalloc snd_rpl_pci_acp6x hid_sensor_trigger snd_acp_pci uvc kvm_amd industrialio_triggered_buffer
[   60.543677]  snd_hda_core snd_acp_legacy_common videobuf2_memops kfifo_buf snd_pci_acp6x hid_sensor_iio_common snd_hwdep libarc4 btusb industrialio videobuf2_v4l2 cros_ec_dev spd5118 snd_pcm snd_pci_acp5x btrtl sp5100_tco cfg80211 videodev snd_timer btintel kvm snd_rn_pci_acp3x joydev hid_multitouch hid_sensor_hub btbcm ucsi_acpi snd_acp_config snd videobuf2_common typec_ucsi btmtk cros_ec_lpcs i2c_piix4 snd_soc_acpi cros_ec bluetooth mc wmi_bmof cdc_acm rapl typec pcspkr amd_pmf thunderbolt snd_pci_acp3x rfkill soundcore i2c_smbus k10temp roles amdtee i2c_hid_acpi amd_sfh platform_profile i2c_hid amd_pmc mac_hid i2c_dev crypto_user loop nfnetlink ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod hid_generic usbhid amdgpu crct10dif_pclmul crc32_pclmul crc16 crc32c_intel amdxcp i2c_algo_bit polyval_clmulni drm_ttm_helper polyval_generic ghash_clmulni_intel ttm serio_raw sha512_ssse3 drm_exec atkbd sha256_ssse3 libps2 sha1_ssse3
[   60.543800]  gpu_sched aesni_intel drm_suballoc_helper vivaldi_fmap nvme gf128mul drm_buddy crypto_simd drm_display_helper nvme_core cryptd xhci_pci i8042 ccp video cec xhci_pci_renesas nvme_auth serio wmi
[   60.543825] CPU: 12 UID: 0 PID: 439 Comm: kworker/u64:6 Not tainted 6.11.11-1-MANJARO #1 bd5d6dc86bb7f74bdd4f3d28006de2c066abf4fd
[   60.543831] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.05 03/29/2024
[   60.543834] Workqueue: events_unbound commit_work
[   60.543841] RIP: 0010:dpp3_deferred_update+0x101/0x330 [amdgpu]
[   60.544131] Code: 83 78 e1 00 00 0f b6 90 a8 02 00 00 48 8b 83 70 e1 00 00 8b b0 78 04 00 00 e8 7b c0 11 00 8b 74 24 04 85 f6 0f 84 5d 01 00 00 <0f> 0b 0f b6 83 48 96 00 00 83 e0 f7 88 83 48 96 00 00 a8 01 0f 84
[   60.544132] RSP: 0018:ffffb9878061fba0 EFLAGS: 00010202
[   60.544134] RAX: 0000000000000066 RBX: ffff94e1d4b80000 RCX: 0000000000000004
[   60.544135] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff94e1d3a80000
[   60.544136] RBP: ffff94e232080000 R08: ffffb9878061fba4 R09: ffffb9878061fbd0
[   60.544137] R10: ffffb9878061fb48 R11: 0000000000000000 R12: 0000000000000000
[   60.544138] R13: ffff94e2320840a8 R14: ffff94e232085f78 R15: ffff94e20411f200
[   60.544139] FS:  0000000000000000(0000) GS:ffff94e91e800000(0000) knlGS:0000000000000000
[   60.544140] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.544141] CR2: 000072282420d000 CR3: 0000000732222000 CR4: 0000000000f50ef0
[   60.544142] PKRU: 55555554
[   60.544143] Call Trace:
[   60.544146]  <TASK>
[   60.544147]  ? dpp3_deferred_update+0x101/0x330 [amdgpu bb0261db58d620ad314df009a0edb3b84ae8ec8d]
[   60.544297]  ? __warn.cold+0x8e/0xe8
[   60.544300]  ? dpp3_deferred_update+0x101/0x330 [amdgpu bb0261db58d620ad314df009a0edb3b84ae8ec8d]
[   60.544444]  ? report_bug+0xff/0x140
[   60.544448]  ? handle_bug+0x58/0x90
[   60.544449]  ? exc_invalid_op+0x17/0x70
[   60.544451]  ? asm_exc_invalid_op+0x1a/0x20
[   60.544454]  ? dpp3_deferred_update+0x101/0x330 [amdgpu bb0261db58d620ad314df009a0edb3b84ae8ec8d]
[   60.544595]  dc_post_update_surfaces_to_stream+0x1b1/0x2b0 [amdgpu bb0261db58d620ad314df009a0edb3b84ae8ec8d]
[   60.544743]  amdgpu_dm_atomic_commit_tail+0x2cca/0x3ab0 [amdgpu bb0261db58d620ad314df009a0edb3b84ae8ec8d]
[   60.544912]  ? srso_alias_return_thunk+0x5/0xfbef5
[   60.544914]  ? drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x162/0x3a0
[   60.544917]  ? srso_alias_return_thunk+0x5/0xfbef5
[   60.544919]  ? dma_fence_default_wait+0x8b/0x250
[   60.544922]  ? srso_alias_return_thunk+0x5/0xfbef5
[   60.544923]  ? wait_for_completion_timeout+0x130/0x180
[   60.544925]  ? srso_alias_return_thunk+0x5/0xfbef5
[   60.544926]  ? dma_fence_wait_timeout+0x108/0x140
[   60.544930]  commit_tail+0x91/0x130
[   60.544932]  process_one_work+0x17b/0x330
[   60.544936]  worker_thread+0x2ce/0x3f0
[   60.544938]  ? __pfx_worker_thread+0x10/0x10
[   60.544940]  kthread+0xcf/0x100
[   60.544942]  ? __pfx_kthread+0x10/0x10
[   60.544945]  ret_from_fork+0x31/0x50
[   60.544947]  ? __pfx_kthread+0x10/0x10
[   60.544949]  ret_from_fork_asm+0x1a/0x30
[   60.544953]  </TASK>
[   60.544954] ---[ end trace 0000000000000000 ]---
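To confirm which kworker threads are spinning when this happens (from a TTY or SSH session, if one still responds), a generic approach is to sample each thread's CPU time twice. A minimal Python sketch, not something posted in this thread: it reads utime and stime (fields 14 and 15 of /proc/<pid>/stat, per the proc(5) man page) and reports usage over a short interval:

```python
"""Find kworker threads burning CPU by sampling /proc/<pid>/stat twice (sketch)."""
from __future__ import annotations

import os
import time


def parse_stat(line: str) -> tuple[str, int]:
    """Return (comm, utime + stime in clock ticks) from a /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')'.
    start, end = line.index("("), line.rindex(")")
    comm = line[start + 1:end]
    fields = line[end + 1:].split()
    # fields[0] is field 3 (state), so utime (field 14) is fields[11]
    # and stime (field 15) is fields[12].
    return comm, int(fields[11]) + int(fields[12])


def snapshot(prefix: str = "kworker") -> dict[int, tuple[str, int]]:
    """Map pid -> (comm, cpu ticks) for every task whose comm starts with prefix."""
    snap: dict[int, tuple[str, int]] = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat", encoding="utf-8", errors="replace") as fh:
                comm, ticks = parse_stat(fh.read())
        except OSError:
            continue  # task exited between listdir and read
        if comm.startswith(prefix):
            snap[int(pid)] = (comm, ticks)
    return snap


def busiest(interval: float = 1.0, prefix: str = "kworker") -> list[tuple[float, int, str]]:
    """Return (cpu %, pid, comm) sorted by CPU usage over the sampling interval."""
    hz = os.sysconf("SC_CLK_TCK")
    before = snapshot(prefix)
    time.sleep(interval)
    after = snapshot(prefix)
    usage = []
    for pid, (comm, ticks) in after.items():
        if pid in before:
            pct = 100.0 * (ticks - before[pid][1]) / (hz * interval)
            usage.append((pct, pid, comm))
    return sorted(usage, reverse=True)


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pct, pid, comm in busiest()[:5]:
        print(f"{pid:>7} {pct:5.1f}%  {comm}")
```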

It might be this issue here: https://gitlab.freedesktop.org/drm/amd/-/issues/3647

The last time it happened, I was able to recover by triggering a GPU reset with cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover

I will reboot with the kernel parameter amdgpu.dcdebugmask=0x10 added and see if it helps.
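If the bit values in the kernel's DC_DEBUG_MASK enum are as I assume (0x10 = DC_DISABLE_PSR, 0x400 = DC_DISABLE_REPLAY), then 0x10 covers PSR only and leaves Panel Replay enabled. A minimal Python sketch to check what the running kernel was actually booted with, by parsing /proc/cmdline; the constants are assumptions, so verify them for your kernel:

```python
"""Check which amdgpu.dcdebugmask value the running kernel was booted with (sketch)."""
from __future__ import annotations

# ASSUMED bit values from enum DC_DEBUG_MASK (amd_shared.h); verify for your kernel.
DC_DISABLE_PSR = 0x10
DC_DISABLE_REPLAY = 0x400


def dcdebugmask_from_cmdline(cmdline: str) -> int | None:
    """Return the amdgpu.dcdebugmask value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if sep and key == "amdgpu.dcdebugmask":
            # base 0 accepts both "0x400" and "1024"; if the parameter is given
            # more than once, keep the last occurrence.
            value = int(raw, 0)
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline", encoding="utf-8") as fh:
            mask = dcdebugmask_from_cmdline(fh.read())
    except FileNotFoundError:
        mask = None
    if mask is None:
        print("amdgpu.dcdebugmask not set on the kernel command line")
    else:
        print(f"dcdebugmask={mask:#x} "
              f"PSR off: {bool(mask & DC_DISABLE_PSR)} "
              f"Panel Replay off: {bool(mask & DC_DISABLE_REPLAY)}")
```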

Looks to me like the same issue discussed here: http://community.frame.work/t/fullscreen-games-freeze-on-plasma-6-with-dgpu/

amdgpu.dcdebugmask=0x400, which I believe simply disables Panel Replay, has been working for most of us. The issue was supposed to be fixed a while ago, but clearly there are still some problems with it. Personally, a 6.13 kernel I built myself had no issue, but Fedora's Rawhide build of the same version did, so maybe it's a kernel configuration difference? Not sure.