Hello folks, I have a similar (or the same) problem, but on a rather different setup.
System:
Host: Kernel: 6.1.138-1-MANJARO arch: x86_64 bits: 64
Desktop: GNOME v: 48.2 Distro: Manjaro Linux
Machine:
Type: Laptop System: ThinkPad T14 Gen 3
CPU:
Info: 8-core model: AMD Ryzen 7 PRO 6850U with Radeon Graphics
Graphics:
Device-1: Advanced Micro Devices [AMD/ATI] Rembrandt [Radeon 680M]
driver: amdgpu v: kernel
Display: unspecified server: X.org v: 1.21.1.18 with: Xwayland v: 24.1.8
driver: X: loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi
gpu: amdgpu resolution: 3840x2400~60Hz
I’m using GNOME Shell 48.
The symptoms are similar: random freezes with no clear pattern. They happen while actively working or idling, and while suspending or waking up. Virtual terminals are not accessible, and Caps Lock doesn’t blink (the usual kernel panic indicator).
I’ve tried to enable crash dumps via kdump. The crash kernel loads fine:
# cat /sys/kernel/kexec_crash_loaded
1
but it never boots when a crash happens, not even during the echo c > /proc/sysrq-trigger test.
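In case someone wants to compare setups, here is a rough sketch that collects the kdump prerequisites in one place. The function name kdump_state and its optional arguments are made up for illustration; the files it reads (kexec_crash_loaded and kexec_crash_size under /sys/kernel, plus /proc/cmdline) are the standard kernel interfaces. It only reads state and never triggers a crash.

```shell
#!/bin/sh
# kdump_state [SYS_KERNEL_DIR] [CMDLINE_FILE]
# Hypothetical helper: prints one line per kdump prerequisite.
# The defaults point at the real kernel interfaces; the arguments exist
# only so the function can be tried against copies of those files.
kdump_state() {
    sysdir=${1:-/sys/kernel}
    cmdline=${2:-/proc/cmdline}

    # 1 means a crash (panic) kernel image is currently loaded
    loaded=$(cat "$sysdir/kexec_crash_loaded" 2>/dev/null || echo unknown)
    # bytes of RAM reserved for the crash kernel (0 = nothing reserved)
    size=$(cat "$sysdir/kexec_crash_size" 2>/dev/null || echo unknown)
    # the crashkernel= token the running kernel was booted with, if any
    param=$(cat "$cmdline" 2>/dev/null | tr ' ' '\n' | grep '^crashkernel=' | head -n 1)

    echo "loaded=$loaded"
    echo "reserved_bytes=$size"
    echo "cmdline=${param:-missing}"
}
```

Calling kdump_state without arguments reports on the running system. If loaded is 1 but reserved_bytes is 0, the crashkernel= reservation probably did not take effect, which would be one possible explanation for a crash kernel that never boots.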
I’ve changed kernel’s cmdline to gather more traces:
crashkernel=256M oops=panic panic_print=32 printk.always_kmsg_dump=1 loglevel=7 panic=10 sysrq_always_enabled=1
When I caused a crash via /proc/sysrq-trigger or the SysRq hotkey, the machine restarted itself after 10 seconds (due to panic=10). But when it freezes, that doesn’t happen. I also don’t think this problem is a kernel panic.
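To rule out a bootloader that silently dropped some of these options, the running command line can be checked against the intended list. This is only a sketch: missing_params is a hypothetical helper name, and it compares names literally, so it does not know that the kernel treats "-" and "_" in parameter names as interchangeable.

```shell
#!/bin/sh
# missing_params CMDLINE_FILE KEY...
# Hypothetical helper: prints every KEY that does not appear in the given
# command line, either as a bare word or as KEY=value.
missing_params() {
    file=$1
    shift
    for key in "$@"; do
        # one token per line; awk splits each token at "=" and compares
        # the part before it with the wanted name, exactly
        if ! tr -s ' ' '\n' < "$file" |
             awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }'
        then
            echo "$key"
        fi
    done
}
```

For the options listed above that would be: missing_params /proc/cmdline crashkernel oops panic_print printk.always_kmsg_dump loglevel panic sysrq_always_enabled. No output means every option reached the kernel.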
I only learned about pstore from this topic, and I can’t make it work either: it’s empty in my case too.
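An empty /sys/fs/pstore does not necessarily mean nothing was recorded. Two things seem worth checking, both assumptions on my side rather than confirmed causes: whether a pstore backend is registered at all, and whether systemd-pstore has already moved the records into /var/lib/systemd/pstore (its default behaviour is to archive them there and remove them from the pstore mount). The helper name pstore_report below is made up; the paths are the usual ones.

```shell
#!/bin/sh
# pstore_report [BACKEND_FILE] [DIR...]
# Hypothetical helper: shows the pstore backend and counts record files
# in the live pstore mount and in the systemd-pstore archive directory.
pstore_report() {
    backend_file=${1:-/sys/module/pstore/parameters/backend}
    [ $# -gt 0 ] && shift
    # default locations: kernel mount point, then systemd-pstore archive
    [ $# -eq 0 ] && set -- /sys/fs/pstore /var/lib/systemd/pstore

    echo "backend=$(cat "$backend_file" 2>/dev/null || echo unavailable)"
    for dir in "$@"; do
        # count regular files at any depth; an unreadable dir counts as 0
        n=$(find "$dir" -type f 2>/dev/null | wc -l)
        echo "$dir: $((n)) record file(s)"
    done
}
```

Run as root, since the archive directory may not be readable otherwise. A backend of "(null)" or "unavailable" would suggest that no backend is registered, in which case pstore cannot store anything regardless of the other settings.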
The “Pageflip timed out” error mentioned above has never happened to me, so it could be unrelated.
The only patterns I’ve found so far:
- GNOME Terminal increases the chance of a crash a lot, from “a few times a week” to “many times a day”. Guake, Firefox, and other common programs don’t cause that.
- As expected, suspend and resume are risky. Sometimes it takes minutes to suspend and then greets me with a black screen after resume. Sometimes it even keeps running with the lid closed.
It could also be several independent issues. I’ve noticed different behaviors:
- Sometimes I can toggle the Caps Lock indicator, sometimes I can’t. Other LEDs (mute, micmute) are always unresponsive during a crash.
- Sometimes it stays connected to Wi-Fi, in other cases it disconnects instantly. When it’s connected I can ping it and even initiate an SSH session, but it never completes the handshake. Once it even successfully ran a cron job that updates pacman’s keys during a “crash”.
- Sometimes SysRq hotkeys work. I was able to sync (SysRq+s) and reboot (SysRq+b). In such cases I see some post-crash logs in journald. In other cases SysRq doesn’t work, but Ctrl-Alt-Del does.
Emergency sync has saved some suspicious logs from around the crash:
One crash, from the kernel:
kernel: ------------[ cut here ]------------
kernel: amdgpu 0000:04:00.0: drm_WARN_ON(!dev->mode_config.poll_enabled)
kernel: WARNING: CPU: 9 PID: 469202 at drivers/gpu/drm/drm_probe_helper.c:838 drm_kms_helper_poll_disable+0x55/0x60
kernel: Modules linked in: nf_conntrack_netlink veth uinput rfcomm rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache netfs xt_conntrack xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat ccm michael_mic overlay cmac algif_hash algif_skcipher af_alg bnep tun nf_tables qrtr_mhi btusb btrtl btbcm uvcvideo btintel videobuf2_vmalloc videobuf2_memops btmtk videobuf2_v4l2 videobuf2_common bluetooth videodev ecdh_generic mc crc16 snd_soc_acp6x_mach snd_acp6x_pdm_dma snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp cdc_mbim snd_sof_pci cdc_wdm cdc_ncm snd_sof cdc_ether snd_sof_utils option usbnet usb_wwan mii snd_ctl_led qrtr vfat snd_soc_core snd_hda_codec_realtek ath11k_pci snd_hda_codec_hdmi fat snd_hda_codec_generic snd_compress ath11k intel_rapl_msr ac97_bus snd_pcm_dmaengine intel_rapl_common snd_hda_intel snd_pci_ps qmi_helpers snd_rpl_pci_acp6x
kernel: snd_intel_dspcfg snd_acp_pci edac_mce_amd snd_intel_sdw_acpi snd_pci_acp6x mac80211 snd_pci_acp5x kvm_amd snd_hda_codec r8169 libarc4 snd_rn_pci_acp3x joydev mousedev snd_acp_config snd_hda_core realtek kvm cfg80211 ucsi_acpi snd_soc_acpi think_lmi sp5100_tco typec_ucsi mdio_devres snd_hwdep hid_multitouch irqbypass snd_pcm rapl psmouse pcspkr typec firmware_attributes_class wmi_bmof k10temp snd_timer snd_pci_acp3x libphy mhi i2c_piix4 roles i2c_hid_acpi amd_pmc i2c_hid acpi_cpufreq acpi_tad mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) crypto_user loop fuse nfnetlink bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic drm_ttm_helper gf128mul serio_raw ghash_clmulni_intel ttm sha512_ssse3 atkbd libps2 thinkpad_acpi vivaldi_fmap sha256_ssse3 ledtrig_audio sha1_ssse3 platform_profile gpu_sched
kernel: snd aesni_intel nvme crypto_simd soundcore drm_buddy cryptd rfkill drm_display_helper nvme_core video xhci_pci ccp cec i8042 nvme_common xhci_pci_renesas serio wmi
kernel: CPU: 9 PID: 469202 Comm: kworker/u32:37 Kdump: loaded Tainted: G W OE 6.1.138-1-MANJARO #1
kernel: Workqueue: events_unbound async_run_entry_fn
kernel: RIP: 0010:drm_kms_helper_poll_disable+0x55/0x60
kernel: Code: 85 d2 75 03 48 8b 17 48 89 14 24 e8 55 5f 01 00 48 8b 14 24 48 c7 c1 a0 50 18 b3 48 c7 c7 b7 74 0e b3 48 89 c6 e8 3b 1c 8a ff <0f> 0b 48 83 c4 08 e9 00 ea 7f 00 f3 0f 1e fa 0f 1f 44 00 00 55 53
kernel: RSP: 0018:ffffb5cd48c6fd90 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff8ef294700010 RCX: 0000000000000027
kernel: RDX: ffff8ef99f061668 RSI: 0000000000000001 RDI: ffff8ef99f061660
kernel: RBP: ffff8ef294700000 R08: ffffffffb385c800 R09: 000000000000002b
kernel: R10: ffffffffb30e74c0 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: 0000000000000002 R14: 0000000000000000 R15: ffff8ef283d3c1a8
kernel: FS: 0000000000000000(0000) GS:ffff8ef99f040000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fbcc8a5f5a8 CR3: 00000006fee10000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: <TASK>
kernel: amdgpu_device_suspend+0x59/0x170 [amdgpu 49f253dc7aa5235aab2889663ce44902685c303b]
kernel: ? srso_alias_return_thunk+0x5/0x7f
kernel: pci_pm_suspend+0x80/0x170
kernel: ? pci_pm_freeze+0xc0/0xc0
kernel: dpm_run_callback+0x4a/0x150
kernel: __device_suspend+0x12f/0x4f0
kernel: ? srso_alias_return_thunk+0x5/0x7f
kernel: async_suspend+0x21/0xa0
kernel: ? srso_alias_return_thunk+0x5/0x7f
kernel: async_run_entry_fn+0x34/0x130
kernel: process_one_work+0x1cf/0x3a0
kernel: ? process_one_work+0x3a0/0x3a0
kernel: worker_thread+0x50/0x390
kernel: ? process_one_work+0x3a0/0x3a0
kernel: kthread+0xde/0x110
kernel: ? kthread_complete_and_exit+0x20/0x20
kernel: ret_from_fork+0x22/0x30
kernel: </TASK>
kernel: ---[ end trace 0000000000000000 ]---
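To see whether this warning shows up in other boots too, the splats can be pulled out of a saved journal and grouped. A minimal sketch, assuming the log contains the usual "[ cut here ]" and "[ end trace ... ]" markers as in the trace above; both function names are hypothetical.

```shell
#!/bin/sh
# kernel_warn_blocks [FILE]
# Hypothetical helper: prints only the lines between the kernel's
# "[ cut here ]" and "[ end trace ...]" markers, from FILE or stdin.
kernel_warn_blocks() {
    sed -n '/\[ cut here ]/,/\[ end trace /p' "$@"
}

# warn_summary [FILE]
# Prints each distinct WARNING headline with the number of occurrences.
# CPU and PID are stripped so repeats of the same warning group together.
warn_summary() {
    kernel_warn_blocks "$@" |
        sed -n 's/^.*\(WARNING: .*\)$/\1/p' |
        sed 's/CPU: [0-9]* PID: [0-9]* //' |
        sort | uniq -c | sort -rn
}
```

For example, journalctl -k -b -1 --no-pager | warn_summary summarizes the kernel warnings of the previous boot.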
Another case, from gdm-x-session:
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
But according to the logs, these messages were sometimes posted without fatal consequences.
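To check whether these messages actually correlate with the freezes, counting them per boot and comparing against the boots that ended in a freeze might help. A rough sketch; flip_failures is a hypothetical helper, and the loop in the comment assumes a persistent journal that still holds the older boots.

```shell
#!/bin/sh
# flip_failures [FILE]
# Hypothetical helper: counts "Page flip failed" warnings in FILE or stdin.
flip_failures() {
    # grep -c prints the number of matching lines, but exits non-zero
    # when that number is 0; "|| true" keeps the helper usable with set -e
    grep -c -F 'Page flip failed: Invalid argument' "$@" || true
}

# Possible use, one count per boot (0 = current, -1 = previous, ...):
#   for b in 0 -1 -2 -3; do
#       printf '%s\t' "$b"
#       journalctl -b "$b" --no-pager 2>/dev/null | flip_failures
#   done
```

A boot with a high count that nevertheless shut down cleanly would support the impression that these warnings are not fatal by themselves.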