Random freezes with Arch Linux (requires hard reboot)

Hi,

Another way for people to help diagnose this problem is for them to purchase or make an EC CCD (Closed Case Debug).
For example, I have one of the ones from here:

It works on both FW13 and FW16.

It permits console port access to the EC.
For example, I have modified my Linux kernel to output port80 codes when it panics.
After a freeze and a reboot, I can then go into the EC and dump the history of port80 codes, so I can see whether it panicked or froze without panicking.
That narrows the problem down a little.
So, if more people had an EC CCD, we might find the root cause quicker.

It is interesting that @Yam found that no sysrq keys worked.
For example, if you press the sysrq reboot key, and it does not reboot, it implies that it is not in a normal kernel panic handler at the time of the freeze.
Ok, we don’t know what it is, but it is helping us discount lots of things it might be.
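
Before reading too much into a dead SysRq key, it may be worth ruling out that the key is simply restricted by the kernel.sysrq setting. The sketch below decodes a value read from /proc/sys/kernel/sysrq; the helper name sysrq_allows_reboot is made up for illustration, and the bit meanings are the ones given in the kernel's SysRq documentation.

```shell
# Report whether a kernel.sysrq value permits the SysRq reboot key (SysRq+b).
# Per the kernel's SysRq documentation: 0 disables SysRq, 1 enables every
# function, and larger values are a bitmask in which 128 allows reboot/poweroff.
# sysrq_allows_reboot is an illustrative helper name, not an existing tool.
sysrq_allows_reboot() {
    value=$1
    if [ "$value" -eq 1 ]; then
        echo yes
    elif [ $(( value & 128 )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}
```

On a running system this could be called as sysrq_allows_reboot "$(cat /proc/sys/kernel/sysrq)". If it prints no, a SysRq reboot that does nothing says little about the state of the kernel.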

@jared_kidd I’m sorry, but how would that help when the LTS kernel also has this issue? I’m really thinking this isn’t just a kernel issue at this point.

Thanks, @James3! That looks very interesting, and I’d definitely be down to get one and try to debug Framework issues. I’m just not sure how to use it.

I can also confirm SysRq not working during a freeze, even though those keys work normally otherwise. I suspect it might be a CPU freeze caused by a power state change; I don’t think I have ever seen it freeze in performance mode.

I’m having visual glitches here and there, they’re very sudden but also very short, as described here https://gitlab.freedesktop.org/drm/amd/-/issues/3388

One of the solutions mentioned is to set /sys/class/drm/card1/device/power_dpm_force_performance_level to high instead of auto, which seems to definitely fix the problem for me right now. I’m wondering if this is related to (at least) one of the crashes experienced.
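
For anyone wanting to try the same workaround, here is a minimal sketch of reading and changing that setting. The default path is the one quoted above; the card index (card0, card1, ...) differs between machines, and the helper name dpm_level is only illustrative.

```shell
# Show, and optionally change, the amdgpu DPM performance level.
# Usage: dpm_level [LEVEL] [FILE]
# LEVEL may be empty to only read; FILE defaults to the path quoted in the post.
# dpm_level is an illustrative helper name.
dpm_level() {
    level=${1:-}
    file=${2:-/sys/class/drm/card1/device/power_dpm_force_performance_level}
    if [ -n "$level" ]; then
        # Writing the real sysfs file requires root and does not survive a reboot.
        printf '%s\n' "$level" > "$file" || return 1
    fi
    cat "$file"
}
```

Running dpm_level prints the current level, dpm_level high (as root) forces the high level, and dpm_level auto reverts to the default behaviour.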

I think these crashes have nothing to do with the Linux kernel or even Arch; they really seem related to the AMD CPU and possibly the Framework laptop in general. I’m still keeping a close eye on issue #41, “FW16 Freeze then Reboot (FTR)”, in FrameworkComputer/SoftwareFirmwareIssueTracker on GitHub, as it seems to be going somewhere.

I don’t have those visual glitches (only audio glitches). Thanks for the links. One thing to note: I’m on dual boot and I’ve never seen these freezes on Windows, though I do use Linux more, and when I’m on Windows I’m almost always plugged in.

Coming back here after a full month.

I haven’t experienced a single crash in a full month of intensive usage, until today, when it randomly froze again. Had to manually shut it down, then like 20 minutes later it froze again but restarted on its own.

I can’t say if I was just lucky or if something made it not crash during this month, but it’s definitely a big mystery still. Worth noting that I have many visual glitches happening still, but only for a few seconds in a whole day.

For those having the same problem, I suggest also following “FRWK16 - Random Crash then Reboots” (post #59 by sinatosk); we’re slowly but surely getting somewhere.

Totally same thing for me.

Also, in my case it kills Bluetooth; see “Manjaro sway bluetooth no default controller” (post #2 by Dmitry_Kisel).

Although the freezes are less frequent than before, I noticed something interesting.

Whenever the computer freezes completely (and doesn’t auto-restart), the next time I reboot, I’m almost certain to experience another freeze in the upcoming hours (if not minutes), but the computer will auto-restart this time. Thereafter, it comes back to normal, and it’s up to good luck whether I get another crash or not.

I’m here to drop another data point. I’m on a completely different OS (Nitrux) and kernel (6.11.5-1-liquorix-amd64), I am using Wayland, and I have the GPU. I’m experiencing random freezes occasionally, but it never reboots on its own. Today was the first time it froze more than once.

And another, Manjaro, Arch based but…

It was only reproducible on KDE with X11, not with Wayland. With Wayland I have had no freezes yet.

Of course, no usable logs 🙁

Hardware: a fresh Framework AI 300 laptop.

To anyone having problems with the freeze: switching to a distro with a more recent kernel (>6.13.11) should fix it. I haven’t had a freeze since upgrading.
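
A quick way to check whether a machine is already past that version is sketched below. It assumes `sort -V` (version sort, from GNU coreutils) is available; the helper name kernel_newer_than is made up for illustration.

```shell
# Succeed when kernel release $1 is strictly newer than version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
# kernel_newer_than is an illustrative helper name.
kernel_newer_than() {
    have=${1%%-*}    # drop the distro suffix, e.g. 6.12.34-1-MANJARO -> 6.12.34
    want=$2
    [ "$have" != "$want" ] &&
        [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | tail -n 1)" = "$have" ]
}
```

For example, kernel_newer_than "$(uname -r)" 6.13.11 && echo "newer than 6.13.11" tests the running kernel. Passing the check only confirms the version, not that the freezes are gone.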

Well, just yesterday TW 6.14.2 kernel froze when I was reviewing some code in Chrome.

Are you using X11 or wayland? From the poster above, that appears to be a contributing factor. If using X11, you should switch over to wayland. I am on wayland and have never had a freeze. amdgpu crashes that recover, yes, but never an outright freeze.

I’m using wayland & gnome shell. The instability is really bothering me. Random, but still somewhat recoverable GPU lockups were obviously there before, but it became significantly worse with 6.13 kernel and mesa 25 becoming available, with screen corruption as reported in other threads and random freezes.

Hi!
I had almost the same issue with random freezes on my laptop with an AMD Ryzen 2500U. What I found is that the problem lies with the open-source AMDGPU driver; most probably it switches to a wrong power profile (read more about power profiles in sections 5.4 and 5.5 of the AMDGPU page on the ArchWiki). For example, when I tried to activate the COMPUTE power profile mode, my system froze almost instantly. However, with the BOOTUP_DEFAULT power profile everything is stable and works perfectly. The same stability holds when I use the low or high performance levels. To verify that we have the same problem, try the low, high, profile_standard or BOOTUP_DEFAULT settings.
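
To see which power profile the driver is currently using, something like the following sketch may help. It assumes the amdgpu pp_power_profile_mode sysfs file marks the active profile with an asterisk; the table layout differs between GPU generations and kernel versions, so the whole line is printed rather than parsed. The default card index and the helper name active_power_profile are assumptions.

```shell
# Print the line(s) of a pp_power_profile_mode table that carry the
# asterisk assumed to mark the active profile.
# The default path uses card0 as a guess; pass the right file for your GPU.
# active_power_profile is an illustrative helper name.
active_power_profile() {
    file=${1:-/sys/class/drm/card0/device/pp_power_profile_mode}
    grep -F '*' "$file"
}
```

Comparing the output right after boot with the output after a freeze-prone workload would show whether the profile changed in between.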

P.S. This might also be interesting: the “GPU Power/Thermal Controls and Monitoring” page of the Linux kernel documentation.

P.P.S. I also added rcu_nocbs=0-7 to the kernel command line (the range covers the CPU’s threads, 0 to 7 here). It did not solve the issue, but it might have increased the general stability of the system.
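
Since the range depends on the thread count, here is a small sketch that builds the argument for a given number of threads. The helper name rcu_nocbs_arg is made up for illustration.

```shell
# Build an rcu_nocbs= argument covering CPUs 0..N-1 for N hardware threads.
# rcu_nocbs_arg is an illustrative helper name.
rcu_nocbs_arg() {
    threads=$1
    echo "rcu_nocbs=0-$(( threads - 1 ))"
}
```

Calling rcu_nocbs_arg "$(nproc --all)" prints the value for the current machine; it still has to be added to the bootloader configuration by hand.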

Hello folks, I have a similar or the same problem, but on a rather different setup.

System:
  Host: Kernel: 6.1.138-1-MANJARO arch: x86_64 bits: 64
  Desktop: GNOME v: 48.2 Distro: Manjaro Linux
Machine:
  Type: Laptop System: ThinkPad T14 Gen 3
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 6850U with Radeon Graphics
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Rembrandt [Radeon 680M]
    driver: amdgpu v: kernel
  Display: unspecified server: X.org v: 1.21.1.18 with: Xwayland v: 24.1.8
    driver: X: loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi
    gpu: amdgpu resolution: 3840x2400~60Hz

I’m using Gnome Shell 48.

Similar symptoms — random freezes with no clear pattern. When actively working or idling, when suspending or waking up. Virtual terminal is not accessible, Caps lock doesn’t blink (usual kernel panic indicator).

I’ve tried to enable crashdump / kdump. The crash kernel loads fine:

# cat /sys/kernel/kexec_crash_loaded
1

but it never boots when a crash happens, nor during an echo c > /proc/sysrq-trigger test.

I’ve changed kernel’s cmdline to gather more traces:

crashkernel=256M oops=panic panic_print=32 printk.always_kmsg_dump=1 loglevel=7 panic=10 sysrq_always_enabled=1

When I caused a crash via /proc/sysrq-trigger or the SysRq hotkey, the machine restarted itself after 10 seconds (due to panic=10). But when it freezes, that doesn’t happen, so I also don’t think this problem is a kernel panic.
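
When testing a setup like this, it is easy to edit the bootloader configuration and end up booting without the new parameters. A minimal sketch for confirming they actually reached the running kernel follows; the helper name missing_params is made up for illustration.

```shell
# Print every wanted parameter that does not appear as a whole word in a
# kernel command line string. missing_params is an illustrative helper name.
missing_params() {
    cmdline=$1
    shift
    for want in "$@"; do
        case " $cmdline " in
            *" $want "*) ;;          # present, nothing to report
            *) echo "$want" ;;       # absent from the command line
        esac
    done
}
```

For the parameters above this could be run as missing_params "$(cat /proc/cmdline)" crashkernel=256M oops=panic panic=10 sysrq_always_enabled=1. No output means all of them are active.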

I only learned about pstore from this topic, and I can’t make it work either: it’s empty in my case.
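
A small sketch for checking whether pstore holds anything after a reboot is shown below. The helper name pstore_records is made up; /sys/fs/pstore is the usual mount point and normally needs root to read. On systemd-based systems the systemd-pstore service may have already moved records to /var/lib/systemd/pstore, so that directory is worth checking too (whether that applies here is an assumption).

```shell
# Count the entries in a pstore directory; 0 means nothing was preserved
# from an earlier crash (or the directory is missing or unreadable).
# pstore_records is an illustrative helper name.
pstore_records() {
    dir=${1:-/sys/fs/pstore}
    [ -d "$dir" ] || { echo 0; return; }
    ls -A "$dir" 2>/dev/null | wc -l | tr -d ' '
}
```

Typical use would be pstore_records as root, followed by pstore_records /var/lib/systemd/pstore.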

The “pageflip timed out” error mentioned above has never happened to me, so it could be unrelated.

The only patterns I’ve found so far:

  • Gnome Terminal increases chances of the crash a lot. From “few times a week” to “many times a day”. Guake terminal, Firefox, other common programs don’t cause that.
  • [Expectedly] suspend & resume are risky. Sometimes it takes minutes to suspend, and then greets me with a black screen after resume. Or it even keeps running with the lid closed.

It could also be several independent issues. I’ve noticed different behaviors:

  • Sometimes I can toggle the Caps Lock indicator, sometimes I can’t. Other LEDs (mute, micmute) are always unresponsive during a crash.
  • Sometimes it stays connected to Wi-Fi, in other cases it disconnects instantly. When it’s connected I can ping it and even initiate an SSH session, but it never completes the handshake. Once it successfully executed a cronjob to update pacman’s keys during a “crash”.
  • Sometimes SysRq hotkeys work. I was able to sync (SysRq+s) and reboot (SysRq+b). In such cases I see some post-crash logs in journald. In other cases SysRq doesn’t work, but Ctrl-Alt-Del does.

Emergency Sync has saved me some suspicious logs during the crash:

One crash — from kernel
kernel: ------------[ cut here ]------------
kernel: amdgpu 0000:04:00.0: drm_WARN_ON(!dev->mode_config.poll_enabled)
kernel: WARNING: CPU: 9 PID: 469202 at drivers/gpu/drm/drm_probe_helper.c:838 drm_kms_helper_poll_disable+0x55/0x60
kernel: Modules linked in: nf_conntrack_netlink veth uinput rfcomm rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache netfs xt_conntrack xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat ccm michael_mic overlay cmac algif_hash algif_skcipher af_alg bnep tun nf_tables qrtr_mhi btusb btrtl btbcm uvcvideo btintel videobuf2_vmalloc videobuf2_memops btmtk videobuf2_v4l2 videobuf2_common bluetooth videodev ecdh_generic mc crc16 snd_soc_acp6x_mach snd_acp6x_pdm_dma snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp cdc_mbim snd_sof_pci cdc_wdm cdc_ncm snd_sof cdc_ether snd_sof_utils option usbnet usb_wwan mii snd_ctl_led qrtr vfat snd_soc_core snd_hda_codec_realtek ath11k_pci snd_hda_codec_hdmi fat snd_hda_codec_generic snd_compress ath11k intel_rapl_msr ac97_bus snd_pcm_dmaengine intel_rapl_common snd_hda_intel snd_pci_ps qmi_helpers snd_rpl_pci_acp6x
kernel:  snd_intel_dspcfg snd_acp_pci edac_mce_amd snd_intel_sdw_acpi snd_pci_acp6x mac80211 snd_pci_acp5x kvm_amd snd_hda_codec r8169 libarc4 snd_rn_pci_acp3x joydev mousedev snd_acp_config snd_hda_core realtek kvm cfg80211 ucsi_acpi snd_soc_acpi think_lmi sp5100_tco typec_ucsi mdio_devres snd_hwdep hid_multitouch irqbypass snd_pcm rapl psmouse pcspkr typec firmware_attributes_class wmi_bmof k10temp snd_timer snd_pci_acp3x libphy mhi i2c_piix4 roles i2c_hid_acpi amd_pmc i2c_hid acpi_cpufreq acpi_tad mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) crypto_user loop fuse nfnetlink bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic drm_ttm_helper gf128mul serio_raw ghash_clmulni_intel ttm sha512_ssse3 atkbd libps2 thinkpad_acpi vivaldi_fmap sha256_ssse3 ledtrig_audio sha1_ssse3 platform_profile gpu_sched
kernel:  snd aesni_intel nvme crypto_simd soundcore drm_buddy cryptd rfkill drm_display_helper nvme_core video xhci_pci ccp cec i8042 nvme_common xhci_pci_renesas serio wmi
kernel: CPU: 9 PID: 469202 Comm: kworker/u32:37 Kdump: loaded Tainted: G        W  OE      6.1.138-1-MANJARO #1
kernel: Workqueue: events_unbound async_run_entry_fn
kernel: RIP: 0010:drm_kms_helper_poll_disable+0x55/0x60
kernel: Code: 85 d2 75 03 48 8b 17 48 89 14 24 e8 55 5f 01 00 48 8b 14 24 48 c7 c1 a0 50 18 b3 48 c7 c7 b7 74 0e b3 48 89 c6 e8 3b 1c 8a ff <0f> 0b 48 83 c4 08 e9 00 ea 7f 00 f3 0f 1e fa 0f 1f 44 00 00 55 53
kernel: RSP: 0018:ffffb5cd48c6fd90 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff8ef294700010 RCX: 0000000000000027
kernel: RDX: ffff8ef99f061668 RSI: 0000000000000001 RDI: ffff8ef99f061660
kernel: RBP: ffff8ef294700000 R08: ffffffffb385c800 R09: 000000000000002b
kernel: R10: ffffffffb30e74c0 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: 0000000000000002 R14: 0000000000000000 R15: ffff8ef283d3c1a8
kernel: FS:  0000000000000000(0000) GS:ffff8ef99f040000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fbcc8a5f5a8 CR3: 00000006fee10000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  amdgpu_device_suspend+0x59/0x170 [amdgpu 49f253dc7aa5235aab2889663ce44902685c303b]
kernel:  ? srso_alias_return_thunk+0x5/0x7f
kernel:  pci_pm_suspend+0x80/0x170
kernel:  ? pci_pm_freeze+0xc0/0xc0
kernel:  dpm_run_callback+0x4a/0x150
kernel:  __device_suspend+0x12f/0x4f0
kernel:  ? srso_alias_return_thunk+0x5/0x7f
kernel:  async_suspend+0x21/0xa0
kernel:  ? srso_alias_return_thunk+0x5/0x7f
kernel:  async_run_entry_fn+0x34/0x130
kernel:  process_one_work+0x1cf/0x3a0
kernel:  ? process_one_work+0x3a0/0x3a0
kernel:  worker_thread+0x50/0x390
kernel:  ? process_one_work+0x3a0/0x3a0
kernel:  kthread+0xde/0x110
kernel:  ? kthread_complete_and_exit+0x20/0x20
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---
Another case — from gdm-x-session
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): flip queue failed: Invalid argument
/usr/lib/gdm-x-session[2114]: (WW) AMDGPU(0): Page flip failed: Invalid argument

But according to logs, these messages were sometimes posted without fatal consequences
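
To compare how often these warnings show up in boots that froze versus boots that did not, a rough counting sketch could look like this. The match string is copied from the gdm-x-session messages above, and the helper name count_flip_failures is made up for illustration.

```shell
# Count lines reporting a failed page flip in log text read from stdin.
# The match string is copied from the gdm-x-session messages above.
# count_flip_failures is an illustrative helper name.
count_flip_failures() {
    # grep -c exits non-zero when the count is 0; still report success.
    grep -c 'Page flip failed' || true
}
```

For the previous boot, journalctl -b -1 | count_flip_failures gives the count; other boots can be selected with a different -b offset.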

There’s a known (or suspected) bug in the AMD drivers, discussed here, with symptoms very similar to what you’re describing:

What they’re asking people to do is try this kernel:

There is actually an Arch package that claims to be that amd-staging-drm-next version of the kernel, but to me it looks out of date; I think you may need to compile your own kernel, if you’re comfortable doing that, to try it out. I’m currently trying it myself because I’m having a similar issue; I can let you know what I find if you want.

Hi @AntonR, I can confirm that although I have almost no more crashes as of now, I do remember getting these pageflip timed out errors. I didn’t think much of them at the time as I didn’t find anything very relevant online and the logs were filled with them. Looking at it now, I don’t seem to be getting this error at all anymore, so maybe it was indeed the root cause?

I haven’t had a single mention of pageflip in the entire journal — since Nov 2024.

Since the last message I’ve upgraded kernel to 6.12.34-1-MANJARO (was 6.1.138-1-MANJARO) and already got one freeze with nothing saved in logs

A new victim here 😔. My specs: AMD Ryzen 7 5800H, Arch Linux, kernel 6.17.7

My laptop had this problem from the very beginning, but I fixed it easily since it only appeared when going into sleep mode. So months passed and everything was normal. But today, believe it or not, there were 20 freezes and reboots in 6 hours. It appears in every desktop environment, but not in a TTY. There is nothing interesting in the logs, and I haven’t installed anything special in the last few days. REISUB does not work during the freezes, and the laptop is totally unresponsive to anything. Idk man, is Linux that trash? 😭✌️ It’s like trying to keep an old truck alive by fixing it every time.