Issue
Background
I’m experiencing frequent and severe system freezes, which I’m trying to debug. By a severe freeze, I mean that the system becomes completely unresponsive, requires a hard reset, and displays a glitched-out frame. I strongly suspect these freezes are somehow GPU-related, since they happen mostly when I’m taxing my GPU. For example, my system might freeze after 5 minutes of playing a 3D game or watching a (hw-accelerated) video, especially when doing so on a 4K monitor.
FTR: this has only been happening for the last few months or so, and happens on linux-firmware-git as well as on multiple kernels (at least 6.10, 6.11.2, and the latest LTS kernel).
This issue really deserves a separate thread, however, and is not the main topic here.
Given my symptoms, I suspect either a kernel panic, a hardware fault, or a panic triggered by a hardware fault. The exact culprit is difficult to determine, however, since my crashed kernel seems unable to save any (dmesg) logs whenever the panic is triggered.
The problem: getting kexec to work after triggering a crash
I’m led to believe that in these situations, people often resort to kdump+kexec. That way, one can perform post-mortem analysis on the crashed kernel, and hopefully gather some dmesg output (since dmesg output may still exist in RAM, I guess?). But please correct me if I’m wrong!
My problem is that I’m unable to get kexec to boot the kdump kernel after triggering a crash, be it manually through echo c > /proc/sysrq-trigger or by taxing my GPU. According to the Arch Wiki and other references, setting up kexec should require just a few steps:[1]
# add the crashkernel= parameter
$ efibootmgr --create --disk /dev/nvme0n1 --loader /vmlinuz-linux-lts -b 0 --unicode "root=UUID=<uuid> initrd=\initramfs-linux-lts.img crashkernel=1G"
$ reboot
# load the crash kernel, which should boot into single-user mode after a crash and ask devices to reset themselves:
$ kexec -p /boot/vmlinuz-linux-lts --initrd=/boot/initramfs-linux-lts.img --append="root=UUID=<uuid> irqpoll nr_cpus=1 reset_devices single"
$ sync; echo c > /proc/sysrq-trigger # trigger a manual crash (requires sysrq keys to be enabled)
However, this results in one of two outcomes, depending on which combination of kernel parameters I use. The system either freezes and then requires a hard reset, or it reboots normally but neither loads my kdump kernel nor enters single-user mode as I specified. I should stress that this happens with any combination of the kernel parameters listed above (e.g. without the nr_cpus=1 parameter).
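Since one of the two outcomes is a normal reboot, it may be worth ruling out that the crash kernel was never loaded in the first place. The following is a minimal sketch, not something taken from the Arch Wiki: the function name check_kdump_ready is made up for illustration, and it assumes the kernel exposes /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size. The paths are parameters only so that the function can be tried against ordinary files.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check the kdump prerequisites before triggering a crash.
# Arguments default to the real kernel interfaces but can be overridden for testing.
check_kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
    size_file=${3:-/sys/kernel/kexec_crash_size}

    # The running kernel must have been booted with crashkernel=...
    if ! grep -q 'crashkernel=' "$cmdline_file"; then
        echo "crashkernel= missing from kernel command line"
        return 1
    fi
    # A size of 0 means no memory was actually reserved.
    if [ "$(cat "$size_file")" = "0" ]; then
        echo "no memory reserved for the crash kernel"
        return 1
    fi
    # 1 means a crash kernel has been loaded (e.g. by kexec -p).
    if [ "$(cat "$loaded_file")" != "1" ]; then
        echo "no crash kernel loaded"
        return 1
    fi
    echo "crash kernel loaded, $(cat "$size_file") bytes reserved"
}

# Usage, after running kexec -p and before echo c > /proc/sysrq-trigger:
#   check_kdump_ready
```

If this reports that no crash kernel is loaded, a normal reboot after the panic would be the expected result, and the problem would lie in the kexec -p step rather than in the parameters passed to --append.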
If I just load a kernel with kexec -l and subsequently execute it with kexec -e, everything seems to work fine. Note, however, that this is only the case if I strip the parameters recommended by the Arch Wiki (i.e. remove irqpoll nr_cpus=1 reset_devices):
$ kexec -l /boot/vmlinuz-linux-lts --initrd=/boot/initramfs-linux-lts.img --append="root=UUID=<uuid> rw single"
$ kexec -e # execute other kernel
...
# System boots in single-user mode!
Since the above does work, I’m led to believe that my Framework laptop requires a very specific set of kernel parameters in order for kexec -p to work. In fact, googling for problems relating to kexec -p freezing or not working leads me to threads where people report that kexec -p only works on their machine when they provide specific parameters to --append. For example, one user reports that they need nr_cpus=4 disable_cpu_apicid=0 reset_devices.
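To test parameter combinations systematically rather than by hand, a small helper along these lines could print every candidate --append string. This is only a sketch: list_append_candidates is a made-up name, and the base string and optional parameters are simply the ones from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the base command line combined with every
# subset of the optional parameters, one candidate per line.
list_append_candidates() {
    base=$1
    shift
    n=$#
    total=$((1 << n))   # 2^n subsets
    mask=0
    while [ "$mask" -lt "$total" ]; do
        line=$base
        i=0
        for p in "$@"; do
            # Include parameter i when bit i of the mask is set.
            if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
                line="$line $p"
            fi
            i=$((i + 1))
        done
        echo "$line"
        mask=$((mask + 1))
    done
}

list_append_candidates "root=UUID=<uuid> rw single" irqpoll nr_cpus=1 reset_devices
```

Each printed line can then be passed to kexec -p via --append, one crash test at a time, which at least makes it easy to keep track of which combinations have already been tried.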
Moreover, I’ve tested this on an older Arch laptop (a Huawei machine with an Intel CPU), where simply running the following boots me into single user mode, as expected:
$ kexec -p /boot/vmlinuz-linux-lts --initrd=/boot/initramfs-linux-lts.img --append="root=UUID=<uuid> rw irqpoll nr_cpus=1 reset_devices single"
$ echo c > /proc/sysrq-trigger
This further strengthens my belief that the instructions in the Arch Wiki are more or less correct, and that kexec -p may require a specific set of kernel parameters in order to work on my Framework laptop.
Finally, the Valve-sponsored kdumpst project appends this to the command line: panic=-1 oops=panic fsck.mode=force fsck.repair=yes nr_cpus=1 reset_devices initcall_blacklist=drm_core_init module_blacklist=amdgpu,i915,nouveau. (Still doesn’t work, however.)
So, my question is: has anyone gotten kexec -p to work reliably? And if so, what kernel parameters did you use?
System info
- Distro: Arch (fully updated)
- Kernel: latest (6.11.2-arch1-1), LTS (6.6.54-1)
- BIOS: 3.05
- Laptop model: Framework 13, with AMD Ryzen 7 7840U + Radeon 780M Graphics
Moreover, I’ve verified that my kernels are compiled with these options:
CONFIG_DEBUG_INFO=y
CONFIG_CRASH_DUMP=y
CONFIG_PROC_VMCORE=y
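For anyone wanting to repeat this check on their own kernel, here is a minimal sketch. The function name check_kdump_config is made up, and the usage line assumes the running kernel exposes its configuration at /proc/config.gz.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel config on stdin and report whether
# each of the three options listed above is set to "y".
check_kdump_config() {
    config=$(cat)
    missing=0
    for opt in CONFIG_DEBUG_INFO CONFIG_CRASH_DUMP CONFIG_PROC_VMCORE; do
        # -x matches whole lines only, so CONFIG_DEBUG_INFO_DWARF5=y etc. do not count.
        if printf '%s\n' "$config" | grep -qx "$opt=y"; then
            echo "$opt=y"
        else
            echo "$opt is not set to y"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on a running system (assumes /proc/config.gz exists):
#   zcat /proc/config.gz | check_kdump_config
```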
[1] I’m using the LTS kernel in these examples, but the same observations apply to the latest stable kernel.