Has anyone managed to get kexec -p working?

Issue

Background

I’m experiencing frequent and severe system freezes, which I’m trying to debug. By severe freezes, I mean that the system becomes completely unresponsive, displays a glitched-out frame, and requires a hard reset. I strongly suspect these freezes are GPU-related, since they happen mostly when I’m taxing my GPU. For example, my system might freeze after 5 minutes of playing a 3D game or watching a (hw-accelerated) video, especially when doing so on a 4K monitor.

FTR: this has only been happening for the last few months or so, and happens on linux-firmware-git, as well as multiple kernels (at least 6.10, 6.11.2, and the latest LTS kernel).

This issue really deserves a separate thread, however, and is not the main topic here.

Given my symptoms, I suspect either a kernel panic, a hardware fault, or a panic triggered by a hardware fault. The exact culprit is difficult to determine, however, since my crashed kernel seems unable to save any (dmesg) logs whenever the panic is triggered.

The problem: getting kexec to work after triggering a crash

I’m led to believe that in these situations, people often resort to kdump+kexec. That way, one can perform post-mortem analysis on the crashed kernel, and hopefully gather some dmesg output (since dmesg output may still exist in RAM, I guess?). But please correct me if I’m wrong!
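For reference, this is roughly what I would expect to run inside the capture kernel once it boots. It is only a sketch, assuming the crashed kernel's memory is exposed as /proc/vmcore and that the vmcore-dmesg tool shipped with kexec-tools is installed. The save_crash_dmesg function name and the /var/crash output path are my own choices for illustration.

```shell
#!/bin/sh
# Extract the crashed kernel's log buffer from a vmcore.
# Arguments (both optional, defaults are assumptions for illustration):
#   $1  path to the vmcore (default: /proc/vmcore)
#   $2  output file        (default: /var/crash/dmesg.txt)
save_crash_dmesg() {
    vmcore="${1:-/proc/vmcore}"
    out="${2:-/var/crash/dmesg.txt}"

    # /proc/vmcore only exists in a kernel booted via kexec -p after a crash.
    if [ ! -r "$vmcore" ]; then
        echo "no vmcore at $vmcore (not running in the capture kernel?)"
        return 1
    fi

    mkdir -p "$(dirname "$out")" &&
        vmcore-dmesg "$vmcore" > "$out" &&
        echo "saved dmesg to $out"
}

# Usage, from the capture kernel's single-user shell:
#   save_crash_dmesg
```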

My problem is that I’m unable to get kexec to boot the kdump kernel after triggering a crash, be it manually through echo c > /proc/sysrq-trigger or by taxing my GPU. According to the Arch Wiki and other references, setting up kexec should require just a few steps:[1]

# add the crashkernel= parameter
$ efibootmgr --create --disk /dev/nvme0n1 --loader /vmlinuz-linux-lts -b 0 --unicode "root=UUID=<uuid> initrd=\initramfs-linux-lts.img crashkernel=1G" 
$ reboot
# load the panic (kdump) kernel; after a crash it should boot into single-user mode and ask devices to reset themselves:
$ kexec -p /boot/vmlinuz-linux-lts --initrd=/boot/initramfs-linux-lts.img --append="root=UUID=<uuid> irqpoll nr_cpus=1 reset_devices single"
$ sync; echo c > /proc/sysrq-trigger # trigger a manual crash (requires sysrq keys to be enabled)
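Before triggering the crash, it may be worth confirming that memory was actually reserved and that the panic kernel is loaded. A minimal sketch, assuming the standard sysfs files /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size. The kdump_status function name and its optional root-prefix argument are hypothetical, added only so the check can be run against a fake directory tree.

```shell
#!/bin/sh
# Report whether a crash (panic) kernel is reserved and loaded.
# $1 is an optional filesystem root prefix (hypothetical, for testing only).
kdump_status() {
    root="${1:-}"
    cmdline="$root/proc/cmdline"
    loaded="$root/sys/kernel/kexec_crash_loaded"
    size="$root/sys/kernel/kexec_crash_size"

    # Was crashkernel= passed to the running kernel?
    if grep -q 'crashkernel=' "$cmdline" 2>/dev/null; then
        echo "cmdline: crashkernel= present"
    else
        echo "cmdline: crashkernel= missing"
    fi

    # How many bytes did the kernel reserve? 0 means the reservation failed.
    if [ -r "$size" ]; then
        echo "reserved bytes: $(cat "$size")"
    else
        echo "reserved bytes: unknown"
    fi

    # 1 means kexec -p succeeded and a panic kernel is staged.
    if [ "$(cat "$loaded" 2>/dev/null)" = "1" ]; then
        echo "crash kernel: loaded"
    else
        echo "crash kernel: not loaded"
    fi
}

kdump_status
```

If the reserved size is 0 or the crash kernel shows as not loaded, the panic has nothing to jump to, which could look like a plain freeze or an ordinary reboot.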

However, following these steps results in one of two outcomes, depending on which permutation of kernel parameters I use. The system either freezes and then requires a hard reset, or it reboots normally but neither loads my kdump kernel nor enters single-user mode, as I specified. I should stress that this happens with any combination of the kernel parameters listed above (e.g. without the nr_cpus=1 parameter).

If I just load a kernel with kexec -l and subsequently execute it with kexec -e, everything seems to work fine. Note, however, that this is only the case if I strip the parameters recommended by the Arch Wiki (i.e. remove irqpoll nr_cpus=1 reset_devices):

$ kexec -l /boot/vmlinuz-linux-lts --initrd=/boot/initramfs-linux-lts.img --append="root=UUID=<uuid> rw single"
$ kexec -e # execute other kernel
...
# System boots in single-user mode!

Since the above does work, I’m led to believe that my Framework laptop requires a very specific set of kernel parameters in order for kexec -p to work. In fact, googling for problems relating to kexec -p freezing or not working leads me to threads where people report that kexec -p only works on their machine when they provide specific parameters to --append. For example, one user reports that they need nr_cpus=4 disable_cpu_apicid=0 reset_devices.

Moreover, I’ve tested this on an older Arch laptop (a Huawei machine with an Intel CPU), where simply running the following boots me into single user mode, as expected:

$ kexec -p /boot/vmlinuz-linux-lts --initrd=/boot/initramfs-linux-lts.img --append="root=UUID=<uuid> rw irqpoll nr_cpus=1 reset_devices single"
$ echo c > /proc/sysrq-trigger

This further strengthens my belief that the instructions in the Arch Wiki are more or less correct, and that kexec -p may require a specific set of kernel parameters in order to work on my Framework laptop.

Finally, the Valve-sponsored kdumpst project appends this to the command line: panic=-1 oops=panic fsck.mode=force fsck.repair=yes nr_cpus=1 reset_devices initcall_blacklist=drm_core_init module_blacklist=amdgpu,i915,nouveau. (That still doesn’t work for me, however.)

So, my question is: has anyone gotten kexec -p to work reliably? And if so, what kernel parameters did you use?

System info

  • Distro: Arch (fully updated)
  • Kernel: latest (6.11.2-arch1-1), LTS (6.6.54-1)
  • BIOS: 3.05
  • Laptop model: Framework 13, with AMD Ryzen 7 7840U + Radeon 780M Graphics

Moreover, I’ve verified that my kernels are compiled with these options:

CONFIG_DEBUG_INFO=y
CONFIG_CRASH_DUMP=y
CONFIG_PROC_VMCORE=y
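For anyone who wants to compare, the options can be checked on a running kernel along these lines. This is a sketch, assuming the kernel exposes its configuration at /proc/config.gz (which needs CONFIG_IKCONFIG_PROC). The check_kconfig function name is my own.

```shell
#!/bin/sh
# Print the value of each requested option from a gzip-compressed kernel config.
#   $1      path to the compressed config (e.g. /proc/config.gz)
#   $2...   option names to look up
check_kconfig() {
    cfg="$1"
    shift
    for opt in "$@"; do
        # Match only lines that set the option, e.g. CONFIG_CRASH_DUMP=y
        line=$(gzip -dc "$cfg" 2>/dev/null | grep "^${opt}=" || true)
        if [ -n "$line" ]; then
            echo "$line"
        else
            echo "$opt: not set or config unavailable"
        fi
    done
    return 0
}

check_kconfig /proc/config.gz CONFIG_DEBUG_INFO CONFIG_CRASH_DUMP CONFIG_PROC_VMCORE
```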

  1. I’m using the LTS kernel in these examples, but the same observations apply to the latest stable kernel.