[RESPONDED] Hard resets running VMs on AMD 7840U

I have a Framework 13 with an AMD Ryzen 7 7840U. I’ve experienced hard resets while running Windows 11 in a VM on Linux using QEMU. It has happened twice so far, roughly a week apart.

The symptom is that the laptop just reboots. There is nothing in the logs.

I’m mostly leaving this here to see whether other people also have this problem.

Nothing in which logs? System logs? Kernel logs? User logs?

Can you list the QEMU arguments for us?

No, there is nothing in either log. My qemu args:

/run/libvirt/nix-emulators/qemu-system-x86_64 -name guest=win11,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-7-win11/master-key.aes"} -blockdev {"driver":"file","filename":"/run/libvirt/nix-ovmf/OVMF_CODE.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/win11_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"} -machine pc-q35-8.1,usb=off,vmport=off,dump-guest-core=off,memory-backend=pc.ram,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,hpet=off,acpi=on -accel kvm -cpu host,migratable=on,topoext=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x1fff -m size=12288000k -object {"qom-type":"memory-backend-ram","id":"pc.ram","size":12582912000} -overcommit mem-lock=off -smp 16,sockets=1,dies=1,cores=8,threads=2 -uuid 643e2df5-3eb4-4d72-ace6-79b5acde720a -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=34,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device {"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"} -device {"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"} -device {"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"} -device {"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"} -device 
{"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"} -device {"driver":"qemu-xhci","p2":15,"p3":15,"id":"usb","bus":"pci.2","addr":"0x0"} -device {"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.3","addr":"0x0"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/images/win11.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-2-storage","backing":null} -device {"driver":"ide-hd","bus":"ide.0","drive":"libvirt-2-format","id":"sata0-0-0","bootindex":1} -device {"driver":"ide-cd","bus":"ide.1","id":"sata0-0-1"} -netdev {"type":"tap","fd":"35","id":"hostnet0"} -device {"driver":"e1000e","netdev":"hostnet0","id":"net0","mac":"52:54:00:b8:06:1a","bus":"pci.1","addr":"0x0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -chardev spicevmc,id=charchannel0,name=vdagent -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"com.redhat.spice.0"} -chardev socket,id=chrtpm,path=/run/libvirt/qemu/swtpm/7-win11-swtpm.sock -tpmdev emulator,id=tpm-tpm0,chardev=chrtpm -device {"driver":"tpm-crb","tpmdev":"tpm-tpm0","id":"tpm0"} -device {"driver":"usb-tablet","id":"input0","bus":"usb.0","port":"1"} -audiodev {"id":"audio1","driver":"spice"} -spice port=0,disable-ticketing=on,image-compression=off,seamless-migration=on -device {"driver":"qxl-vga","id":"video0","max_outputs":1,"ram_size":67108864,"vram_size":67108864,"vram64_size_mb":0,"vgamem_mb":16,"bus":"pcie.0","addr":"0x1"} -device {"driver":"ich9-intel-hda","id":"sound0","bus":"pcie.0","addr":"0x1b"} -device {"driver":"hda-duplex","id":"sound0-codec0","bus":"sound0.0","cad":0,"audiodev":"audio1"} -global ICH9-LPC.noreboot=off -watchdog-action reset -chardev spicevmc,id=charredir0,name=usbredir -device 
{"driver":"usb-redir","chardev":"charredir0","id":"redir0","bus":"usb.0","port":"2"} -chardev spicevmc,id=charredir1,name=usbredir -device {"driver":"usb-redir","chardev":"charredir1","id":"redir1","bus":"usb.0","port":"3"} -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.4","addr":"0x0"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

That being said, I’m not certain that it’s related to QEMU/KVM, but the correlation from 2 occurrences is 100% so far.

It just happened again. Really weird. I’ll run without the VM for a while to see whether it’s related or just a coincidence.

Memtest86 looks good.

The kernel seems to find EFI pstore, but doesn’t log any crash info in there. I’ll try to check what other ways there are to debug this. Hints are welcome.
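For anyone who wants to double-check the pstore side on their own machine, here is a minimal sketch that lists whatever records the kernel has left behind. It assumes pstore is mounted at the usual `/sys/fs/pstore` location; the function name `list_pstore_records` is just made up for this example, and reading that directory normally needs root.

```python
from pathlib import Path


def list_pstore_records(pstore_dir="/sys/fs/pstore"):
    """Return (file name, size in bytes) for every record found in pstore.

    An empty list means the directory is missing or holds no records,
    i.e. nothing was saved by the previous boot.
    """
    root = Path(pstore_dir)
    if not root.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())


# Example (run as root on the affected machine):
# for name, size in list_pstore_records():
#     print(f"{name}: {size} bytes")
```

An empty result right after an unexpected reboot matches what is described above: the backend is registered, but the kernel never got the chance to write a record.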

To rule out some bad efi firmware can you run (a/the) VM without EFI?

In the guest? Not really. Windows 11 doesn’t support legacy boot. I also don’t see how it’s relevant.

Hi @Julian_Stecklina
If you can, post the “dmesg” after it reboots. There should be some messages in there saying why it rebooted without you asking it to.
It’s something called “mce”.

The system just resets. I’ve tried efi-pstore and ramoops to collect kernel logs, but to no avail. The kernel log saved by journald has no MCEs logged and just stops without any relevant messages.

If you can, post the “dmesg” after it reboots. There should be some messages in there saying why it rebooted without you asking it to.

How would the kernel know that if none of the pstore options work?

At this point, I don’t think that the kernel panics, but that the system just resets itself. Fatal MCE?

Not sure how to debug this on this client platform. Is there a way to enable ACPI ERST on the Framework? Or is there some persistent system error log?

The MCE does not use pstore.
Example output from “dmesg” after a reboot would be:
[ 0.107865] mce: [Hardware Error]: Machine check events logged
[ 0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
[ 0.107869] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086
[ 0.107872] mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1620583348 SOCKET 0 APIC 0 microcode e0
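To make it easier to check a saved log for lines like these, here is a rough sketch that scans dmesg text for the "Machine Check ... Bank" line and pulls out the CPU, bank and status word. The function name `find_mce_banks` is hypothetical, and the flag names assume the usual x86 MCi_STATUS layout for the top bits (63 = VAL down to 57 = PCC); treat the decoding as a hint, not as an authoritative interpretation.

```python
import re

# Matches lines such as:
#   mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
MCE_BANK_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)

# Assumed meaning of the top MCi_STATUS bits (common x86 layout).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def find_mce_banks(dmesg_text):
    """Return one dict per logged machine check bank found in dmesg_text."""
    results = []
    for line in dmesg_text.splitlines():
        match = MCE_BANK_RE.search(line)
        if match is None:
            continue
        status = int(match.group("status"), 16)
        flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
        results.append({
            "cpu": int(match.group("cpu")),
            "bank": int(match.group("bank")),
            "status": status,
            "flags": flags,
        })
    return results
```

Feeding it the output of `dmesg` or `journalctl -k -b -1` and getting an empty list back means no such line was logged for that boot.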

This is a log of the session that crashed: dmesg of crashed session - Pastebin.com

This is a log of the session afterwards (that’s still running): dmesg of next session - Pastebin.com

There are no MCEs logged. From the time difference you can also see that there were roughly 10min in which the crashed kernel had nothing to log at all.

This is using 6.8-rc6, but I see the same behavior on 6.7.x.

At this stage, anything I would have thought of has been asked here. This is a tough one. Feels like this may be best suited for a forum dedicated to VMs, as I don’t think this is hardware related or Linux specific. Feels like something is “off” with how the config is being interpreted.

I don’t think it has anything to do with the VM itself. Nothing I do in the VM should hard reset the host, regardless of how QEMU is configured. :sweat_smile: At this point it could be anything: Linux being very broken, or hardware/CPU issues. That’s why I wanted to reach out and see whether anyone else is experiencing this.

The same VM runs fine on an Intel system with roughly the same configuration of the host.

I’ve seen similar issues on server systems but they are typically rare and there is much better support for debugging them.

Any form of system error log (SEL) on the laptop would be really amazing to debug this.

If you have more physical machines, perhaps you could try setting up netconsole – i.e. pushing kernel logs over network to another machine, so that if the system crashes, you have a better chance catching the last messages (as compared to relying on the dmesg log reaching the hard drive).
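On the receiving machine, something as small as the sketch below should be enough to catch the messages. It assumes netconsole on the crashing laptop is pointed at this machine over UDP and at the default target port 6666; the names `open_listener` and `capture` are invented for this example.

```python
import socket
import sys


def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket on the port netconsole is configured to send to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out=sys.stdout, max_packets=None):
    """Write every received datagram to `out`, flushing after each one.

    Flushing immediately matters here: the sender may die at any moment,
    so nothing should sit in a buffer. Runs forever unless max_packets is set.
    """
    received = 0
    while max_packets is None or received < max_packets:
        data, (addr, _port) = sock.recvfrom(65535)
        out.write(f"[{addr}] {data.decode('utf-8', errors='replace')}")
        out.flush()
        received += 1
    return received


# Example: log everything to a file until interrupted.
# with open("netconsole.log", "a") as log:
#     capture(open_listener(), log)
```

Since this is plain UDP, messages can still get lost, and a hard reset that gives the kernel no time to print anything will leave the log empty no matter how the receiver is set up.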

Hi @Julian_Stecklina
I agree. No MCE in there.
Maybe enable the CPU mitigations:
Currently the kernel command line has “mitigations=off” in it.
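A quick way to confirm what the running kernel is actually using is sketched below. It assumes the standard `/proc/cmdline` and `/sys/devices/system/cpu/vulnerabilities` locations; the helper names are made up for illustration.

```python
from pathlib import Path


def mitigations_setting(cmdline):
    """Return the value of `mitigations=` from a kernel command line, or None.

    If the parameter appears more than once, the last occurrence is returned,
    on the assumption that later parameters override earlier ones.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("mitigations="):
            value = token.split("=", 1)[1]
    return value


def read_vulnerabilities(sysfs_dir="/sys/devices/system/cpu/vulnerabilities"):
    """Map each vulnerability name to the status string the kernel reports."""
    root = Path(sysfs_dir)
    if not root.is_dir():
        return {}
    return {
        p.name: p.read_text().strip()
        for p in sorted(root.iterdir())
        if p.is_file()
    }


# Example on the affected machine:
# print(mitigations_setting(Path("/proc/cmdline").read_text()))
# for name, status in read_vulnerabilities().items():
#     print(f"{name}: {status}")
```

After removing the parameter and rebooting, the first call should return `None`, and the per-vulnerability status strings show what the kernel reports for each issue.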

It still happens to me and netconsole hasn’t managed to capture any logs. It still seems related to virtualization somehow. So weird.

I wonder if anyone else is seeing this.