[RESPONDED] Hard resets running VMs on AMD 7640U

I have a Framework 13 with AMD Ryzen 7 7840U. I’ve experienced hard resets running Windows 11 in a VM on Linux using QEMU. It has happened twice so far, roughly a week apart.

The symptom is that the laptop just reboots. There is nothing in the logs.

I’m mostly leaving this here to see whether other people also have this problem.

Nothing in which logs? System logs? Kernel logs? User logs?

Can you list us the QEMU arguments?

No, there is nothing in either log. My qemu args:

/run/libvirt/nix-emulators/qemu-system-x86_64 -name guest=win11,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-7-win11/master-key.aes"} -blockdev {"driver":"file","filename":"/run/libvirt/nix-ovmf/OVMF_CODE.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/win11_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"} -machine pc-q35-8.1,usb=off,vmport=off,dump-guest-core=off,memory-backend=pc.ram,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,hpet=off,acpi=on -accel kvm -cpu host,migratable=on,topoext=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x1fff -m size=12288000k -object {"qom-type":"memory-backend-ram","id":"pc.ram","size":12582912000} -overcommit mem-lock=off -smp 16,sockets=1,dies=1,cores=8,threads=2 -uuid 643e2df5-3eb4-4d72-ace6-79b5acde720a -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=34,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device {"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"} -device {"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"} -device {"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"} -device {"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"} -device 
{"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"} -device {"driver":"qemu-xhci","p2":15,"p3":15,"id":"usb","bus":"pci.2","addr":"0x0"} -device {"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.3","addr":"0x0"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/images/win11.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-2-storage","backing":null} -device {"driver":"ide-hd","bus":"ide.0","drive":"libvirt-2-format","id":"sata0-0-0","bootindex":1} -device {"driver":"ide-cd","bus":"ide.1","id":"sata0-0-1"} -netdev {"type":"tap","fd":"35","id":"hostnet0"} -device {"driver":"e1000e","netdev":"hostnet0","id":"net0","mac":"52:54:00:b8:06:1a","bus":"pci.1","addr":"0x0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -chardev spicevmc,id=charchannel0,name=vdagent -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"com.redhat.spice.0"} -chardev socket,id=chrtpm,path=/run/libvirt/qemu/swtpm/7-win11-swtpm.sock -tpmdev emulator,id=tpm-tpm0,chardev=chrtpm -device {"driver":"tpm-crb","tpmdev":"tpm-tpm0","id":"tpm0"} -device {"driver":"usb-tablet","id":"input0","bus":"usb.0","port":"1"} -audiodev {"id":"audio1","driver":"spice"} -spice port=0,disable-ticketing=on,image-compression=off,seamless-migration=on -device {"driver":"qxl-vga","id":"video0","max_outputs":1,"ram_size":67108864,"vram_size":67108864,"vram64_size_mb":0,"vgamem_mb":16,"bus":"pcie.0","addr":"0x1"} -device {"driver":"ich9-intel-hda","id":"sound0","bus":"pcie.0","addr":"0x1b"} -device {"driver":"hda-duplex","id":"sound0-codec0","bus":"sound0.0","cad":0,"audiodev":"audio1"} -global ICH9-LPC.noreboot=off -watchdog-action reset -chardev spicevmc,id=charredir0,name=usbredir -device 
{"driver":"usb-redir","chardev":"charredir0","id":"redir0","bus":"usb.0","port":"2"} -chardev spicevmc,id=charredir1,name=usbredir -device {"driver":"usb-redir","chardev":"charredir1","id":"redir1","bus":"usb.0","port":"3"} -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.4","addr":"0x0"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

That being said, I’m not certain that it’s related to QEMU/KVM, but the correlation from 2 occurrences is 100% so far.

It just happened again. Really weird. I’ll run without the VM for a while to see whether it’s related or just a coincidence.

Memtest86 looks good.

The kernel seems to find EFI pstore, but doesn’t log any crash info in there. I’ll try to check what other ways there are to debug this. Hints are welcome.
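For anyone wanting to check the same thing, here is a minimal Python sketch that lists whatever records the kernel left in pstore after a reboot. It assumes pstore is mounted at the usual `/sys/fs/pstore` location; the function name is just illustrative, and reading that directory normally needs root.

```python
import os

PSTORE_DIR = "/sys/fs/pstore"  # usual pstore mount point (assumption for this sketch)

def list_pstore_records(path=PSTORE_DIR):
    """Return (name, size_in_bytes) for every record file in the pstore directory.

    Returns an empty list if the directory does not exist or holds no files.
    """
    if not os.path.isdir(path):
        return []
    records = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            records.append((name, os.path.getsize(full)))
    return records
```

An empty result after a crash means no backend (efi-pstore, ramoops, ...) managed to save anything, which would fit a reset that happens without the kernel getting a chance to panic.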

To rule out bad EFI firmware, can you run the VM (or any VM) without EFI?

In the guest? Not really. Windows 11 doesn’t support legacy boot. I also don’t see how it’s relevant.

Hi @Julian_Stecklina
If you can, post the “dmesg” after it reboots. There should be some messages in there saying why it rebooted without you asking it to.
It’s something called “mce”.

The system just resets. I’ve tried efi-pstore and ramoops to collect kernel logs, but to no avail. The kernel log saved by journald has no MCEs logged and just stops without any relevant messages.

If you can, post the “dmesg” after it reboots. There should be some messages in there saying why it rebooted without you asking it to.

How would the kernel know that if none of the pstore options work?

At this point, I don’t think that the kernel panics, but that the system just resets itself. Fatal MCE?

Not sure how to debug this on this client platform. Is there a way to enable ACPI ERST on the Framework? Or is there some persistent system error log?

The MCE does not use pstore.
Example output from “dmesg” after a reboot would be:
[ 0.107865] mce: [Hardware Error]: Machine check events logged
[ 0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
[ 0.107869] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086
[ 0.107872] mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1620583348 SOCKET 0 APIC 0 microcode e0
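
If it helps, lines like these can be pulled out of a saved dmesg with a short Python sketch. The regex is written against the example format above only, and the status bit names (VAL, OVER, UC, EN, MISCV, ADDRV, PCC for bits 63 down to 57) are how I remember the x86 machine check architecture, so treat the decoding as an assumption and check it against the vendor documentation.

```python
import re

# Matches kernel lines such as:
#   mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ "
    r"Bank (\d+): ([0-9a-fA-F]+)"
)

# Top bits of the bank status register (assumed layout, see note above).
STATUS_BITS = {
    63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
    59: "MISCV", 58: "ADDRV", 57: "PCC",
}

def find_mce_events(dmesg_text):
    """Return one dict per machine check line found in the dmesg text."""
    events = []
    for match in MCE_RE.finditer(dmesg_text):
        status = int(match.group(3), 16)
        flags = [name for bit, name in STATUS_BITS.items() if (status >> bit) & 1]
        events.append({
            "cpu": int(match.group(1)),
            "bank": int(match.group(2)),
            "status": status,
            "flags": flags,
        })
    return events
```

Feeding it the output of `dmesg` from the boot after the reset gives an empty list if no machine check was logged.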

This is a log of the session that crashed: dmesg of crashed session - Pastebin.com

This is a log of the session afterwards (that’s still running): dmesg of next session - Pastebin.com

There are no MCEs logged. From the time difference you can also see that there were roughly 10min in which the crashed kernel had nothing to log at all.

This is using 6.8-rc6, but I see the same behavior on 6.7.x.

At this stage, anything I would have thought of has been asked here. This is a tough one. Feels like this may be best suited for a forum dedicated to VMs, as I don’t think this is hardware related or Linux specific. Feels like something is “off” with how the config is being interpreted.

I don’t think it has anything to do with the VM itself. Nothing I do in the VM should hard reset the host! Regardless of how qemu is configured. :sweat_smile: At this point it can be anything, Linux being very broken, hardware/CPU issues. That’s why I wanted to reach out and see whether anyone else is experiencing this.

The same VM runs fine on an Intel system with roughly the same configuration of the host.

I’ve seen similar issues on server systems but they are typically rare and there is much better support for debugging them.

Any form of system error log (SEL) on the laptop would be really amazing to debug this.

If you have more physical machines, perhaps you could try setting up netconsole – i.e. pushing kernel logs over network to another machine, so that if the system crashes, you have a better chance catching the last messages (as compared to relying on the dmesg log reaching the hard drive).
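
On the receiving machine, something as small as the following Python sketch is enough to collect the messages. It assumes the crashing machine's netconsole is configured to send to this host on UDP port 6666 (netconsole's default target port); the function names and the log path you pass in are illustrative.

```python
import socket
from datetime import datetime, timezone

NETCONSOLE_PORT = 6666  # netconsole's default target port

def format_datagram(data, sender_ip, now=None):
    """Turn one received UDP payload into a timestamped log line."""
    now = now or datetime.now(timezone.utc)
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{now.isoformat()} {sender_ip} {text}"

def receive_forever(log_path, port=NETCONSOLE_PORT):
    """Append every incoming netconsole message to log_path until interrupted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    # Line buffering so each message reaches the file as soon as it arrives.
    with open(log_path, "a", buffering=1) as log:
        while True:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            log.write(format_datagram(data, sender_ip) + "\n")
```

Calling `receive_forever("netconsole.log")` and leaving it running gives a timestamped record, so you can at least see how long the kernel was silent before the reset.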

Hi @Julian_Stecklina
I agree. No MCE in there.
Maybe enable the CPU mitigations.
Currently the kernel command line has “mitigations=off” in it.
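
To double-check what the running kernel was actually booted with, a small Python sketch like this reads the command line and reports the `mitigations=` value. It assumes that if the parameter appears more than once, the last occurrence is the one that counts; the function names are illustrative.

```python
def mitigations_setting(cmdline):
    """Return the value of the last mitigations= parameter, or None if absent."""
    value = None
    for token in cmdline.split():
        if token.startswith("mitigations="):
            value = token.split("=", 1)[1]
    return value

def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system."""
    with open(path) as f:
        return f.read()
```

`mitigations_setting(read_cmdline())` returning `None` means the parameter is not set and the kernel is using its defaults.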

It still happens to me and netconsole hasn’t managed to capture any logs. It still seems related to virtualization somehow. So weird.

I wonder if anyone else is seeing this.

I’m seeing this now. I’ve had hard resets with no logs and no warning while running Win11 in a VM, twice tonight.

For what it’s worth, I was downloading an ISO via Rufus in Win11 to make a windows 11 bootdisk both times.

I had a 16GB image as a virtual, removable USB drive selected in Rufus.

What CPU model are you using for passthrough to the VM? If you are using host-passthrough, it’s likely that there are exposed MSRs which Windows is potentially mucking around with once it boots.

I would change the cpu in the libvirt definition to something which isn’t host-passthrough and see if you can replicate it.

Hello all,

Weird behavior I’ve just noticed:

I’m running a Garuda Linux VM on NixOS and starting a Windows 11 VM inside it. This causes the laptop to shut down abruptly a few minutes after the Windows VM has booted into the installer.

I have no logs on NixOS (host system) and I’m plugged to AC (+ battery fully charged).

I think nested virtualization with a Windows VM at level 2 (with a Linux VM at level 1) may specifically be causing the issue, but I may be wrong. I would like to know if anyone else is running into this.

Currently running the old BIOS 3.03; I will update and try to reproduce on BIOS 3.05. Have any other users attempted nested virtualization and run into similar issues?

Framework specs: 64GB DDR5 RAM (official RAM sticks from Framework), Ryzen 7 7840U, Samsung Evo 840 2TB SSD, Linux kernel (LTS) 6.1.90

Please change the CPU model to something other than host / host-passthrough, e.g. kvm64. I can almost guarantee there is a feature exposed by host / host-passthrough that Windows is mangling. If you start with kvm64 and then gradually expose individual features of the host model as feature flags on top of the base kvm64 set, you can probably bisect it. This is MUCH easier to do with a libvirt XML than with one-shot invocations from the CLI, so I recommend converting your one-shot into a libvirt XML definition.
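
A rough Python sketch of that bisection idea, under some loud assumptions: exactly one feature is responsible, the reset reproduces reliably enough to give a yes/no answer per run, and `reproduces` is a hypothetical callback (in practice, you running the VM for a while and reporting what happened). `cpu_xml` only builds the `<cpu>` element to paste into the domain XML; it does not talk to libvirt.

```python
def cpu_xml(features, base_model="kvm64"):
    """Build a libvirt <cpu> element: base model plus explicitly required features."""
    lines = [
        "<cpu mode='custom' match='exact'>",
        f"  <model>{base_model}</model>",
    ]
    lines += [f"  <feature policy='require' name='{name}'/>" for name in features]
    lines.append("</cpu>")
    return "\n".join(lines)

def bisect_features(features, reproduces):
    """Narrow a candidate list down to one feature by halving.

    reproduces(subset) must return True if the reset occurs with only
    `subset` enabled on top of the base model. Assumes a single culprit.
    """
    candidates = list(features)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if reproduces(half):
            candidates = half              # culprit is in the tested half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0] if candidates else None
```

For each round you would paste `cpu_xml(half)` into the domain definition, run the VM, and answer the callback by hand. With resets that are roughly a week apart, each round is slow, so a trigger that reproduces faster makes this far more practical.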

Will try and report in a week or two, thanks !