Surprise crash under high CPU load

James3 · July 25, 2024, 2:52pm

@Jeff_Trull
That MCE line is normal. If there were a real problem there, you would see a lot more MCE messages.
So, from the MCE point of view, you look OK.

One way to get better log output is network logging: configure syslog to also send its logs to another Linux box that is running syslog. You need a physical network cable for this; it does not work over wifi. Then, if the machine crashes, at least the latest logs are preserved on the other box and not lost in the crash.
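To illustrate the idea, here is a minimal Python sketch that sends log records to a remote syslog listener over UDP, using the standard library's `SysLogHandler`. It is not the system-wide setup described above (that is done in the syslog daemon's own configuration), only a small demonstration of logs leaving the machine as they are written. The logger name `crash-watch`, the helper `make_remote_logger`, and the default target `127.0.0.1` are assumptions for illustration; port 514 is the conventional syslog UDP port.

```python
import logging
import logging.handlers
import socket
import sys


def make_remote_logger(host, port=514, name="crash-watch"):
    """Return a logger that sends each record to a syslog server over UDP.

    `name` is a hypothetical tag; pick whatever makes the lines easy to find.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_USER,
        socktype=socket.SOCK_DGRAM,  # UDP: fire and forget, no connection needed
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Pass the address of the receiving box as the first argument;
    # 127.0.0.1 is only a placeholder default.
    target = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    log = make_remote_logger(target)
    log.info("heartbeat: still alive")
```

Because UDP is unacknowledged, a message can be lost without any error on the sending side, so it is worth checking on the receiving box that lines actually arrive before relying on this.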

Another possible cause of the crash might be overheating.
This is complicated to diagnose.
The only thing that really helps here is to run a daemon that periodically logs all the temperatures in the run-up to the problem. After a crash, you can then look through the logs for peaks in the temperatures.
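A minimal sketch of such a temperature logger in Python, assuming a Linux system that exposes its sensors under `/sys/class/thermal/thermal_zone*/` (the `temp` file there is in millidegrees Celsius). The log file name `temps.log`, the 10-second interval, the 90 °C threshold, and all function names are assumptions for illustration. Each sample is flushed and fsynced so the last readings have a better chance of surviving a sudden power-off.

```python
import glob
import os
import time

THERMAL_ROOT = "/sys/class/thermal"  # standard Linux sysfs location


def read_temps(root=THERMAL_ROOT):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(root, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                kind = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read right now
        temps[f"{os.path.basename(zone)}:{kind}"] = millideg / 1000.0
    return temps


def append_sample(log_path, temps, now=None):
    """Append one tab-separated line per zone and force it to disk."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    with open(log_path, "a") as f:
        for name, deg in sorted(temps.items()):
            f.write(f"{stamp}\t{name}\t{deg:.1f}\n")
        f.flush()
        os.fsync(f.fileno())  # do not leave the sample in a write cache


def find_peaks(log_path, threshold_c):
    """Return (timestamp, zone, temperature) for readings at or above the threshold."""
    peaks = []
    with open(log_path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # ignore truncated lines, e.g. from a crash mid-write
            stamp, name, deg = parts
            if float(deg) >= threshold_c:
                peaks.append((stamp, name, float(deg)))
    return peaks


if __name__ == "__main__":
    while True:
        append_sample("temps.log", read_temps())
        time.sleep(10)
```

After a crash, something like `find_peaks("temps.log", 90.0)` lists every reading at or above 90 °C, which can then be compared against the time of the crash.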

Jeff_Trull · July 25, 2024, 3:42pm

Overheating was my original guess but since that time it has happened with the CPU idle or nearly so. I am currently leaning toward a memory issue - Framework support has me swapping DIMMs in and out and between slots. Unfortunately crashes are rare enough that any given experiment takes a week or so - but common (and catastrophic) enough to be concerning. I am hopeful that I can isolate it to a faulty DIMM - so far there is one DIMM that was always present in the system during a crash.

James3 · July 26, 2024, 1:11am

If you suspect memory, then run “memtest”.

Jeff_Trull · July 26, 2024, 4:34am

Yes, memtest was one of my first moves. I told support that and they didn’t seem to care, so now I’m on what they call the “RAM shuffle” - moving the DIMMs around and seeing if that changes anything.

Khalid · August 22, 2024, 5:33am

Have you been able to solve your issue?
Or have you given up?

Would be great to know, thanks.

Jeff_Trull · August 22, 2024, 5:05pm

I wasn’t able to identify a problem DIMM, and in the meantime I got a SMART (“drive will soon fail”) warning from the 1TB NVMe drive. It seemed like the kind of thing that could explain the problem (maybe?). So I switched drives, and so far, no crashes. I’ll close this topic if two weeks pass without a crash.

Khalid · August 22, 2024, 5:33pm

I wish you good luck!
I also replaced my SSD in the meantime, unfortunately without success - it is still crashing.
But I should also mention that I never got such a warning, so I hope you are more successful!

Jeff_Trull · August 28, 2024, 6:02am

If anyone is interested, my issue is likely resolved - no surprise shutdowns in several weeks. I have two theories about the source:

1. A bad NVMe drive somehow took the system down with it. After I took it offline (Framework supplied a replacement) the crash didn’t recur.
2. A foreign object in the wifi module area. Framework sent me a replacement mainboard (for a different problem!), and during the installation I discovered an extra copy of the plastic piece that normally covers the top of the wifi module (and keeps the antenna cables connected) sitting underneath the module. There was one such piece already on the module, for a total of two. You can see what I’m talking about here. Could it have been shorting something occasionally?

One thing I’m pretty sure of is it wasn’t the memory. I never found any correlation between DIMMs and crashes, and memtest was passing from the very beginning.

Khalid · August 31, 2024, 10:01am

Great to hear!