Surprise crash under high CPU load

Hi all, I just experienced my first hard crash on my FW16 (Ubuntu). I was running Firefox, Emacs and a “make -j 6” on a big job when suddenly there was a loud buzzing sound from the speaker and the display shut off. The fans were very loud, then slowly wound down until, 1m 45s after the crash (I can see the gap in the logs), it rebooted without my assistance. Quite odd. Has anyone else observed this? Is there something I can do to prepare in case it happens again?

Nope. Didn’t see that, and I stress tested it quite a lot.
Maybe “journalctl -b 1” shows what caused it to crash/restart.

journalctl -b 1 shows a log from a few weeks back. There are no clues in syslog or kernel.log other than the 1m 45s pause followed by a boot sequence.
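
A note on the boot offset, since it probably explains the weeks-old log: as far as I know, journalctl counts positive offsets from the oldest boot it has stored, so “-b 1” is the first boot in the journal, while “-b -1” is the boot before the current one. A minimal sketch (the function name prev_boot_tail is made up for illustration, and this assumes the journal is stored persistently, which I believe is the Ubuntu default):

```shell
# Hypothetical helper: show the last lines logged by the previous boot.
#   -b -1  selects the boot before the current one (-b 0 is the current boot)
#   -n N   limits output to the last N lines
prev_boot_tail() {
    lines="${1:-200}"
    journalctl -b -1 -n "$lines" --no-pager
}

# Example: prev_boot_tail 50
# "journalctl --list-boots" prints every stored boot with its offset.
```

After a hard power-off the last few messages may never have reached the disk, so an empty tail does not rule anything out.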

Given the fan activity, I suspect it overheated and the BIOS (or something like that) took over.

I don’t think the BIOS takes over.
Try monitoring the temperature in a window before doing that again and check whether it climbs that fast.
On my system, the fans spin up and become loud, but it does not crash.

My overheating theory is that some temperature sensor hit the “emergency shutdown” point, which would explain why there’s no record of it in the log (aside from the mysterious 1:45). It’s not reproducible and only happened one time. If I were the EE involved in figuring out the recovery path I would have a special circuit (or more likely, some kind of service processor) run the fans until it was safe to restart.

I’ve added a temperature sensor applet to my menu bar and if I think of it in time I’ll glance over but I suspect that the time between overheat shutdown and the display blanking will be too short for me to catch it. Who knows, maybe it won’t happen again!
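
Since an applet is easy to miss, one option is to log temperatures to disk so that the last reading before a crash survives the reboot. A rough sketch, assuming the kernel exposes the sensors under /sys/class/thermal with values in thousandths of a degree Celsius; the function name log_temps and the log file name are made up:

```shell
# Hypothetical helper: print one line per thermal zone, e.g.
#   14:02:31 acpitz 47C
# The optional argument overrides the sysfs directory (assumed default below).
log_temps() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones refuse reads while their device is idle; skip those.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        name=$(cat "$zone/type" 2>/dev/null || echo unknown)
        printf '%s %s %sC\n' "$(date +%T)" "$name" "$((milli / 1000))"
    done
}

# Example: append a reading every second and flush it to disk each time.
#   while :; do log_temps >> "$HOME/temps.log"; sync; sleep 1; done
```

Start the loop in a spare terminal before the next big build; if the machine goes down again, the end of the log shows how hot things were a second or so before the gap.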

Usually when a sensor hits its limit, the Linux kernel mentions it in the ring buffer.

So IMHO this is something else. I’ve been building my own computers for 34+ years, and usually when you saw nothing on the OS side it was a hardware issue.
When you saw something on the OS side, it was a manufacturing problem (a bad cable, a card loose in its slot, a cooler not mounted correctly somewhere, and also driver issues).
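
To check the ring-buffer point, the kernel messages from the boot that crashed can be filtered for anything thermal. A rough sketch; the function name and the keyword list are my own guesses, not an official set of messages, and it assumes the journal kept the previous boot:

```shell
# Hypothetical helper: search the previous boot's kernel log for
# thermal or machine-check messages.
#   -k     kernel messages only (what dmesg would have shown)
#   -b -1  the boot before the current one
prev_boot_thermal() {
    journalctl -k -b -1 --no-pager |
        grep -iE 'thermal|temperature|throttl|machine check|mce'
}

# Example: prev_boot_thermal || echo "no thermal messages were logged"
```

No output fits the "nothing on the OS side" case, although a sudden power cut can also lose the last messages before they are written.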

My best bet here is memory. If you have 2 RAM modules, take one out and run a thorough memory stress test, then swap in the other stick and do the same.
Because the memory slot/RAM module combination can also play a role, repeat both tests in the other RAM slot. Only then can you be sure.
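
For each stick/slot combination, a quick userspace pass can be done with memtester, assuming it is installed (on Ubuntu it is in the memtester package). This is only a sketch: the wrapper name test_ram is made up, and a userspace tool cannot test the memory the kernel itself is using, so booting memtest86+ is the more thorough option.

```shell
# Hypothetical wrapper: test SIZE of RAM for LOOPS passes (default 1).
# memtester takes the amount to test (suffix B, K, M or G) and a loop count;
# running it as root lets it lock the pages it tests.
test_ram() {
    size="${1:?usage: test_ram SIZE [LOOPS]}"
    loops="${2:-1}"
    sudo memtester "$size" "$loops"
}

# Example: test_ram 8G 2
```

Pick a size somewhat below the free memory reported by "free -h", otherwise the test itself can push the system into swapping.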