Surprise crash under high CPU load

Hi all, just experienced my first hard crash on FW16 (Ubuntu). Running Firefox, Emacs and a “make -j 6” on a big job, suddenly there was a loud buzzing sound from the speaker and the display shut off. Very high fan noise, slowly reducing over time until 1m 45s after the crash (I can see the gap in the logs) it rebooted without my assistance. Quite odd. Has anyone else observed this? Is there something I can do to prepare in case it happens again?

Nope. Didn’t see that, and I stress tested it quite a lot.
Maybe “journalctl -b 1” shows what caused it to crash/restart.

journalctl -b 1 shows a log from a few weeks back. There are no clues in syslog or kernel.log other than the 1m 45s pause followed by a boot sequence.
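
A note on the boot selector, since it probably explains the weeks-old log: per the journalctl man page, a positive offset counts from the oldest boot in the journal (so “-b 1” is the first boot on record), while “-b 0” is the current boot and “-b -1” is the one before it. A minimal Python sketch that builds the command line for a boot relative to the current one; the helper name journal_cmd is made up for illustration, and looking at older boots assumes journald keeps a persistent journal.

```python
def journal_cmd(boots_back=1, kernel_only=False, priority=None):
    """Build a journalctl command line for a boot relative to the current one.

    boots_back=0 is the current boot, 1 the previous boot, and so on.
    (journal_cmd is a hypothetical helper, not part of any library.)
    """
    if boots_back < 0:
        raise ValueError("boots_back must be >= 0")
    # journalctl takes "-b 0" for the current boot and "-b -N" for N boots ago.
    offset = "0" if boots_back == 0 else f"-{boots_back}"
    cmd = ["journalctl", "--no-pager", "-b", offset]
    if kernel_only:
        cmd.append("-k")          # kernel messages only
    if priority is not None:
        cmd += ["-p", priority]   # e.g. "warning" shows warnings and worse
    return cmd
```

To look at the tail of the boot that ended in the crash, something like subprocess.run(journal_cmd(1) + ["-n", "50"]) should do; “journalctl --list-boots” shows which offsets exist.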

Given the fan activity I suspect it overheated and BIOS (or something like that) took over

I don’t think the BIOS takes over.
Try monitoring the temperature in a window before doing that again and check if it gets up that fast.
On my system, the fans spin up and become loud, but it does not crash.

My overheating theory is that some temperature sensor hit the “emergency shutdown” point, which would explain why there’s no record of it in the log (aside from the mysterious 1:45). It’s not reproducible and only happened one time. If I were the EE involved in figuring out the recovery path I would have a special circuit (or more likely, some kind of service processor) run the fans until it was safe to restart.

I’ve added a temperature sensor applet to my menu bar and if I think of it in time I’ll glance over but I suspect that the time between overheat shutdown and the display blanking will be too short for me to catch it. Who knows, maybe it won’t happen again!
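
Since an applet is easy to miss, one alternative is to log the temperatures to a file and force every sample to disk, so the last reading before a crash is still there after the reboot. A rough Python sketch, assuming the kernel exposes thermal zones under /sys/class/thermal (values in millidegrees Celsius); the function names and the log path are made up for illustration, and which zones exist on the FW16 is not something this thread establishes.

```python
import glob
import os
import time

def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_label: degrees_C} for every readable thermal zone under base."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone missing or temporarily unreadable
        label = os.path.basename(zone)
        try:
            with open(os.path.join(zone, "type")) as f:
                label = f"{label}:{f.read().strip()}"
        except OSError:
            pass  # keep the bare zone name if "type" is unavailable
        temps[label] = millideg / 1000.0  # sysfs reports millidegrees Celsius
    return temps

def log_temps(path, interval_s=1.0, samples=None, base="/sys/class/thermal"):
    """Append one timestamped line per sample and force it to disk each time."""
    count = 0
    with open(path, "a") as out:
        while samples is None or count < samples:
            temps = read_zone_temps(base)
            fields = " ".join(f"{k}={v:.1f}" for k, v in temps.items())
            out.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
            out.flush()
            os.fsync(out.fileno())  # so the last reading survives a hard crash
            count += 1
            if samples is None or count < samples:
                time.sleep(interval_s)
```

Running log_temps("/var/tmp/temps.log") in a terminal before the next big build samples once per second until interrupted; after a crash, the last line of the file shows how hot things were just before it.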

Usually when a sensor trips, the Linux kernel mentions it in the ring buffer.

So IMHO this is something else. I have been building my own computers for 34+ years, and usually when you saw nothing on the OS side it was a hardware issue.
When you saw something on the OS side, it was an assembly problem (bad cable, card loose in its slot, cooler not mounted correctly, etc.) or a driver issue.

My best guess here is memory. If you have 2 RAM modules, take one out and run a thorough memory stress test, then swap in the other module and do the same.
Because the combination of memory slot and RAM module can also play a role, repeat this in the other RAM slot. Only then can you be sure.

Update: new crash but this time, no high CPU, just normal usage. No fan noise. Just a buzzing sound and the disappearance of video. There is a 2m 20s gap in syslog plus a small region of NUL characters. No other symptoms.
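
For anyone wanting to find such gaps and NUL regions without scrolling through the file by hand, here is a small Python sketch. It assumes each syslog line starts with either a classic “Jan  5 12:34:56” timestamp or an ISO 8601 one; the function names and the 60 second threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp styles: "Jan  5 12:34:56" (classic syslog) and
# "2024-01-05T12:34:56..." (ISO 8601). Other formats are simply skipped.
_CLASSIC = re.compile(rb"^([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2})")
_ISO = re.compile(rb"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
_MONTHS = {m: i + 1 for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

def parse_stamp(line, year=2000):
    """Return a datetime for a raw log line (bytes), or None if unparseable."""
    line = line.lstrip(b"\x00")  # a NUL run may be glued to the next real line
    try:
        m = _ISO.match(line)
        if m:
            return datetime.strptime(m.group(1).decode(), "%Y-%m-%dT%H:%M:%S")
        m = _CLASSIC.match(line)
        if m and m.group(1).decode() in _MONTHS:
            # Classic syslog lines carry no year, so a placeholder is used.
            return datetime(year, _MONTHS[m.group(1).decode()],
                            int(m.group(2)), int(m.group(3)),
                            int(m.group(4)), int(m.group(5)))
    except ValueError:
        pass  # impossible date or time, treat as unparseable
    return None

def scan_log(data, min_gap=timedelta(seconds=60), year=2000):
    """Scan raw log bytes and return (gaps, nul_lines).

    gaps is a list of (line_before, line_after, timedelta); nul_lines lists
    the 1-based numbers of lines that contain NUL bytes.
    """
    gaps, nul_lines = [], []
    prev_time = prev_no = None
    for no, line in enumerate(data.split(b"\n"), start=1):
        if b"\x00" in line:
            nul_lines.append(no)
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_no, no, stamp - prev_time))
        prev_time, prev_no = stamp, no
    return gaps, nul_lines
```

Reading the file in binary mode matters here, because the NUL bytes would otherwise get in the way: scan_log(open("/var/log/syslog", "rb").read()). A busy-to-idle transition can also produce a quiet minute, so a reported gap is only a hint unless it is followed by boot messages.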

My advice is to open up your device, and make sure all connections are seated right. Make sure Memory (especially) and NVMe disk connections are good. Verify all ribbon cables etc.

Of course, I hope you opened a support case for that.
There seems to be a hardware issue. It could come from memory or disk, where the system cannot write out the last bit of data and produces garbage at the end.

I have now submitted a Support request. I’ll run memtest tonight and try reseating everything tomorrow. Now up to 4 crashes :frowning:

memtest passes. I guess the nvme drive is a candidate for reseating.

What you can do, is configure journald to dump all the messages to a tty console.
This way, when it crashes you have the info on the screen (most of the time, you can still switch virtual consoles with Ctrl-Alt-F2 up to F8, depending on how many the distribution configures by default).
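
As a sketch of what that configuration could look like: journald.conf has ForwardToConsole=, TTYPath= and MaxLevelConsole= options, which can go in a drop-in file under /etc/systemd/journald.conf.d/. The Python helper below only renders the text of such a drop-in; the helper name, the choice of /dev/tty12 and the file name mentioned afterwards are assumptions for illustration.

```python
_LEVELS = {"emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"}

def journald_console_dropin(tty="/dev/tty12", max_level="debug"):
    """Return the text of a journald drop-in that forwards messages to a console.

    tty and max_level are illustrative defaults, not recommendations.
    """
    if max_level not in _LEVELS:
        raise ValueError(f"unknown log level: {max_level}")
    return (
        "[Journal]\n"
        "ForwardToConsole=yes\n"      # copy journal messages to a console
        f"TTYPath={tty}\n"            # which console device receives them
        f"MaxLevelConsole={max_level}\n"  # most verbose level to forward
    )
```

The output would be saved as root to something like /etc/systemd/journald.conf.d/console.conf, followed by “systemctl restart systemd-journald”. Keeping the chosen console in the foreground while working is what leaves the last messages visible if the display survives the crash.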

Hi Jeff_Trull,

hope you are well!
Did you resolve this issue (and how)?
I have had a very similar-looking issue for a few days, that’s why I am asking.

Thanks! :slight_smile:

I am still going through the analysis process, which involves removing both DIMMs, then trying each one independently in each slot. My crashes only occurred once per week on average, so it may take some time…

Just to rule out software: try a stress test like prime95 as a high load alternative. One never knows…

Also, to mitigate, check in BIOS if there are (more) options (left) to cap clocking. You didn’t specify the CPU; if it’s Intel you can turn off Turbo boost for example.

Which memory modules are you using, if I may ask :slight_smile:

The 32GB ones (x2) that came with my order.

There is something called a machine check exception.
If the CPU shuts itself down due to an MCE, on the next warm reboot the Linux kernel will collect the MCE log, which can sometimes give a hint as to what went wrong.
So, when it happens again, do a “journal -b 0” and then look for “mce” or “MCE” and post the output here.
If it happened on previous boots, you can look in “syslog” or “messages” for older MCEs.
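
A small Python sketch of that search, for filtering saved log text or the output of the kernel log for the current boot. The patterns are assumptions: machine check reports usually contain “mce:” and “[Hardware Error]”, but the exact wording varies with kernel version and CPU. The “In-kernel MCE decoding enabled” line is treated here as an informational banner rather than an error report, which is also an assumption.

```python
import re

# Assumed patterns for machine-check related kernel messages.
_MCE = re.compile(r"\bmce\b|machine check|hardware error", re.IGNORECASE)
# Informational line that, as far as I know, does not indicate an error.
_BENIGN = re.compile(r"In-kernel MCE decoding enabled", re.IGNORECASE)

def find_mce_lines(lines, skip_benign=True):
    """Return the log lines that look like machine check reports."""
    hits = []
    for line in lines:
        if not _MCE.search(line):
            continue
        if skip_benign and _BENIGN.search(line):
            continue
        hits.append(line.rstrip("\n"))
    return hits
```

For example, find_mce_lines(open("/var/log/syslog", errors="replace")) covers older boots, and the same function can be fed the captured output of the kernel log for the boot right after a crash.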

The only reference to mce in the logs is:

jet-framework-16 kernel: MCE: In-kernel MCE decoding enabled.

I don’t have journal and when I install it some user app comes up, so I think that’s not an Ubuntu thing.

FWIW I’ve never seen any messages in the log - one of the symptoms is a gap of 90 to 120s. There is no pattern to the events prior to the crash that I can discern.

That was a typo, he meant “journalctl”.
Anyway, this tool only shows the usual log files, which you already seem to have searched.