Surprise crash under high CPU load

Jeff_Trull · June 11, 2024, 6:32pm

Hi all, just experienced my first hard crash on FW16 (Ubuntu). Running Firefox, Emacs and a “make -j 6” on a big job, suddenly there was a loud buzzing sound from the speaker and the display shut off. Very high fan noise, slowly reducing over time until 1m 45s after the crash (I can see the gap in the logs) it rebooted without my assistance. Quite odd. Has anyone else observed this? Is there something I can do to prepare in case it happens again?

Jorg_Mertin · June 11, 2024, 7:13pm

Nope. Didn’t see that, and I stress tested it quite a lot.
Maybe “journalctl -b 1” shows what caused it to crash/restart.

Jeff_Trull · June 11, 2024, 11:00pm

journalctl -b 1 shows a log from a few weeks back. There are no clues in syslog or kernel.log other than the 1m 45s pause followed by a boot sequence.
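
A note on the boot offset: journalctl counts positive offsets from the oldest boot in the journal, so “-b 1” is the first boot on record, “-b 0” is the current boot, and “-b -1” is the boot before the current one, which is the one that ended in the crash. A minimal Python sketch, assuming a persistent journal and permission to read it; journal_cmd and tail_of_boot are hypothetical helper names, not part of any tool:

```python
import subprocess


def journal_cmd(boot_offset=-1, kernel_only=True):
    """Build a journalctl command line for a single boot.

    0 is the current boot, -1 the previous one, -2 the one before that.
    Positive numbers count from the oldest boot stored in the journal.
    """
    cmd = ["journalctl", "--no-pager", f"--boot={boot_offset}"]
    if kernel_only:
        cmd.append("--dmesg")  # kernel messages only
    return cmd


def tail_of_boot(boot_offset=-1, lines=50):
    """Return the last few journal lines of the chosen boot as text."""
    result = subprocess.run(
        journal_cmd(boot_offset) + [f"--lines={lines}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling print(tail_of_boot(-1)) right after an unexpected reboot shows whatever the kernel managed to log last. If the machine died hard, the final messages may never have reached the disk.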

Jeff_Trull · June 11, 2024, 11:01pm

Given the fan activity, I suspect it overheated and the BIOS (or something like that) took over.

Jorg_Mertin · June 12, 2024, 7:25am

I don’t think the BIOS takes over.
Try monitoring the temperature in a window before doing that again and check if it gets up that fast.
On my system, the fans spin up and become loud, but it does not crash.
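
Since the display goes dark at the moment of the crash, it may help to log temperatures to a file as well as watching them. A rough Python sketch, assuming the kernel exposes thermal zones under /sys/class/thermal with values in millidegrees Celsius (some sensors only appear under /sys/class/hwmon); read_temps and log_temps are hypothetical names:

```python
import os
import time
from pathlib import Path


def read_temps(base="/sys/class/thermal"):
    """Return {zone:type -> degrees Celsius} for every readable thermal zone."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone vanished or returned something unreadable
        temps[f"{zone.name}:{kind}"] = millideg / 1000.0
    return temps


def log_temps(logfile, interval=1.0, base="/sys/class/thermal", samples=None):
    """Append one timestamped line per interval; samples=None runs until stopped."""
    count = 0
    with open(logfile, "a") as fh:
        while samples is None or count < samples:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            readings = " ".join(f"{k}={v:.1f}" for k, v in read_temps(base).items())
            fh.write(f"{stamp} {readings}\n")
            fh.flush()
            os.fsync(fh.fileno())  # push the line to disk before the next sample
            count += 1
            if samples is None or count < samples:
                time.sleep(interval)
```

Running log_temps("temps.log") in a terminal during the build leaves a record to check after the reboot. The fsync call makes it more likely, though not certain, that the last reading survives a hard crash.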

Jeff_Trull · June 12, 2024, 5:36pm

My overheating theory is that some temperature sensor hit the “emergency shutdown” point, which would explain why there’s no record of it in the log (aside from the mysterious 1:45). It’s not reproducible and only happened one time. If I were the EE involved in figuring out the recovery path I would have a special circuit (or more likely, some kind of service processor) run the fans until it was safe to restart.

I’ve added a temperature sensor applet to my menu bar and if I think of it in time I’ll glance over but I suspect that the time between overheat shutdown and the display blanking will be too short for me to catch it. Who knows, maybe it won’t happen again!

Jorg_Mertin · June 13, 2024, 8:34am

Usually when a sensor hits its limit, the Linux kernel would mention it in the ring buffer.

So IMHO this is something else. I have been building my own computers for 34+ years, and usually when you saw nothing on the OS side, it was a hardware issue.
When you saw something on the OS side, it was a manufacturing or assembly problem (bad cable, card loose in its slot, a cooler not mounted correctly somewhere) or a driver issue.

My best bet here is memory. If you have 2 RAM modules, take one out and run a thorough memory stress test, then swap the RAM sticks and do the same with the other one.
Because the memory slot/ram module combination can also play a role, do the same on a different RAM slot. Only then can you be sure.

Jeff_Trull · June 16, 2024, 9:16pm

Update: new crash but this time, no high CPU, just normal usage. No fan noise. Just a buzzing sound and the disappearance of video. There is a 2m 20s gap in syslog plus a small region of NUL characters. No other symptoms.
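
The two symptoms described here, a timestamp gap and a run of NUL characters, can be located mechanically. A minimal Python sketch, assuming each syslog line starts with an ISO 8601 timestamp (older rsyslog setups use a “Jun 16 21:16:01” style instead and would need a different parser); find_gaps and count_nul_bytes are hypothetical names:

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Return (before, after, length) for every jump between consecutive timestamps."""
    gaps = []
    prev = None
    for line in lines:
        # Drop NUL padding, then take the first space-delimited token.
        token = line.replace("\x00", "").split(" ", 1)[0]
        try:
            ts = datetime.fromisoformat(token)
        except ValueError:
            continue  # line does not start with a parseable timestamp
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


def count_nul_bytes(data):
    """Count NUL bytes in raw log data (pass bytes, not str)."""
    return data.count(b"\x00")
```

For example, find_gaps(open("/var/log/syslog", errors="replace")) lists every pause of a minute or more, and count_nul_bytes(open("/var/log/syslog", "rb").read()) shows whether the file contains NUL padding. A block of NULs next to the gap is consistent with a write that was cut off mid-flight, but it does not say what cut the power.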

Jorg_Mertin · June 17, 2024, 7:23am

My advice is to open up your device, and make sure all connections are seated right. Make sure Memory (especially) and NVMe disk connections are good. Verify all ribbon cables etc.

Of course, I hope you opened a support case for that.
There seems to be a hardware issue. It could come from memory or disk, where the system cannot write out the last bit of data and produces garbage at the end.

Jeff_Trull · June 25, 2024, 8:36pm

I have now submitted a Support request. I’ll run memtest tonight and try reseating everything tomorrow. Now up to 4 crashes.

Jeff_Trull · June 27, 2024, 5:28pm

memtest passes. I guess the NVMe drive is a candidate for reseating.

Jorg_Mertin · June 28, 2024, 8:31am

What you can do is configure journald to dump all messages to a tty console.
This way, when it crashes you have the info on the screen (most of the time you can still switch virtual consoles with Ctrl-Alt-F2 up to F8, depending on how many the distribution configures by default).
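
One way to set this up is a journald drop-in file. The fragment below is a sketch, not a tested configuration for this machine: the file name console.conf and the choice of tty12 are assumptions, while ForwardToConsole, TTYPath and MaxLevelConsole are options documented in journald.conf.

```ini
# /etc/systemd/journald.conf.d/console.conf  (file name is an arbitrary choice)
[Journal]
# Copy journal messages to a console device as they arrive.
ForwardToConsole=yes
# Use a spare virtual console so login terminals stay readable.
TTYPath=/dev/tty12
# Forward everything up to debug level; the default is info.
MaxLevelConsole=debug
```

After saving the file, “sudo systemctl restart systemd-journald” applies it, and Ctrl-Alt-F12 should switch to that console. If the display itself goes dark during the crash, as reported earlier in the thread, this may still show nothing.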

Khalid · July 18, 2024, 10:15am

Hi Jeff_Trull,

hope you are well!
Did you resolve this issue (and how)?
I have had a very similar-looking issue for a few days, which is why I am asking.

Thanks!

Jeff_Trull · July 18, 2024, 4:47pm

I am still going through the analysis process, which involves removing both DIMMs, then trying each one independently in each slot. My crashes only occurred once per week on average, so it may take some time…

suliblian · July 20, 2024, 7:20am

Just to rule out software: try a stress test like prime95 as a high load alternative. One never knows…

Also, to mitigate, check the BIOS for any remaining options to cap clock speeds. You didn’t specify the CPU; if it’s Intel, you can turn off Turbo Boost, for example.

Khalid · July 24, 2024, 11:01am

Which memory modules are you using, if I may ask …

Jeff_Trull · July 24, 2024, 5:03pm

The 32GB ones (x2) that came with my order.

James3 · July 24, 2024, 9:57pm

There is something called a machine check exception.
If the CPU shuts itself down due to an MCE, on the next warm reboot the Linux kernel will collect the MCE log, which can sometimes give a hint as to what went wrong.
So, when it happens again, do a “journal -b 0” and then look for “mce” or “MCE” and post the output here.
If it happened on previous boots, you can look in “syslog” or “messages” for older MCEs.
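
The search can be scripted so it is quick to repeat after each crash. A small Python sketch, assuming the log lines are already available as text (from the journal or from a syslog file); the pattern is a guess at what is worth flagging, and mce_lines is a hypothetical name:

```python
import re

# Matches "mce" as a whole word plus two phrases the kernel commonly uses
# when reporting machine check events. Case-insensitive.
MCE_PATTERN = re.compile(r"\bmce\b|machine check|hardware error", re.IGNORECASE)


def mce_lines(lines):
    """Return the log lines that mention machine check events."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]
```

For example, mce_lines(open("/var/log/syslog", errors="replace")) covers older boots that are still in the file. Purely informational lines that merely mention MCE support will match too, so each hit still needs to be read by hand.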

Jeff_Trull · July 24, 2024, 10:13pm

The only reference to mce in the logs is:

jet-framework-16 kernel: MCE: In-kernel MCE decoding enabled.

I don’t have journal and when I install it some user app comes up, so I think that’s not an Ubuntu thing.

FWIW I’ve never seen any messages in the log - one of the symptoms is a gap of 90 to 120s. There is no pattern to the events prior to the crash that I can discern.

Khalid · July 25, 2024, 8:42am

That was a typo, he meant “journalctl”.
Anyway, this tool only shows the usual log files, which you already seem to have searched.
