Curious, are you able to take a screenshot when this happens?
I'll try and post it here, although there isn't much to see.
I get a battery critically low notification and something like 10 seconds later the laptop shuts down.
I am running Fedora's Cinnamon spin, so it may be a Cinnamon bug? It wouldn't be the first bug I've hit.
Edit: @Loell_Framework So yeah, it happens too fast for me to screenshot it. Another piece of information is that I am capping my battery charge at 60% in the UEFI since it's stationary.
Edit 2: Actually there's a Cinnamon setting to "Do nothing" when the battery is extremely low (which I never let happen anyway). So I guess I just "fixed" my issue.
Any updates about fw-fanctrl? AFAIK the fan control monitors cpu_f75303@4d, which is NOT the CPU temperature. The actual CPU temperature is cpu@4c, but for that sensor the fan only starts when the CPU is already at thermal shutdown temperatures (fan_off at 103 C, fan_max at 105 C), as shown:
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d 319 K (= 46 C) 20% (313 K and 343 K)
cpu_f75303@4d 321 K (= 48 C) 25% (319 K and 327 K)
ddr_f75303@4d 315 K (= 42 C) N/A (fan_off=401 K, fan_max=401 K)
cpu@4c 365 K (= 92 C) 0% (376 K and 378 K)
Is it possible for fw-fanctrl to use the temp reading of cpu@4c and edit the fan curve on that one accordingly?
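For reference, the ratio column in the output above is consistent with a plain linear interpolation between fan_off and fan_max, clamped to 0..100%. The sketch below reproduces the four readings. The function name `fan_ratio_percent` is made up for illustration, and the formula is inferred from the numbers in this post, not taken from the ectool or fw-fanctrl source.

```python
from typing import Optional


def fan_ratio_percent(temp_k: int, fan_off_k: int, fan_max_k: int) -> Optional[int]:
    """Linear fan ratio between fan_off and fan_max, clamped to 0..100.

    Returns None when the thresholds do not span a range
    (ectool prints N/A for that case in the output above).
    Assumption: this formula is inferred from the posted numbers.
    """
    if fan_max_k <= fan_off_k:
        return None
    ratio = (temp_k - fan_off_k) / (fan_max_k - fan_off_k)
    return round(100 * min(1.0, max(0.0, ratio)))


# Readings copied from the `ectool temps all` output above:
# (temperature, fan_off, fan_max), all in kelvin.
readings = {
    "local_f75303@4d": (319, 313, 343),
    "cpu_f75303@4d": (321, 319, 327),
    "ddr_f75303@4d": (315, 401, 401),
    "cpu@4c": (365, 376, 378),
}

for name, (temp, fan_off, fan_max) in readings.items():
    print(name, fan_ratio_percent(temp, fan_off, fan_max))
```

With these thresholds cpu@4c contributes 0% until it reaches 376 K (103 C), which matches the behaviour described above.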
I've been using this systemd service for a while, and I'm happy with it:
Use this at your own risk, of course.
I've been having issues with my Framework 16 overheating and shutting down, so I tried to follow the advice in this thread to configure my fans to be a bit more aggressive.
I managed to install ectool (I'm running NixOS, so I just installed the default version of fw-ectool available on nixpkgs), but it's giving me rather odd output, with a bunch of zeroes.
> sudo ectool thermalget
sensor warn high halt fan_off fan_max name
0 363 363 378 0 0 ambient_f75303@4d
1 363 363 378 0 0 charger_f75303@4d
2 363 363 378 320 335 apu_f75303@4d
3 381 381 400 320 335 cpu@4c
4 0 0 0 0 0 gpu_amb_f75303@4d
5 344 0 0 323 347 gpu_vr_f75303@4d
6 0 0 0 0 0 gpu_vram_f75303@4d
7 0 0 0 323 353 gpu_amdr23m@40
Can anyone help me figure out what is going on here, and how I can make my laptop not overheat?
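To make the table above easier to read, here is a small sketch that parses it and converts the thresholds to Celsius. Two assumptions: a value of 0 is treated as "not set" rather than 0 K, and the 273 offset is chosen because it matches how ectool itself prints temperatures elsewhere in this thread (319 K = 46 C). The function names are made up for illustration.

```python
from typing import Optional

# Output copied from the `sudo ectool thermalget` run above.
THERMALGET_OUTPUT = """\
sensor warn high halt fan_off fan_max name
0 363 363 378 0 0 ambient_f75303@4d
1 363 363 378 0 0 charger_f75303@4d
2 363 363 378 320 335 apu_f75303@4d
3 381 381 400 320 335 cpu@4c
4 0 0 0 0 0 gpu_amb_f75303@4d
5 344 0 0 323 347 gpu_vr_f75303@4d
6 0 0 0 0 0 gpu_vram_f75303@4d
7 0 0 0 323 353 gpu_amdr23m@40
"""


def kelvin_to_celsius(kelvin: int) -> Optional[int]:
    # Assumption: 0 means "threshold not set", not 0 K.
    return None if kelvin == 0 else kelvin - 273


def parse_thermalget(text: str) -> list:
    """Return one dict per sensor with thresholds in Celsius (None if unset)."""
    lines = text.strip().splitlines()
    columns = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip anything that is not a full table row
        row = {"sensor": int(fields[0]), "name": fields[-1]}
        for column, value in zip(columns[1:-1], fields[1:-1]):
            row[column] = kelvin_to_celsius(int(value))
        rows.append(row)
    return rows


for row in parse_thermalget(THERMALGET_OUTPUT):
    print(row)
```

Read this way, the apu and cpu sensors ramp the fan between 47 C and 62 C, and sensors 4 and 6 have no thresholds at all.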
Overheating and shutting down seems like a defect unless your ambient temp is something like 40 C or more.
Are your fans running at all? Or what is the situation this happens in?
As for the zeros, I suspect they mean that no temperature thresholds or fan speeds are set for those sensors.
It seems to happen specifically when my laptop is both plugged in to wall power and under load (specifically, light gaming; I haven't had any trouble with CPU-only loads like compiling).
I suspect it's the battery or charging circuit that is overheating, since even right after it forcefully reboots, and is still very warm to the touch, btop reports a CPU temp around 50 to 60 C, which feels very low for a laptop that literally just overheated.
The fans do run, but I don't notice a difference in fan speeds when it's plugged in (and running much hotter) versus when it's not.
Sounds like a problem with the board that should be investigated. It's a hard shutdown with nothing weird in the logs?
I tried to search through the logs with journalctl -g 'temperature' -S 2024-09-05, since I found some resources claiming that a shutdown because of overheating would be logged as "critical temperature reached", but there were no entries that matched.
I guess that means it's not the OS that's deciding to reboot, but rather the board? I was a bit worried I'd caused this myself by using Nix (which I think is not officially supported), but if it's the board I should be safe.
@a_framework_owner I'm curious if you managed to find out more about the root cause? I'm running into the same behaviour: once my laptop gets to about 10% battery it shuts down. I'm running GNOME on Nix, so I guess that rules out software. I did also set the battery limit (first at 60%, later tweaked to 80%). I'm going to try turning that off to see if it fixes it.
I would look at the log just before one of these events to see if it says anything at all, rather than searching for something specific.
Ah, there is in fact something weird. I just had a forced reboot (around 22:40 local time), and there are repeated hardware errors in the system log in the 20 minutes preceding.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Error Addr: 0x00007f1e4a70ff40
Sep 08 22:25:20 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a00417a
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:25:20 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Error Addr: 0x00007f4d8c1a0e00
Sep 08 22:25:20 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a004170
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:25:40 nixos .xdg-desktop-po[2871]: Failed to stop screen cast session: GDBus.Error:org.freedesktop.DBus.Error.Failed: Session not s>
Sep 08 22:28:27 nixos kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
Sep 08 22:30:47 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85e01d00
Sep 08 22:30:47 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a004168
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:30:47 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85397f40
Sep 08 22:30:47 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a00417a
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:35:50 nixos kernel: perf: interrupt took too long (3187 > 3175), lowering kernel.perf_event_max_sample_rate to 62000
Sep 08 22:36:15 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85e01d00
Sep 08 22:36:15 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a004168
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:36:15 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85de0b80
Sep 08 22:36:15 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a00415c
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
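A quick way to see which logical CPUs these machine check events point at is to tally the `CPU:<n>` field. The sketch below does that for a few lines copied from the excerpt above. `count_mce_by_cpu` is a made-up helper name, and the regular expression only assumes the line shape seen in this log.

```python
import re
from collections import Counter

# Matches the "CPU:<n> (family:model:stepping)" lines in the excerpt above.
CPU_LINE = re.compile(r"\[Hardware Error\]: CPU:(\d+) ")


def count_mce_by_cpu(lines) -> Counter:
    """Count machine check reports per logical CPU number."""
    counts = Counter()
    for line in lines:
        match = CPU_LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied verbatim from the log above.
SAMPLE = [
    "Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.",
    "Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151",
    "Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151",
    "Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151",
]

print(count_mce_by_cpu(SAMPLE))
```

The same function can be fed the full kernel log, for example the saved output of `journalctl -k`, to check whether other reboots were preceded by the same pattern.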
Doesn't look good. From some quick googling, it looks like a bad CPU. Things I would try:
Stress test the CPU, check the temperature across all cores, and check the log for messages like these.
Make sure your BIOS is up to date.
Make sure your kernel is up to date.
Confirm you've had those events around the time of other reboots.
But I am guessing you need a new mainboard. I would reach out to support right away with the logs.
Yeah, I noticed that the errors were always about CPU 14 and 15 (and always both at the same time). I disabled those CPUs and the problem magically disappeared. It seems most likely I've got a bad core, so I will indeed be reaching out to support.
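The post does not say how the CPUs were disabled. One mechanism Linux offers is the CPU hotplug `online` file under `/sys/devices/system/cpu/cpuN/`, and the sketch below uses it as an assumption, not as the method used here. The helper names are made up, writing to the real sysfs path needs root, and the setting does not survive a reboot.

```python
from pathlib import Path

# Standard Linux CPU hotplug location; overridable so the helper can be
# tried against a scratch directory first.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu: int, online: bool, root: Path = SYSFS_CPU_ROOT) -> Path:
    """Write 1 or 0 to cpu<N>/online and return the path that was written."""
    control = root / f"cpu{cpu}" / "online"
    if not control.exists():
        # Some CPUs (often cpu0) do not expose an `online` file.
        raise FileNotFoundError(f"{control} not found")
    control.write_text("1" if online else "0")
    return control


def offline_cpus(cpus, root: Path = SYSFS_CPU_ROOT) -> None:
    """Take each listed logical CPU offline."""
    for cpu in cpus:
        set_cpu_online(cpu, False, root)


# Example (needs root and changes the running system, so it is left commented out):
# offline_cpus([14, 15])
```

This is a diagnostic workaround at best. If the errors stop once the two CPUs are offline, that is useful evidence to include in a support ticket.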
@a_framework_owner Disabling the suspected faulty CPU cores also made my battery-related shutdowns disappear. Since you described similar problems, you might want to check your syslog as well, to see if your CPU is also faulty.