[RESPONDED] Changing the fan temperature points with ectool

Curious, are you able to take a screenshot when this happens?

I'll try and post it here, although there isn't much to see.

I get a battery critically low notification and something like 10 seconds later the laptop shuts down.

I am running Fedora's Cinnamon spin, so it may be a Cinnamon bug? It wouldn't be the first bug I've run into.

Edit: @Loell_Framework So yeah, it happens too fast for me to screenshot it. Another piece of information is that I am capping my battery charge at 60% in the UEFI, since the laptop is stationary.

Edit 2: Actually, there's a Cinnamon setting to "Do nothing" when the battery is extremely low (which I never let happen anyway). So I guess I just "fixed" my issue.


Any updates about fw-fanctrl? AFAIK the fan control monitors cpu_f75303@4d, which is NOT the CPU temperature. The actual CPU temperature is cpu@4c, but its fan thresholds (fan_off at 103 C, fan_max at 105 C) mean the fan only starts when the CPU is already at thermal-shutdown temperatures, as shown:

$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       319 K (= 46 C)          20% (313 K and 343 K)
cpu_f75303@4d         321 K (= 48 C)          25% (319 K and 327 K)
ddr_f75303@4d         315 K (= 42 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                365 K (= 92 C)           0% (376 K and 378 K)

Is it possible for fw-fanctrl to use the temperature reading of cpu@4c and adjust the fan curve for that sensor accordingly?
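
For anyone trying to read the ratio column in that output: the percentages shown are consistent with a simple linear interpolation between fan_off and fan_max, clamped to 0..100. The sketch below reproduces the numbers from the post under that assumption. The function names are made up for illustration, and this is not taken from the ectool source, so treat it as a guess that happens to match the output above.

```python
def kelvin_to_celsius(temp_k):
    # ectool prints whole degrees, e.g. 319 K (= 46 C), so subtract 273.
    return temp_k - 273


def fan_ratio_percent(temp_k, fan_off_k, fan_max_k):
    """Linear position of temp_k between fan_off and fan_max, as a percentage.

    Assumption: the ratio is linear and clamped to 0..100. Returns None when
    the two thresholds do not span a range (shown as N/A in the output above).
    """
    if fan_max_k <= fan_off_k:
        return None
    percent = 100 * (temp_k - fan_off_k) // (fan_max_k - fan_off_k)
    return max(0, min(100, percent))


# Values copied from the `ectool temps all` output in the post.
readings = [
    ("local_f75303@4d", 319, 313, 343),
    ("cpu_f75303@4d", 321, 319, 327),
    ("ddr_f75303@4d", 315, 401, 401),
    ("cpu@4c", 365, 376, 378),
]

for name, temp_k, fan_off_k, fan_max_k in readings:
    ratio = fan_ratio_percent(temp_k, fan_off_k, fan_max_k)
    shown = "N/A" if ratio is None else f"{ratio}%"
    print(f"{name:18} {kelvin_to_celsius(temp_k)} C  ratio {shown}")
```

Under this reading, cpu@4c at 92 C contributes 0% because its fan_off threshold sits at 376 K (103 C), which matches the complaint above.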

I've been using this systemd service for a while, and I'm happy with it:

Use this at your own risk, of course.

I've been having issues with my Framework 16 overheating and shutting down, so I tried to follow the advice in this thread to configure my fans to be a bit more aggressive.

I managed to install ectool (I'm running NixOS, so I just installed the default version of fw-ectool available on nixpkgs), but it's giving me rather odd output, with a bunch of zeroes.

> sudo ectool thermalget
sensor  warn  high  halt   fan_off fan_max   name
  0      363   363    378      0       0     ambient_f75303@4d
  1      363   363    378      0       0     charger_f75303@4d
  2      363   363    378    320     335     apu_f75303@4d
  3      381   381    400    320     335     cpu@4c
  4        0     0      0      0       0     gpu_amb_f75303@4d
  5      344     0      0    323     347     gpu_vr_f75303@4d
  6        0     0      0      0       0     gpu_vram_f75303@4d
  7        0     0      0    323     353     gpu_amdr23m@40

Can anyone help me figure out what is going on here, and how I can make my laptop not overheat?
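
In case it helps with reading that table: the thresholds are in Kelvin, so converting them to Celsius makes the rows easier to judge. Below is a minimal sketch that parses the pasted output and converts each column. It assumes a value of 0 means "no threshold set", which is only a guess, and the function names are hypothetical.

```python
# Output copied from the `sudo ectool thermalget` paste above.
THERMALGET_OUTPUT = """\
sensor  warn  high  halt   fan_off fan_max   name
  0      363   363    378      0       0     ambient_f75303@4d
  1      363   363    378      0       0     charger_f75303@4d
  2      363   363    378    320     335     apu_f75303@4d
  3      381   381    400    320     335     cpu@4c
  4        0     0      0      0       0     gpu_amb_f75303@4d
  5      344     0      0    323     347     gpu_vr_f75303@4d
  6        0     0      0      0       0     gpu_vram_f75303@4d
  7        0     0      0    323     353     gpu_amdr23m@40
"""

COLUMNS = ("warn", "high", "halt", "fan_off", "fan_max")


def to_celsius(kelvin):
    # Assumption: 0 means "not configured", so there is nothing to convert.
    return None if kelvin == 0 else kelvin - 273


def parse_thermalget(text):
    """Return one dict per sensor row, skipping the header line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        row = {"sensor": int(fields[0]), "name": fields[6]}
        for column, value in zip(COLUMNS, fields[1:6]):
            row[column] = int(value)
        rows.append(row)
    return rows


for row in parse_thermalget(THERMALGET_OUTPUT):
    celsius = {c: to_celsius(row[c]) for c in COLUMNS}
    print(row["sensor"], row["name"], celsius)
```

With this conversion, the apu and cpu@4c rows have fan_off at 47 C and fan_max at 62 C, while the ambient and charger rows have no fan thresholds at all.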


Overheating and shutting down seems like a defect, unless your ambient temperature is something like 40 C or higher.

Are your fans running at all? Or what is the situation this happens in?

As for the zeros, I suspect they mean that no temperature thresholds or fan speeds are set for those sensors.

It seems to happen specifically when my laptop is both plugged in to wall power and under load (specifically, light gaming; I haven't had any trouble with CPU-only loads like compiling).

I suspect it's the battery or charging circuit that is overheating: even right after a forced reboot, when the laptop is still very warm to the touch, btop reports a CPU temperature of around 50 to 60 C, which feels very low for a laptop that literally just overheated.

The fans do run, but I don't notice a difference in fan speeds when it's plugged in (and running much hotter) versus when it's not.

Sounds like a problem with the board that should be investigated. It's a hard shutdown with nothing weird in the logs?

I tried to search through the logs with journalctl -g 'temperature' -S 2024-09-05, since I found some resources claiming that a shutdown caused by overheating would be logged as "critical temperature reached", but there were no matching entries.

I guess that means it's not the OS that's deciding to reboot, but rather the board? I was a bit worried I'd caused this myself by using Nix (which I think is not officially supported), but if it's the board I should be safe.

@a_framework_owner I'm curious whether you managed to find out more about the root cause? I'm running into the same behaviour: once my laptop gets to about 10% battery, it shuts down. I'm running GNOME on Nix, so I guess that rules out desktop software. I did also set the battery limit (first at 60%, later tweaked to 80%); I'm going to try turning that off to see if it fixes it.

I would look at the log from just before one of these events to see if it says anything at all, rather than searching for something specific.

Ah, there is in fact something weird. I just had a forced reboot (around 22:40 local time), and there are repeated hardware errors in the system log in the 20 minutes preceding.

Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Error Addr: 0x00007f1e4a70ff40
Sep 08 22:25:20 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a00417a
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:25:20 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Error Addr: 0x00007f4d8c1a0e00
Sep 08 22:25:20 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a004170
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:25:40 nixos .xdg-desktop-po[2871]: Failed to stop screen cast session: GDBus.Error:org.freedesktop.DBus.Error.Failed: Session not s>
Sep 08 22:28:27 nixos kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
Sep 08 22:30:47 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85e01d00
Sep 08 22:30:47 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a004168
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:30:47 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85397f40
Sep 08 22:30:47 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a00417a
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:35:50 nixos kernel: perf: interrupt took too long (3187 > 3175), lowering kernel.perf_event_max_sample_rate to 62000
Sep 08 22:36:15 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85e01d00
Sep 08 22:36:15 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a004168
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:36:15 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85de0b80
Sep 08 22:36:15 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a00415c
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Doesn't look good. From some quick googling, it looks like a bad CPU. Things I would try:

Stress test the CPU, check the temperature across all cores, and check the log for messages like these.
Make sure your BIOS is up to date.
Make sure your kernel is up to date.
Confirm you've had those events around the time of other reboots.

But I am guessing you need a new mainboard. I might reach out to support right away with the logs.

Yeah, I noticed that the errors were always about CPU 14 and 15 (and always both at the same time). I disabled those CPUs and the problem magically disappeared. It seems most likely I've got a bad core, so I will indeed be reaching out to support.
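
If anyone wants to check their own log for the same pattern, here is a rough sketch that tallies the "[Hardware Error]: CPU:N" lines per logical CPU. The regular expression is written against the lines pasted above and may not match other kernels or error formats; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[...]: 0x...
CPU_LINE = re.compile(r"\[Hardware Error\]: CPU:(\d+) ")


def count_errors_by_cpu(lines):
    """Count machine-check status lines per logical CPU number."""
    counts = Counter()
    for line in lines:
        match = CPU_LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.",
    "Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151",
    "Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151",
    "Sep 08 22:28:27 nixos kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000",
    "Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151",
]

for cpu, count in sorted(count_errors_by_cpu(sample).items()):
    print(f"CPU {cpu}: {count} error(s)")
```

To run it on a real log, you could save the kernel messages to a file first (for example with journalctl -k) and pass the file's lines to count_errors_by_cpu. If the counts cluster on one or two logical CPUs, as they did here, that points at a specific core.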

@a_framework_owner Disabling the suspected faulty CPU cores also made my battery-related shutdowns disappear. Since you described similar problems, you might want to check your syslog as well, to see if your CPU is also faulty.
