[RESPONDED] Changing the fan temperature points with ectool

Loell_Framework · August 24, 2023, 1:30am

Curious, are you able to take a screenshot when this happens?

a_framework_owner · August 24, 2023, 7:23am

I’ll try and post it here although there isn’t much to see.

I get a battery critically low notification and something like 10 seconds later the laptop shuts down.

I am running Fedora’s Cinnamon spin, so it may be a Cinnamon bug. It wouldn’t be the first one I’ve hit.

Edit: @Loell_Framework So yeah, it happens too fast for me to screenshot it. Another piece of information is that I cap my battery charge at 60% in the UEFI, since the laptop is stationary.

Edit 2: Actually there’s a Cinnamon setting to “Do nothing” when the battery is extremely low (which I never let happen anyway). So I guess I just “fixed” my issue.

Charlie_6 · April 24, 2024, 1:45am

Any updates about fw-fanctrl? AFAIK the fan control monitors cpu_f75303@4d, which is NOT the CPU temperature. The actual CPU temperature is cpu@4c, but for that sensor the fan only starts at 103 C and reaches full speed at 105 C, by which point the CPU is already at thermal shutdown, as shown:

$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       319 K (= 46 C)          20% (313 K and 343 K)
cpu_f75303@4d         321 K (= 48 C)          25% (319 K and 327 K)
ddr_f75303@4d         315 K (= 42 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                365 K (= 92 C)           0% (376 K and 378 K)

Is it possible for fw-fanctrl to use the temp reading of cpu@4c and edit the fan curve on that one accordingly?
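For anyone reading the table above: the ratio column is consistent with a linear interpolation between fan_off and fan_max, clamped to 0..100%, and the Celsius figures match K - 273. A minimal sketch under that assumption (the function names `k_to_c` and `fan_ratio` are made up for illustration, they are not part of ectool):

```shell
#!/bin/sh
# Convert an EC temperature from Kelvin to Celsius (the table uses K - 273).
k_to_c() {
    echo $(( $1 - 273 ))
}

# Fan ratio in percent for a reading, given fan_off and fan_max (all in K).
# Assumption: linear interpolation between the two points, clamped to 0..100.
fan_ratio() {
    fr_temp=$1; fr_off=$2; fr_max=$3
    if [ "$fr_temp" -le "$fr_off" ]; then
        echo 0
    elif [ "$fr_temp" -ge "$fr_max" ]; then
        echo 100
    else
        echo $(( 100 * (fr_temp - fr_off) / (fr_max - fr_off) ))
    fi
}

fan_ratio 319 313 343   # local_f75303@4d -> 20
fan_ratio 321 319 327   # cpu_f75303@4d   -> 25
fan_ratio 365 376 378   # cpu@4c          -> 0
k_to_c 376              # fan_off of cpu@4c -> 103
k_to_c 378              # fan_max of cpu@4c -> 105
```

This reproduces the 20%, 25% and 0% in the output, and shows why cpu@4c contributes nothing at 92 C: its fan_off point is still 11 K away.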

real_or_random · June 19, 2024, 10:50pm

I’ve been using this systemd service for a while, and I’m happy with it:

https://gist.github.com/real-or-random/0c543b50b629c2b306c30877b712bc18

fw-fan-settings.service

[Unit]
Description=Sets fan parameters on Framework

# Try to avoid multiple ectool instances running at the same time
Before=fw-charge-limiter.service

[Service]
Type=oneshot
RemainAfterExit=true

(This file is truncated here; see the gist linked above for the full version.)

fw-fan-settings.sh

#!/bin/bash

# From https://stackoverflow.com/a/35977896
#
# Retries a command on failure.
# $1 - the max number of attempts
# $2... - the command to run
retry() {
   local -r -i max_attempts="$1"; shift
   local -r cmd="$@"

(This file is truncated here; see the gist linked above for the full version.)

Use this at your own risk, of course.

Alex_Keizer · September 7, 2024, 6:49pm

I’ve been having issues with my Framework 16 overheating and shutting down, so I tried to follow the advice in this thread to configure my fans to be a bit more aggressive.

I managed to install ectool (I’m running NixOS, so I just installed the default version of fw-ectool available on nixpkgs), but it’s giving me a rather odd output, with a bunch of zeroes.

> sudo ectool thermalget
sensor  warn  high  halt   fan_off fan_max   name
  0      363   363    378      0       0     ambient_f75303@4d
  1      363   363    378      0       0     charger_f75303@4d
  2      363   363    378    320     335     apu_f75303@4d
  3      381   381    400    320     335     cpu@4c
  4        0     0      0      0       0     gpu_amb_f75303@4d
  5      344     0      0    323     347     gpu_vr_f75303@4d
  6        0     0      0      0       0     gpu_vram_f75303@4d
  7        0     0      0    323     353     gpu_amdr23m@40

Can anyone help me figure out what is going on here, and how I can make my laptop not overheat?

parawizard · September 8, 2024, 12:33am

Overheating and shutting down seems like a defect, unless your ambient temperature is 40 C or higher.

Are your fans running at all? And in what situation does this happen?

As for the zeros, I suspect they mean that no temperature thresholds or fan points are set for those sensors.
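Since thermalget prints everything in Kelvin, it can help to view the table in Celsius. Here is a rough sketch of a filter for that; the function name `thermalget_to_celsius` is hypothetical, and it assumes (as suspected above, not confirmed) that a 0 means "not set" rather than 0 K:

```shell
#!/bin/sh
# Reads `ectool thermalget` output on stdin and prints the thresholds in Celsius.
# Assumption: a value of 0 means "not set", so it is shown as "-".
thermalget_to_celsius() {
    awk '
        $1 == "sensor" { print; next }          # keep the header row as-is
        $1 ~ /^[0-9]+$/ && NF >= 7 {
            printf "%3s", $1                    # sensor index
            for (i = 2; i <= 6; i++) {          # warn high halt fan_off fan_max
                if ($i == 0) printf "%7s", "-"
                else         printf "%7d", $i - 273
            }
            printf "   %s\n", $7                # sensor name
        }
    '
}
```

Usage would be `sudo ectool thermalget | thermalget_to_celsius`. For the cpu@4c row above this gives warn/high at 108 C, halt at 127 C, fan_off at 47 C and fan_max at 62 C.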

Alex_Keizer · September 8, 2024, 8:54pm

It seems to happen specifically when my laptop is both plugged in to wall power and under load (specifically, light gaming; I haven’t had any trouble with CPU-only loads like compiling).

I suspect it’s the battery or charging circuit that is overheating, since even right after it forcefully reboots, and is still very warm to the touch, btop reports a CPU temp around 50 to 60 C, which feels very low for a laptop that literally just overheated.

The fans do run, but I don’t notice a difference in fan speeds when it’s plugged in (and running much hotter) versus when it’s not.

parawizard · September 8, 2024, 9:08pm

Sounds like a problem with the board that should be investigated. It’s a hard shutdown with nothing weird in the logs?

Alex_Keizer · September 9, 2024, 1:52am

I tried to search through the logs with journalctl -g 'temperature' -S 2024-09-05, since I found some resources claiming that a shutdown because of overheat would be logged as “critical temperature reached”, but there were no entries that matched.

I guess that means it’s not the OS that’s deciding to reboot, but rather the board? I was a bit worried I’d caused this myself by using Nix (which I think is not officially supported), but if it’s the board I should be safe.

Alex_Keizer · September 9, 2024, 1:55am

@a_framework_owner I’m curious if you managed to find out more about the root cause? I’m running into the same behaviour: once my laptop gets to about 10% battery it shuts down. I’m running GNOME on Nix, so I guess that rules out software. I did also set the battery limit (first at 60%, later tweaked to 80%). I’m going to try turning that off to see if it fixes the problem.

parawizard · September 9, 2024, 2:31am

I would look at the log entries just before one of these events to see if anything shows up, rather than searching for something specific.
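One way to do that, sketched below, is to dump the tail of the kernel log from the boot that ended in the shutdown. The wrapper name `prev_boot_kernel_tail` is made up; the journalctl flags are standard (`-k` for kernel messages, `-b -1` for the previous boot, `-n` for the number of lines). It assumes the journal is persistent, otherwise there is no previous boot to read:

```shell
#!/bin/sh
# Show the last N kernel messages (default 200) from the previous boot.
# Assumption: persistent journald storage, so that boot -1 is still available.
prev_boot_kernel_tail() {
    journalctl -k -b -1 -n "${1:-200}" --no-pager
}
```

Running `prev_boot_kernel_tail 500` right after a forced reboot should show whatever the kernel logged in the minutes before it, without guessing a search term up front.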

Alex_Keizer · September 9, 2024, 3:53am

Ah, there is in fact something weird. I just had a forced reboot (around 22:40 local time), and there are repeated hardware errors in the system log in the 20 minutes preceding.

Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Error Addr: 0x00007f1e4a70ff40
Sep 08 22:25:20 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a00417a
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:25:20 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Error Addr: 0x00007f4d8c1a0e00
Sep 08 22:25:20 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a004170
Sep 08 22:25:20 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:25:20 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:25:40 nixos .xdg-desktop-po[2871]: Failed to stop screen cast session: GDBus.Error:org.freedesktop.DBus.Error.Failed: Session not s>
Sep 08 22:28:27 nixos kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
Sep 08 22:30:47 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85e01d00
Sep 08 22:30:47 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a004168
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:30:47 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85397f40
Sep 08 22:30:47 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a00417a
Sep 08 22:30:47 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:30:47 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:35:50 nixos kernel: perf: interrupt took too long (3187 > 3175), lowering kernel.perf_event_max_sample_rate to 62000
Sep 08 22:36:15 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: CPU:14 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85e01d00
Sep 08 22:36:15 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eaa00, Syndrome: 0x000000001a004168
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Sep 08 22:36:15 nixos kernel: mce: [Hardware Error]: Machine check events logged
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Corrected error, no action required.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: CPU:15 (19:74:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Error Addr: 0x01ffffff85de0b80
Sep 08 22:36:15 nixos kernel: [Hardware Error]: IPID: 0x000100b0200eab00, Syndrome: 0x000000001a00415c
Sep 08 22:36:15 nixos kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
Sep 08 22:36:15 nixos kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
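To see which cores entries like these point at, a small filter can tally the "CPU:N" lines. This is only a sketch; `mce_count_per_cpu` is a name made up here, and it simply counts lines in whatever journal text is piped in:

```shell
#!/bin/sh
# Tally "[Hardware Error]: CPU:N" lines per CPU from journal text on stdin.
mce_count_per_cpu() {
    grep -o 'Hardware Error]: CPU:[0-9][0-9]*' \
        | sed 's/.*CPU://' \
        | sort -n \
        | uniq -c \
        | awk '{ print "CPU " $2 ": " $1 }'
}
```

For example, `journalctl -k -b -1 --no-pager | mce_count_per_cpu` would summarize the previous boot. If the same one or two CPU numbers dominate across several boots, that is worth including in a support ticket.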

parawizard · September 9, 2024, 4:23am

Doesn’t look good. Looks like a bad CPU from a quick googling. Things I would try:

Stress test the CPU, check the temperature across all cores, and check the log for messages like these.
Make sure your BIOS is up to date.
Make sure your kernel is up to date.
Confirm you’ve had those events around the time of other reboots.

But I am guessing you need a new mainboard. I might reach out to support right away with the logs.

Alex_Keizer · September 11, 2024, 5:50am

Yeah, I noticed that the errors were always about CPU 14 and 15 (and always both at the same time). I disabled those CPUs and the problem magically disappeared. It seems most likely that I’ve got a bad core, so I will indeed be reaching out to support.
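The post doesn’t say how the CPUs were disabled. One common way on Linux, shown here as an assumption rather than the method actually used, is the sysfs `online` file for each logical CPU. The helper name `set_cpu_online` and the `CPU_SYSFS` override are made up for this sketch:

```shell
#!/bin/sh
# Take a logical CPU offline (0) or bring it back online (1) via sysfs.
# Needs root for the real path. CPU_SYSFS exists only so the function can be
# tried against a scratch directory; it is not a standard variable.
set_cpu_online() {
    sc_cpu=$1; sc_state=$2
    sc_node="${CPU_SYSFS:-/sys/devices/system/cpu}/cpu${sc_cpu}/online"
    if [ ! -e "$sc_node" ]; then
        echo "cpu${sc_cpu} has no online control at ${sc_node}" >&2
        return 1
    fi
    echo "$sc_state" > "$sc_node"
}
```

As root, `set_cpu_online 14 0` and `set_cpu_online 15 0` would take both logical CPUs offline until the next reboot, since the setting is not persistent. Two adjacent CPU numbers failing together may well be the two SMT threads of one physical core; `/sys/devices/system/cpu/cpu14/topology/thread_siblings_list` shows the pairing.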

@a_framework_owner Disabling the suspected faulty CPU cores also made my battery-related shutdowns disappear. Since you described similar problems, you might want to check your syslog as well, to see if your CPU is also faulty.
