[RESPONDED] FW 13 7840U ACPI thermal readout problem

Guest68 · August 20, 2024, 2:12am

Looking into the EC log has not been helpful. I wasn’t sure what to make of the 4-digit groupings of the port 80 / POST codes, but no matter how I parse them, most are not on the list shared by NRP, which may well be out of date and/or not apply to the AMD boards even if it was current. One post suggested that the codes may be LSB, but that didn’t seem to help here. It’s entirely possible I’m not reading them correctly, but I don’t know what else to look for.

POST codes aside, none of the other lines were particularly meaningful to me either.

[xxxxxx.xxxxxx SB-SMI: Mailbox transfer timeout]
[xxxxxx.xxxxxx SB-RMI Error: 4]

Something AMD platform related (kernel driver docs for the interface); doesn’t look relevant.

[xxxxxx.xxxxxx HC 0x0115 err 1]

A “host command” related to reading and deleting PD logs.

[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]

Produced on each run of ectool console to fetch the EC version and read protocol info.

Additionally, reading the logs when there was no error, waiting a few seconds for a sensor read to fail, and rerunning the command showed that there were no particular log entries generated by the failed read, as demonstrated by the log ending in

[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]
[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]

If the POST codes are written to the log out of chronological order with other entries (due to polling or something), then it’s possible my “window test” wouldn’t have captured the log, but it happens so often that it should have shown up earlier in the logs.
Perhaps the EC only emits log entries for repeated failed reads after a particular interval to prevent flooding the buffer, but I haven’t caught anything that stood out to me.

Unless someone else has another idea, I think all that’s left for me to do is to see if Stephen Horvath has anything important to add and then contact Framework support.

Steve-Tech · August 20, 2024, 3:29am

Hi Guest68, I just got your email.

Sorry, I never got a chance to create an issue. I got distracted with trying to learn how the EC works and maybe fix it myself, and then forgot about it altogether.

I did however start writing a draft in my notes app if it helps:

Hi, the sensor cpu@4c seems to intermittently return EC_TEMP_SENSOR_ERROR (0xfe). The same sensor using ACPI also returns 180.8°C while this is happening, which seems very wrong.

It seems to occur more frequently while unplugged from the charger, but it can occur while charging.

I’m not too experienced with the CrOS EC codebase, but can’t find anything obvious in the code that would cause it. So I’m hoping it’s not an interference/hardware issue, but I wouldn’t be surprised if it is.

This also seems to occur for @t-8ch while discussing a Linux hwmon driver.

Command Outputs:

sensors acpitz-acpi-0

acpitz-acpi-0
Adapter: ACPI interface
local_f75303@4d:  +32.8°C
cpu_f75303@4d:    +32.8°C
ddr_f75303@4d:    +30.8°C
cpu@4c:          +180.8°C

ectool temps all

--sensor name -------- temperature -------- fan speed --
local_f75303@4d       307 K (= 34 C)           0%
cpu_f75303@4d         306 K (= 33 C)           0%
ddr_f75303@4d         304 K (= 31 C)          -1%
Sensor 3 error

A Python script using my CrOS_EC_Python.

Temp Sensor 0: 308K (35°C)
Temp Sensor 1: 308K (35°C)
Temp Sensor 2: 306K (33°C)
Temp Sensor 3: Error 0xfe

Guest68 · August 20, 2024, 4:08am

Thanks for the input. Interestingly I also observe the issue primarily when on battery power, but I hadn’t made that connection yet. I’ll put together a support ticket submission soon with the combined information.

Guest68 · August 25, 2024, 5:00am

Have either of you (or anyone else) observed this issue on one of the supported distros? I have been unable to reproduce it on a live boot of either due to them both lacking ACPITZ sensors in the sysfs. Support of course wants to see it on a supported distro though, and I can’t provide.

Thomas_Weissschuh · August 25, 2024, 6:45pm

No. But given that this comes straight from ectool, talking directly to the EC, the distro really shouldn’t matter.

Guest68 · August 25, 2024, 8:45pm

Indeed, but they still ask.

Steve-Tech · August 25, 2024, 9:41pm

No sorry, but AFAIK only kernel 6.8 won’t show the ACPI sensors, so you could upgrade or downgrade the kernel. I’m not sure if support would like that though, maybe an OEM kernel would be more acceptable for them? Also it should still error in ectool.

Guest68 · August 25, 2024, 11:48pm

Ah, so it is a kernel 6.8 issue. Since I fully expect to be able to reproduce it with ectool though, that sounds like the easier option if support insists on proof from a supported distro. Thanks.

Thomas_Weissschuh · August 26, 2024, 7:31am

The issue that 0xfe from the EC is reported as 180.8 degrees is the first issue and needs newer kernels.

The issue that 0xfe is reported in the first place can be reproduced without kernel involvement through ectool.
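The arithmetic behind the first issue can be sketched as follows. The 200 K offset is the one used by the script later in this thread; the deci-kelvin step and its 2732 zero point are assumptions, chosen because they reproduce the 180.8 reading exactly, and the function name is hypothetical.

```python
EC_TEMP_SENSOR_OFFSET = 200    # EC stores (kelvin - 200) in a single byte
ACPI_DECIKELVIN_ZERO_C = 2732  # assumption: 0 °C expressed in tenths of a kelvin


def raw_to_acpi_celsius(raw: int) -> float:
    """Interpret a raw EC byte as a temperature without checking for error codes."""
    kelvin = raw + EC_TEMP_SENSOR_OFFSET
    deci_kelvin = kelvin * 10
    return (deci_kelvin - ACPI_DECIKELVIN_ZERO_C) / 10


for raw in (0xFE, 0xFF):
    print(f"0x{raw:02X} -> {raw_to_acpi_celsius(raw):.1f} °C")
```

Under these assumptions an error byte that is passed through unfiltered shows up as a temperature of roughly 181 °C instead of as a failed read.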

Some more notes about where the error comes from:

The sensor cpu@4c is of type amd,sb-tsi.
The driver code in the EC repo is in driver/temp_sensor/sb_tsi.c, the only real place where the error could come from is the call to i2c_read8().

Guest68 · September 4, 2024, 6:22pm

Interesting. Thanks for the info. I’ll relay it to support to make sure they are aware, but it’s looking like it’ll be a while until we get somewhere meaningful on that front.

I have discovered some other interesting behavior. It seems that during system load (as simulated by stress -c 16) the spurious readings do not occur, though they may be present immediately before and afterward. I have now also observed infrequent but reoccurring momentary reports of 181.8 from the local_f75303@4d sensor as well. Following your formula, I believe this yields a reading of 0xff or EC_TEMP_SENSOR_NOT_PRESENT.

These errors are not the only poor readings I’ve recorded. Following FW support’s request for psensor screenshots, I notice that cpu@4c readings also regularly spike to elevated and unrealistic values in a roughly 20-30° range around 100-130°C. local_f75303@4d occasionally does something similar as well, but the spike values I’ve seen for it appear to be more consistent at about 98°. Does anyone have an idea about what might be causing these strange readings?

Frustratingly, the naming of ACPITZ sensors differs between the sysfs, lm-sensors, and acpi, making consistent sensor identification challenging.

This relationship seems to be correct to me:

ectool           sysfs          lm-sensors  `acpi`
local_f75303@4d  thermal_zone0  temp1       Thermal 2
cpu_f75303@4d    thermal_zone1  temp2       Thermal 0
ddr_f75303@4d    thermal_zone2  temp3       Thermal 3
cpu@4c           thermal_zone3  temp4       Thermal 1
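For scripting against more than one of these interfaces, the same table can be kept as a lookup. This is only a sketch: the names are copied from the table above and may differ on other kernels or boards, and the helper names are hypothetical.

```python
from pathlib import Path

# Names copied from the table above (FW13 7840U, kernel 6.10); other
# kernels or boards may enumerate the zones differently.
SENSOR_NAMES = {
    "local_f75303@4d": {"sysfs": "thermal_zone0", "lm-sensors": "temp1", "acpi": "Thermal 2"},
    "cpu_f75303@4d": {"sysfs": "thermal_zone1", "lm-sensors": "temp2", "acpi": "Thermal 0"},
    "ddr_f75303@4d": {"sysfs": "thermal_zone2", "lm-sensors": "temp3", "acpi": "Thermal 3"},
    "cpu@4c": {"sysfs": "thermal_zone3", "lm-sensors": "temp4", "acpi": "Thermal 1"},
}


def ectool_name(tool: str, label: str) -> str:
    """Translate a label from one tool back to the ectool sensor name."""
    for name, labels in SENSOR_NAMES.items():
        if labels[tool] == label:
            return name
    raise KeyError(f"{tool} has no sensor labelled {label!r}")


def read_sysfs_celsius(name: str, root: Path = Path("/sys/class/thermal")) -> float:
    """Read a zone by ectool name; sysfs reports millidegrees Celsius."""
    zone = SENSOR_NAMES[name]["sysfs"]
    return int((root / zone / "temp").read_text()) / 1000
```

For example, `ectool_name("lm-sensors", "temp4")` gives back `cpu@4c`, the sensor that psensor plots as temp4.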

Here are two screenshots of psensor showing the above observations. The blocks of elevated temperature readings around 100° in both are from running stress. Note that psensor rounds readings to the nearest whole degree and uses lm-sensors for its data source.

Thomas_Weissschuh · September 5, 2024, 6:37pm

I guess you need that relationship for psensors?
Otherwise I would stick to the ectool names.
You can also install Linux v6.11 (rc) to get the native CrOS EC hwmon driver with the correct names and error values.
(And no firmware shenanigans in between)

Steve-Tech · September 6, 2024, 1:43am

My understanding from Guenter in the original mailing list thread is 0xff should never be returned dynamically, it should be compiled into the firmware when it literally has no support for more sensors. So it’s quite interesting that you’re getting that.

If it helps, I put together a quick python script to read the raw bytes from the memmap:

#!/usr/bin/env python3

EC_LPC_ADDR_MEMMAP = 0xE00  # 0x900 on Intel Frameworks
EC_MEMMAP_TEMP_SENSOR = 0x00
EC_TEMP_SENSOR_ENTRIES = 4  # Usually 16, but there's only 4 here anyway
EC_TEMP_SENSOR_OFFSET = 200

# /dev/port needs root!
with open("/dev/port", "rb") as f:
    f.seek(EC_LPC_ADDR_MEMMAP + EC_MEMMAP_TEMP_SENSOR)
    data = list(f.read(EC_TEMP_SENSOR_ENTRIES))
    for i, byte in enumerate(data):
        tempK = byte + EC_TEMP_SENSOR_OFFSET
        tempC = tempK - 273
        print(f"Sensor {i}: 0x{byte:02X} ({tempC}°C)")
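The script above prints every byte as a temperature, so 0xFE would show up as 181°C. A small variation that labels the two error values named in this thread could look like the sketch below. It works on bytes that have already been read, so it needs neither root nor /dev/port; the function name and output format are hypothetical.

```python
EC_TEMP_SENSOR_OFFSET = 200
# The two special values mentioned in this thread.
EC_TEMP_SENSOR_ERROR = 0xFE
EC_TEMP_SENSOR_NOT_PRESENT = 0xFF

SPECIAL = {
    EC_TEMP_SENSOR_ERROR: "EC_TEMP_SENSOR_ERROR",
    EC_TEMP_SENSOR_NOT_PRESENT: "EC_TEMP_SENSOR_NOT_PRESENT",
}


def decode_temp_bytes(data: bytes) -> list[str]:
    """Turn raw memmap temperature bytes into one readable line per sensor."""
    lines = []
    for i, raw in enumerate(data):
        if raw in SPECIAL:
            lines.append(f"Sensor {i}: 0x{raw:02X} ({SPECIAL[raw]})")
        else:
            kelvin = raw + EC_TEMP_SENSOR_OFFSET
            lines.append(f"Sensor {i}: 0x{raw:02X} ({kelvin} K, {kelvin - 273} °C)")
    return lines


# Example bytes corresponding to 308 K, 308 K, 306 K and a failed read.
for line in decode_temp_bytes(bytes([0x6C, 0x6C, 0x6A, 0xFE])):
    print(line)
```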

Guest68 · September 6, 2024, 3:50am

I guess you need that relationship for psensors?

Correct. I use the ectool names for clarity, but since this is 6.10 the chart is helpful to match with psensor. It also contributes a bit of documentation to public record that I haven’t seen anywhere else. Hopefully it’ll soon be unnecessary due to 6.11, but since it was necessary here, I might as well publish it.

You can also install Linux v6.11 (rc) to get the native CrOS EC hwmon driver with the correct names and error values.

I’ve considered it before, but put it off so far. I think I will though; I expect it’ll help a fair bit to have the direct comparison.

Guest68 · September 6, 2024, 3:57am

0xff should never be returned dynamically

Interesting. We’ll see what I find I guess.

If it helps, I put together a quick python script to read the raw bytes from the memmap

Nice, thanks.

Guest68 · September 10, 2024, 5:09am

Good news: my support ticket has reached the level of Matt Hartley and he reports that he is sending it over to the engineering team as well as putting together an in-house ticket. Hopefully this is not the end of the road and we’ll get updated information as they investigate.

I haven’t finished looking into the additional strange behaviors I noted in my message with the psensor screenshots above, but in light of this development, I’ll share some preliminary results to help refine the problem scope.

I have not been able to independently capture any of the “intermediate” values of cpu@4c that the psensor graphs seemed to report. I tested by running ectool temps all and Stephen’s python script in a Bash loop repeating every 0.5 seconds while psensor was graphing and watching for any notable value in the command output during a time window when psensor displayed frequent “half height” spikes. This isn’t exactly a scientific test, but it seemed to me to be a good enough quick sanity check to remove it from the list of definitively identified issues since there are other possible explanations. These values may instead be produced by some quirk of psensor, perhaps in averaging values or somewhere along the process of drawing the graph. They may also originate somewhere further up the chain than the EC, and consequently I wouldn’t have caught them reading only the EC interface.

Similarly, I have not been able to capture an EC reading with a value matching the 182° that psensor continues to occasionally record for local_f75303@4d. This time my bash script dumped output to a log file that I then checked with grep [1]. It is possible here too that this value originates somewhere downstream of the EC, but I have also not yet tested other interfaces to check. I currently doubt that it is being introduced by psensor itself, but this of course remains unproven. It seems more likely to me that this value is “legitimately” produced somewhere, but I have no evidence at this time. The “intermediate” values of local_f75303@4d that occasionally appear on the psensor graph may also be phantom artifacts created by psensor through the same operation as the ones for cpu@4c.
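The polling and logging described in the last two paragraphs could be sketched in Python roughly as follows. This is not the Bash loop that was actually used. It assumes ectool is on the PATH and that the script runs as root; the function names, the log file name and the "error" match are hypothetical choices.

```python
import subprocess
import time
from datetime import datetime


def read_ectool() -> str:
    """Run `ectool temps all` once and return everything it printed."""
    result = subprocess.run(
        ["ectool", "temps", "all"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error lines in the same stream
        text=True,
    )
    return result.stdout


def poll(read=read_ectool, samples=10, interval=0.5, log_path="ec_temps.log"):
    """Log every line with a timestamp and return the lines that mention an error."""
    hits = []
    with open(log_path, "a") as log:
        for _ in range(samples):
            stamp = datetime.now().isoformat(timespec="milliseconds")
            for line in read().splitlines():
                log.write(f"{stamp} {line}\n")
                if "error" in line.lower():
                    hits.append(f"{stamp} {line}")
            time.sleep(interval)
    return hits
```

Logging every line before filtering follows the approach in the footnote: a bug in the filter then cannot hide a reading, because the full log can still be searched afterwards.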

More testing is needed to better pare down the possibilities for these two/three issues, but I probably won’t have the time to do it myself for a little while. Hopefully now that the engineering team is involved, they’ll be able to reproduce and debug these issues more formally than I can do in my spare time.

[1] I originally parsed before logging, but after several psensor spikes to 182 for which I caught nothing, I just logged everything to make sure it wasn’t a bug in the script.

pierce · December 15, 2024, 6:43pm

I’ve seen the “cpu@4c” “Sensor 3 error” / fault sporadically in the past, but starting with linux-6.11 (I think) where the cros_ec module properly supports my AMD FW13, and cros_ec_hwmon loads by default and gets these sensor readings where standard tools can see them (lm_sensors etc), I see long runs of constant failure of this sensor, generally until suspend/resume. I was able to avoid this by blacklisting cros_ec_hwmon in /etc/modprobe.d/.

So, my guess is that this sensor is more likely to fail if it is frequently accessed. Perhaps this is because the i2c bus it is on, or the sensor chip’s i2c iface, has a race-condition kind of fault, and the more you access it the more chances of the bus or chip “locking up”, and then suspend/resume can reset it.

As noted elsewhere, when this fault / “Sensor 3 error” is persistent, the “autofanctrl” in the EC also stops working. So that also reads this sensor regularly, but it usually doesn’t cause the persistent error/fault. So maybe there’s a race-condition between on-demand query to the sensor, via ectool (or the hwmon module), and the EC “autofanctrl” query to the same sensor?

Charlie_6 · December 16, 2024, 7:30am

I found that the “Sensor 3 error” happens mostly when the CPU is at 29C. This is not caused by suspend/resume in my opinion.

When the ambient temperature is temperate or warm, the local_f75303@4d temperature is maintained at about 40C to 43C, with the fan alternating between stopped and spinning at its lowest RPM during light tasks or idling. The CPU produces most of the heat, so naturally cpu@4c is hotter than local_f75303@4d and it’s unlikely to find the error message by coincidence.

If the computer is started after a previous shutdown, or resumed from hibernation, the initial startup tasks will bring the CPU above 40C, so there is no “Sensor 3 error” in this case.

However, if the computer is resumed from suspend, the initial temperature is only a little bit higher than ambient if it had been suspended for a while, and since the 7840U FW13 uses s2idle, there’s very little power consumed and heat produced when resuming from suspend. If the weather is temperate or cool (18~25C), the CPU might be at exactly 29C, showing the error.

When doing a demanding task, the CPU temperature will increase to much higher than 29C. The fan control will resume working immediately. Thus, even if the “Sensor 3 error” is preventing autofanctrl, the CPU won’t overheat.

Initial heating after starting up

$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       310 K (= 37 C)           0% (313 K and 343 K)
cpu_f75303@4d         310 K (= 37 C)           0% (319 K and 327 K)
ddr_f75303@4d         311 K (= 38 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                310 K (= 37 C)           0% (376 K and 378 K)
$ sudo ectool temps all 
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       305 K (= 32 C)           0% (313 K and 343 K)
cpu_f75303@4d         305 K (= 32 C)           0% (319 K and 327 K)
ddr_f75303@4d         306 K (= 33 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                303 K (= 30 C)           0% (376 K and 378 K)

idling at 24C outdoor with light wind

$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       303 K (= 30 C)           0% (313 K and 343 K)
cpu_f75303@4d         304 K (= 31 C)           0% (319 K and 327 K)
ddr_f75303@4d         305 K (= 32 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                303 K (= 30 C)           0% (376 K and 378 K)
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       303 K (= 30 C)           0% (313 K and 343 K)
cpu_f75303@4d         304 K (= 31 C)           0% (319 K and 327 K)
ddr_f75303@4d         304 K (= 31 C)        N/A (fan_off=401 K, fan_max=401 K)
Sensor 3 error
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       304 K (= 31 C)           0% (313 K and 343 K)
cpu_f75303@4d         305 K (= 32 C)           0% (319 K and 327 K)
ddr_f75303@4d         305 K (= 32 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                306 K (= 33 C)           0% (376 K and 378 K)
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       303 K (= 30 C)           0% (313 K and 343 K)
cpu_f75303@4d         304 K (= 31 C)           0% (319 K and 327 K)
ddr_f75303@4d         304 K (= 31 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                303 K (= 30 C)           0% (376 K and 378 K)
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       303 K (= 30 C)           0% (313 K and 343 K)
cpu_f75303@4d         304 K (= 31 C)           0% (319 K and 327 K)
ddr_f75303@4d         304 K (= 31 C)        N/A (fan_off=401 K, fan_max=401 K)
Sensor 3 error

btop showed CPU at exactly 29C when “Sensor 3 error” happens.

Wind picked up, CPU cooled to 28C

$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       302 K (= 29 C)           0% (313 K and 343 K)
cpu_f75303@4d         302 K (= 29 C)           0% (319 K and 327 K)
ddr_f75303@4d         303 K (= 30 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                301 K (= 28 C)           0% (376 K and 378 K)
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       302 K (= 29 C)           0% (313 K and 343 K)
cpu_f75303@4d         303 K (= 30 C)           0% (319 K and 327 K)
ddr_f75303@4d         303 K (= 30 C)        N/A (fan_off=401 K, fan_max=401 K)
Sensor 3 error
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       302 K (= 29 C)           0% (313 K and 343 K)
cpu_f75303@4d         302 K (= 29 C)           0% (319 K and 327 K)
ddr_f75303@4d         303 K (= 30 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                301 K (= 28 C)           0% (376 K and 378 K)
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       302 K (= 29 C)           0% (313 K and 343 K)
cpu_f75303@4d         302 K (= 29 C)           0% (319 K and 327 K)
ddr_f75303@4d         303 K (= 30 C)        N/A (fan_off=401 K, fan_max=401 K)
Sensor 3 error
$ sudo ectool temps all
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       303 K (= 30 C)           0% (313 K and 343 K)
cpu_f75303@4d         303 K (= 30 C)           0% (319 K and 327 K)
ddr_f75303@4d         303 K (= 30 C)        N/A (fan_off=401 K, fan_max=401 K)
cpu@4c                303 K (= 30 C)           0% (376 K and 378 K)
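Transcripts like the ones above can be tallied mechanically instead of by eye. The sketch below assumes the `ectool temps all` output format shown in this post; the `parse_temps` name and the `sensorN` keys used for failed reads are hypothetical.

```python
import re

# "name   303 K (= 30 C) ..." for a good read, "Sensor 3 error" for a failed one.
READING = re.compile(r"^(\S+)\s+(\d+) K \(= (-?\d+) C\)")
ERROR = re.compile(r"^Sensor (\d+) error")


def parse_temps(output: str) -> dict:
    """Map each sensor to its reading in kelvin, or to None if the read failed."""
    result = {}
    for line in output.splitlines():
        if m := READING.match(line):
            result[m.group(1)] = int(m.group(2))
        elif m := ERROR.match(line):
            # ectool only prints the index for a failed sensor, not its name
            result[f"sensor{m.group(1)}"] = None
    return result


sample = """\
--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       303 K (= 30 C)           0% (313 K and 343 K)
cpu_f75303@4d         304 K (= 31 C)           0% (319 K and 327 K)
ddr_f75303@4d         304 K (= 31 C)        N/A (fan_off=401 K, fan_max=401 K)
Sensor 3 error
"""
print(parse_temps(sample))
```

Collecting these dictionaries over many runs would make it possible to check how often a failed read follows or precedes a particular cpu@4c value.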

pierce · December 16, 2024, 8:35am

Not in my experience. When I noticed the sensor was persistently faulty, it was because my system had gotten quite hot while I was playing a game, with CPU around 100c for some minutes. One time the fan was “stuck” at 0 rpm, and another time it was “stuck” at 2000 rpm (about 20% by pwm, though max rpm is about 7000), because that’s what it was at when the sensor stopped working, and the fan control thus stopped working. The last successful read of “cpu@4c” temp must have been around 60c in that 2000 rpm scenario.

gammy · February 17, 2025, 11:33pm

Hello all, I just discovered this thread and what a fascinating thread it is. I’m also sorry for what I’m about to write, as I realize I’m going to sound like a blathering idiot compared to your detailed measurements (I have none, I’m sorry).

I recently (today in fact) replaced my fan (I received a new one from Framework after some wrangling) and from what I can tell the most likely cause of the fault (from a mechanical perspective) is that the lubricant has degraded, which is likely due to prolonged periods of high temperatures. This correlates very well with the thermal readout problem:

First off, suspend has never worked for me (but I’m mostly on Debian sid for … reasons) so my laptop is turned on for days/weeks on-end (unless it crashes); very rarely do I shut it down, so I don’t correlate the behaviour with suspend or “temperature-shift due to wake or boot”. I’ve also been in an environment with a very stable temperature; the variance is my workload, being sporadic bursts of compiling code and rendering video with ffmpeg.

I’ve not - as will now become apparent - investigated this issue rigorously; it’s only now that I’m connecting the dots. I’ve been aware of this issue for a while: I started working on a little ectool-wrapper (ecfantemp) a couple of months ago when I noticed that my laptop kept getting hot (this being quite some time after my fan had started making rattling sounds). My assumption was that the parameters used by the fan-speed algorithm were unsuited for my workloads, and in wanting to tweak the parameters frequently to find a good balance by hand, I wanted a down-to-earth tool to do it with minimal clutter.

Long story short, hacking on this tool meant I very frequently read values from the sensors (my wrapper also having a ‘watch’-mode), and quite frequently looked at those values and the fan speed. It was then that I discovered the cpu@4c read errors, along with something else: multiple sensors reporting the same temperature, which then stop updating until after a reboot. The temperature shared between the sensors hasn’t been the same value across reboots: it’s been between - recalling from memory - 303 K (29.85 C) - 314 K (40.85 C). ectool would have reported these values as 30-41 C, as it subtracts a whole 273 rather than 273.15, but these values have definitely been incorrect; the reason for me discovering them being because my lap kept getting burned from the fan not spinning! A (real) example of the only data I have on hand right now (taken from an angry message I sent to a friend):

--sensor name -------- temperature -------- ratio (fan_off and fan_max) --
local_f75303@4d       314 K (= 41 C)           3% (313 K and 343 K)
cpu_f75303@4d         314 K (= 41 C)           0% (319 K and 327 K)

The reason for the fan not spinning up becomes obvious: the ‘stuck’ temperature has been below or just-below my ‘fan off’ temperature of 313K! Unlike @pierce I have never observed the fan maxing out, or the values being stuck “high”: it’s always been the opposite for me.
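The ratio column in that output is consistent with a linear interpolation between the fan_off and fan_max thresholds. The sketch below assumes that model; it is inferred from the column header and the 3% figure, not taken from the ectool source, and the function name is hypothetical.

```python
def fan_ratio_percent(temp_k, fan_off_k, fan_max_k):
    """Percentage of the way from fan_off to fan_max, or None if the range is empty."""
    if fan_max_k <= fan_off_k:
        return None  # e.g. the ddr sensor, where both thresholds are 401 K
    if temp_k <= fan_off_k:
        return 0
    if temp_k >= fan_max_k:
        return 100
    return 100 * (temp_k - fan_off_k) // (fan_max_k - fan_off_k)


print(fan_ratio_percent(314, 313, 343))  # local_f75303@4d row
print(fan_ratio_percent(314, 319, 327))  # cpu_f75303@4d row
```

Under this model a sensor stuck at 314 K asks for only about 3% from one threshold pair and 0% from the other, which fits a fan that barely spins, if at all.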

I’ll keep a much closer eye on my readings from now on to see if I can help to provide more concrete information.
Thanks all for your posts, very interesting reading.

gammy · March 5, 2025, 12:13pm

A new BIOS/EC beta was released today:

One of the fixes mentioned may be related:

Fix ACPI thermal_zone3 reporting incorrect values occasionally.

On my laptop, the update (going from the previous 3.06 beta) means:

       EC     BIOS
From   f666c  JFP30.03.06
To     55046  JFP30.03.07

Indeed, EC f666c has been followed by 55046.
Let’s see how it fares.
