[RESPONDED] FW 13 7840U ACPI thermal readout problem

I hope to have answered all of them in a useful way.
If you have more questions or ideas let me know.

Yes, that was informative and helpful. Thanks.

Most likely by looking at the logs and EC source code.

I’ll start digging through the source for the EC and ectool when I have time then.

The fix needs to come from the ACPI supplier, going through Framework.

Sounds like Framework would have to chase the fix for this then. I guess I’ll create a support ticket or something when I have more info.

I wrote the hwmon driver for the CrOS EC

Nice. I was not aware of this driver; it looks useful. Unfortunately it seems my kernel was not shipped with it though, so I guess I’ll be compiling it soon.

The ACPI firmware exposes these readings under standard ACPI interfaces so it works everywhere.
But for example it misses the labels for the sensors, which my driver also exposes.
And there is data for which no standard ACPI interfaces may exist, so a dedicated driver makes sense.

Ah, so there are two interfaces.
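As an aside, once a hwmon driver is bound (any hwmon driver, not just the CrOS EC one), the labels show up through the standard hwmon sysfs attributes. A rough sketch of what that looks like, assuming nothing beyond the generic tempN_label/tempN_input layout:

#!/usr/bin/env python3
# Walk the standard hwmon sysfs tree and print labelled temperature inputs.
import glob
import os

for hwmon in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
    with open(os.path.join(hwmon, "name")) as f:
        print(f"{os.path.basename(hwmon)}: {f.read().strip()}")
    for label_path in sorted(glob.glob(os.path.join(hwmon, "temp*_label"))):
        with open(label_path) as f:
            label = f.read().strip()
        try:
            with open(label_path.replace("_label", "_input")) as f:
                milli_c = int(f.read().strip())
            print(f"  {label}: {milli_c / 1000:.1f} °C")
        except OSError:
            # Drivers may return an error for a faulted sensor read
            print(f"  {label}: read error")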



I found some interesting information in the message thread about your v2 patches while looking into the driver you wrote. You’ll have already read it, but I’ll reproduce it here for completeness’ sake.

Stephen Horvath:
Oh I see, I haven’t played around with the temp sensors until now, but I
can confirm the last temp sensor (cpu@4c / temp4) will randomly (every
~2-15 seconds) return EC_TEMP_SENSOR_ERROR (0xfe).
Unplugging the charger doesn’t seem to have any impact for me.
The related ACPI sensor also says 180.8°C.
I’ll probably create an issue or something shortly.
- [v2,1/2] hwmon: add ChromeOS EC driver - Patchwork
(corroborated by Guenter Roeck in the following message as well)

This matches my experience, so I guess this is a known issue. Running ectool temps all while the problem occurs reports Sensor 3 error, so at least the EC handles the error even if ACPI doesn’t.

Would you happen to know what the status of his plan to create an issue is? It sounds to me like he’s referring to reporting this to Framework, so perhaps they’re already aware. If so, following or joining whatever existing effort there may be to track down the issue sounds beneficial.

It will only be part of v6.11, so no kernel shipped with it yet.
Backporting it will be a bit annoying because it also requires new utility functions and the MFD bits.

Both the ACPI firmware and the Linux driver read the data from the EC through the same interface. Same for ectool.
(I’m decently sure)

Good find, I forgot about that one.
No idea what became of the plan.

It will only be part of v6.11, so no kernel shipped with it yet.

Ah, of course. I don’t really plan to compile and run an rc kernel, so I guess I’ll get it at release.

Same for ectool.
(I’m decently sure)

Hm. I don’t think I understand how the ACPI firmware is related to the EC. I imagined two separate interfaces to the hardware sensors, an EC interface and an ACPI interface, and assumed the former correctly detected the error while the latter did not, since ectool (EC interface) reported the error where acpi (ACPI interface) did not.
It seems like you are saying that the EC is the root source of the measurement, and that it exposes the temp sensor data (including the error value) to the ACPI firmware, which then incorrectly reports the error value as a valid temperature.
Something like this:

hardware probe ─> EC ─> ACPI firmware
                  |     └─> interface (read by Linux kernel ACPI driver)
                  └─> interface (read by ectool and the ec_* drivers)

What is the correct relationship?

No idea what became of the plan.

Do you know if he has an account here that I could message him to ask? If not, I suppose I’ll email and ask? Unless it’s better if you do it.

Exactly.

ACPI is only a bytecode definition that can be used to map standard data structures and interfaces to the concrete hardware implementations on the platform.
As the sensor is hooked up to the EC, the ACPI functions also read the EC memory map.
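For what it’s worth, the ACPI side is easy to poke at directly: the ACPI thermal objects surface as acpitz thermal zones under /sys/class/thermal, with temperatures in millidegrees Celsius. A quick sketch, assuming nothing beyond standard sysfs:

#!/usr/bin/env python3
# Read the ACPI thermal zones the kernel exposes under sysfs.
# The bogus 180.8 °C value discussed in this thread would show up here as 180800.
import glob
import os

for zone in sorted(glob.glob("/sys/class/thermal/thermal_zone*")):
    with open(os.path.join(zone, "type")) as f:
        ztype = f.read().strip()
    with open(os.path.join(zone, "temp")) as f:
        milli_c = int(f.read().strip())
    print(f"{os.path.basename(zone)} ({ztype}): {milli_c / 1000:.1f} °C")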

Sounds good. You could even respond to the original mail.

Exactly.

Great.

Sounds good.

Will do.

Looking into the EC log has not been helpful. I wasn’t sure what to make of the four-digit groupings of the port 80 / POST codes, but no matter how I parse them, most are not on the list shared by NRP, which may well be out of date and may not apply to the AMD boards even if it is current. One post suggested that the codes may be in LSB order, but that didn’t seem to help here. It’s entirely possible I’m not reading them correctly, but I don’t know what else to look for.

POST codes aside, none of the other lines were particularly meaningful to me either.

[xxxxxx.xxxxxx SB-SMI: Mailbox transfer timeout]
[xxxxxx.xxxxxx SB-RMI Error: 4]

Something AMD platform related (kernel driver docs for the interface); doesn’t look relevant.

[xxxxxx.xxxxxx HC 0x0115 err 1]

A “host command” related to reading and deleting PD logs.

[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]

Produced on each run of ectool console to fetch the EC version and read protocol info.


Additionally, I read the logs when there was no error, waited a few seconds for a sensor read to fail, and reran the command; no new log entries were generated by the failed read, as demonstrated by the log ending in

[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]
[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]

If the POST codes are written to the log out of chronological order with other entries (due to polling or something), then it’s possible my “window test” wouldn’t have captured the relevant entries, but the error happens so often that it should have shown up earlier in the logs anyway.
Perhaps the EC only emits log entries for repeated failed reads after a particular interval to prevent flooding the buffer, but I haven’t caught anything that stood out to me.


Unless someone else has another idea, I think all that’s left for me to do is to see if Stephen Horvath has anything important to add and then contact Framework support.

Hi Guest68, I just got your email.

Sorry, I never got a chance to create an issue. I got distracted with trying to learn how the EC works and maybe fix it myself, and then forgot about it altogether.

I did however start writing a draft in my notes app if it helps:


Hi, the sensor cpu@4c seems to intermittently return EC_TEMP_SENSOR_ERROR (0xfe). The same sensor using ACPI also returns 180.8°C while this is happening, which seems very wrong.

It seems to occur more frequently while unplugged from the charger, but it can occur while charging.

I’m not too experienced with the CrOS EC codebase, but I can’t find anything obvious in the code that would cause it. So I’m hoping it’s not an interference/hardware issue, but I wouldn’t be surprised if it is.

This also seems to occur for @t-8ch while discussing a Linux hwmon driver.

Command Outputs:

sensors acpitz-acpi-0

acpitz-acpi-0
Adapter: ACPI interface
local_f75303@4d:  +32.8°C
cpu_f75303@4d:    +32.8°C
ddr_f75303@4d:    +30.8°C
cpu@4c:          +180.8°C

ectool temps all

--sensor name -------- temperature -------- fan speed --
local_f75303@4d       307 K (= 34 C)           0%
cpu_f75303@4d         306 K (= 33 C)           0%
ddr_f75303@4d         304 K (= 31 C)          -1%
Sensor 3 error

A Python script using my CrOS_EC_Python.

Temp Sensor 0: 308K (35°C)
Temp Sensor 1: 308K (35°C)
Temp Sensor 2: 306K (33°C)
Temp Sensor 3: Error 0xfe

Thanks for the input. Interestingly I also observe the issue primarily when on battery power, but I hadn’t made that connection yet. I’ll put together a support ticket submission soon with the combined information.


Have either of you (or anyone else) observed this issue on one of the supported distros? I have been unable to reproduce it on a live boot of either, since both lack the ACPITZ sensors in sysfs. Support of course wants to see it on a supported distro though, and I can’t provide that.

No. But given that this comes straight from ectool, talking directly to the EC, the distro really shouldn’t matter.

Indeed, but they still ask.

No sorry, but AFAIK only kernel 6.8 won’t show the ACPI sensors, so you could upgrade or downgrade the kernel. I’m not sure if support would like that though; maybe an OEM kernel would be more acceptable to them? Also, it should still error in ectool.

Ah, so it is a kernel 6.8 issue. Since I fully expect to be able to reproduce it with ectool though, that sounds like the easier option if support insists on proof from a supported distro. Thanks.

The first issue, that 0xfe from the EC is reported as 180.8 degrees, needs newer kernels.

The second issue, that 0xfe is reported in the first place, can be reproduced without kernel involvement through ectool.

Some more notes about where the error comes from:

The sensor cpu@4c is of type amd,sb-tsi.
The driver code in the EC repo is in driver/temp_sensor/sb_tsi.c; the only real place the error could come from is the call to i2c_read8().

Interesting. Thanks for the info. I’ll relay it to support to make sure they are aware, but it’s looking like it’ll be a while until we get somewhere meaningful on that front.

I have discovered some other interesting behavior. It seems that during system load (as simulated by stress -c 16) the spurious readings do not occur, though they may be present immediately before and afterward. I have now also observed infrequent but recurring momentary reports of 181.8° from the local_f75303@4d sensor as well. Following your formula, I believe this corresponds to a reading of 0xff, or EC_TEMP_SENSOR_NOT_PRESENT.
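Working the numbers with the 200 K memmap offset (see Stephen’s script further down) reproduces both bogus values exactly, assuming the conversion to Celsius uses a 273.2 K offset, which is my guess rather than anything confirmed:

#!/usr/bin/env python3
# Decode the two EC error bytes as if they were temperatures.
# The 273.2 K Celsius offset is an assumption that happens to land
# exactly on the 180.8 and 181.8 readings seen in this thread.
EC_TEMP_SENSOR_OFFSET = 200  # K, EC memmap convention

for byte, name in [(0xFE, "EC_TEMP_SENSOR_ERROR"),
                   (0xFF, "EC_TEMP_SENSOR_NOT_PRESENT")]:
    temp_k = byte + EC_TEMP_SENSOR_OFFSET
    temp_c = temp_k - 273.2
    print(f"0x{byte:02X} ({name}): {temp_k} K -> {temp_c:.1f} °C")

# Prints:
# 0xFE (EC_TEMP_SENSOR_ERROR): 454 K -> 180.8 °C
# 0xFF (EC_TEMP_SENSOR_NOT_PRESENT): 455 K -> 181.8 °C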

These errors are not the only poor readings I’ve recorded. While gathering the psensor screenshots FW support requested, I noticed that cpu@4c readings also regularly spike to elevated, unrealistic values scattered over a roughly 20-30° band around 100-130°C. local_f75303@4d occasionally does something similar as well, but the spike values I’ve seen for it appear to be more consistent, at about 98°. Does anyone have an idea about what might be causing these strange readings?

Frustratingly, the naming of ACPITZ sensors differs between the sysfs, lm-sensors, and acpi, making consistent sensor identification challenging.

This relationship seems to be correct to me:

ectool             sysfs           lm-sensors   acpi
local_f75303@4d    thermal_zone0   temp1        Thermal 2
cpu_f75303@4d      thermal_zone1   temp2        Thermal 0
ddr_f75303@4d      thermal_zone2   temp3        Thermal 3
cpu@4c             thermal_zone3   temp4        Thermal 1

Here are two screenshots of psensor showing the above observations. The blocks of elevated temperature readings around 100° in both are from running stress. Note that psensor rounds readings to the nearest whole degree and uses lm-sensors as its data source.

I guess you need that relationship for psensor?
Otherwise I would stick to the ectool names.
You can also install Linux v6.11 (rc) to get the native CrOS EC hwmon driver with the correct names and error values.
(And no firmware shenanigans in between)

My understanding from Guenter in the original mailing list thread is that 0xff should never be returned dynamically; it should only be compiled into the firmware when it literally has no support for more sensors. So it’s quite interesting that you’re getting that.

If it helps, I put together a quick python script to read the raw bytes from the memmap:

#!/usr/bin/env python3

EC_LPC_ADDR_MEMMAP = 0xE00  # 0x900 on Intel Frameworks
EC_MEMMAP_TEMP_SENSOR = 0x00
EC_TEMP_SENSOR_ENTRIES = 4  # Usually 16, but there's only 4 here anyway
EC_TEMP_SENSOR_OFFSET = 200
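# Note: 0xfe (EC_TEMP_SENSOR_ERROR) and 0xff (EC_TEMP_SENSOR_NOT_PRESENT)
# are status codes, not real temperatures.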

# /dev/port needs root!
with open("/dev/port", "rb") as f:
    f.seek(EC_LPC_ADDR_MEMMAP + EC_MEMMAP_TEMP_SENSOR)
    data = list(f.read(EC_TEMP_SENSOR_ENTRIES))
    for i, byte in enumerate(data):
        tempK = byte + EC_TEMP_SENSOR_OFFSET
        tempC = tempK - 273
        print(f"Sensor {i}: 0x{byte:02X} ({tempC}°C)")

I guess you need that relationship for psensor?

Correct. I use the ectool names for clarity, but since this is 6.10 the chart is helpful for matching against psensor. It also contributes a bit of documentation to the public record that I haven’t seen anywhere else. Hopefully it’ll soon be unnecessary thanks to 6.11, but since it was necessary here, I might as well publish it.

You can also install Linux v6.11 (rc) to get the native CrOS EC hwmon driver with the correct names and error values.

I’ve considered it before, but put it off so far. I think I will though; I expect it’ll help a fair bit to have the direct comparison.

0xff should never be returned dynamically

Interesting. We’ll see what I find I guess.

If it helps, I put together a quick python script to read the raw bytes from the memmap

Nice, thanks.

Good news: my support ticket has reached the level of Matt Hartley and he reports that he is sending it over to the engineering team as well as putting together an in-house ticket. Hopefully this is not the end of the road and we’ll get updated information as they investigate.

I haven’t finished looking into the additional strange behaviors I noted in my message with the psensor screenshots above, but in light of this development, I’ll share some preliminary results to help refine the problem scope.

I have not been able to independently capture any of the “intermediate” values of cpu@4c that the psensor graphs seemed to report. I tested by running ectool temps all and Stephen’s Python script in a bash loop every 0.5 seconds while psensor was graphing, watching for any notable value in the command output during a time window when psensor displayed frequent “half height” spikes. This isn’t exactly a scientific test, but it seemed like a good enough quick sanity check to remove these values from the list of definitively identified issues, since there are other possible explanations. They may instead be produced by some quirk of psensor, perhaps in averaging values or somewhere along the process of drawing the graph. They may also originate somewhere further up the chain than the EC, in which case I wouldn’t have caught them reading only the EC interface.

Similarly, I have not been able to capture an EC reading matching the 182° that psensor continues to occasionally record for local_f75303@4d. This time my bash script dumped output to a log file that I then checked with grep [1]. It is possible here too that this value originates somewhere downstream of the EC, but I have not yet tested the other interfaces to check. I currently doubt that it is being introduced by psensor itself, but that of course remains unproven. It seems more likely to me that this value is “legitimately” produced somewhere, but I have no evidence at this time. The “intermediate” values of local_f75303@4d that occasionally appear on the psensor graph may also be phantom artifacts created by psensor through the same mechanism as the ones for cpu@4c.
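For reference, the poll-and-log step looks roughly like this as a small Python sketch; the actual runs used a bash loop, and the log path here is just a placeholder:

#!/usr/bin/env python3
# Poll `ectool temps all` (needs root) every half second and append
# timestamped output to a log for later inspection with grep.
# The log path is a placeholder; the 0.5 s interval matches what I used.
import subprocess
import time
from datetime import datetime

LOG_PATH = "ec_temp_poll.log"
INTERVAL_S = 0.5

with open(LOG_PATH, "a") as log:
    while True:
        result = subprocess.run(["ectool", "temps", "all"],
                                capture_output=True, text=True)
        log.write(f"--- {datetime.now().isoformat(timespec='seconds')} ---\n")
        log.write(result.stdout)
        if result.stderr:
            log.write(result.stderr)
        log.flush()
        time.sleep(INTERVAL_S)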

More testing is needed to better pare down the possibilities for these two/three issues, but I probably won’t have the time to do it myself for a little while. Hopefully now that the engineering team is involved, they’ll be able to reproduce and debug these issues more formally than I can do in my spare time.

[1] I originally parsed before logging, but after several psensor spikes to 182 for which I caught nothing, I just logged everything to make sure it wasn’t a bug in the script.
