Thanks for the response; that’s a good thought. I tried with a live boot of both Fedora 40 and Ubuntu 24.04 (both used 6.8 series kernels), but unfortunately neither appeared to detect any acpitz sensors. Consequently the only devices in /sys/class/thermal/ were cooling_device{0..15}, and acpi -t produced no output.
Is there something I’m missing here?
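For anyone repeating this check, here is a minimal Python sketch that lists whatever thermal zones the kernel exposes. It assumes the standard sysfs layout, where each thermal_zoneN directory holds a type file and a temp file in millidegrees Celsius. The function name and the root parameter are made up for illustration.

```python
from pathlib import Path


def list_thermal_zones(root="/sys/class/thermal"):
    """Return (zone name, type, millidegrees C) for each thermal_zone* under root."""
    zones = []
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            temp = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Zone is unreadable or returned something non-numeric; skip it.
            continue
        zones.append((zone.name, kind, temp))
    return zones
```

An empty list corresponds to what was seen on the live boots: only cooling_device entries and no acpitz zones.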
What does the sensors command say for both distros?
It seems “.txt” is not an authorized file extension, so here’s the copy-paste of the three.
Artix
ucsi_source_psy_USBC000:003-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +1.50 A)
ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 680.00 mA (max = +0.00 A)
ucsi_source_psy_USBC000:004-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 0.00 A (max = +0.00 A)
amdgpu-pci-c100
Adapter: PCI adapter
vddgfx: 857.00 mV
vddnb: 652.00 mV
edge: +38.0°C
PPT: 6.22 W (avg = 4.12 W)
BAT1-acpi-0
Adapter: ACPI interface
in0: 15.49 V
curr1: 386.00 mA
mt7921_phy0-pci-0100
Adapter: PCI adapter
temp1: +36.0°C
ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +1.50 A)
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +39.9°C
nvme-pci-0200
Adapter: PCI adapter
Composite: +33.9°C (low = -273.1°C, high = +89.8°C)
(crit = +94.8°C)
Sensor 1: +33.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +33.9°C (low = -273.1°C, high = +65261.8°C)
acpitz-acpi-0
Adapter: ACPI interface
temp1: +38.8°C
temp2: +38.8°C
temp3: +38.8°C
temp4: +180.8°C
Fedora
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +47.9°C
ucsi_source_psy_USBC000:004-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 0.00 A (max = +0.00 A)
ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +1.50 A)
nvme-pci-0200
Adapter: PCI adapter
Composite: +33.9°C (low = -273.1°C, high = +89.8°C)
(crit = +94.8°C)
Sensor 1: +33.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +33.9°C (low = -273.1°C, high = +65261.8°C)
mt7921_phy0-pci-0100
Adapter: PCI adapter
temp1: +38.0°C
amdgpu-pci-c100
Adapter: PCI adapter
vddgfx: 731.00 mV
vddnb: 653.00 mV
edge: +41.0°C
PPT: 5.17 W (avg = 6.09 W)
ucsi_source_psy_USBC000:003-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +1.50 A)
ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 680.00 mA (max = +0.00 A)
BAT1-acpi-0
Adapter: ACPI interface
in0: 15.50 V
curr1: 602.00 mA
Ubuntu
mt7921_phy0-pci-0100
Adapter: PCI adapter
temp1: +30.0°C
ucsi_source_psy_USBC000:004-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 680.00 mA (max = +0.00 A)
ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +1.50 A)
nvme-pci-0200
Adapter: PCI adapter
Composite: +33.9°C (low = -273.1°C, high = +89.8°C)
(crit = +94.8°C)
Sensor 1: +33.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +27.9°C (low = -273.1°C, high = +65261.8°C)
amdgpu-pci-c100
Adapter: PCI adapter
vddgfx: 679.00 mV
vddnb: 651.00 mV
edge: +37.0°C
PPT: 4.21 W (avg = 4.23 W)
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +39.6°C
ucsi_source_psy_USBC000:003-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +1.50 A)
ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 0.00 A (max = +0.00 A)
BAT1-acpi-0
Adapter: ACPI interface
in0: 14.81 V
curr1: 535.00 mA
The erratic reporting seems to happen intermittently as well. For several hours everything will be fine, but then at some point the sensor value starts cycling between (presumably) good and bad readings. After a while it will return to working normally again. I don’t see any particular pattern to when it’s reliable or erratic, based on temp or anything else (at least, it appears that elevating the temperature with a large project compile or stress does not induce the behavior).
Interestingly all of the temperature readings I have seen from the acpitz sensors end in .8. Is this just a quirk of the hardware?
Hey, @Loell_Framework, just wanted to check on this since it’s been a bit and I haven’t heard back from you. Is there anything else I should try? Did the sensors output have any useful info?
After checking the Discord group, it appears several other users have noticed the same issue on other (Intel) models (at least 12th gen) as well. The oldest post I found was from 2022/10/16. A search for 179.8 or 180.8 will yield relevant results. I also found one other forum post referencing the same elevated reading, but there was no additional info there.
In response to one user’s recent (2024/08/04) question about whether a 179.8° ACPITZ reading is normal, Dustin Howett provided this insight:
probably slightly normal at least until kernel 6.10 or 6.11
(spurious readings from the memory-mapped I/O region of the embedded controller)
assuming it returns to normal shortly after
if not: uh no
- source (requires an account to view)
Later he double checked and confirmed that the fix is scheduled for a 6.11 release with no backport to 6.10.
So it seems this has been a Linux kernel issue affecting several (all?) FW 13 models for quite a while, but will soon be resolved.
Ah, sorry! I didn’t realize that folks were experiencing this with the AMD platforms.
The MMIO fix was not required on the AMD Framework Laptops because they were not susceptible to that specific issue.
Unfortunately, that means that 6.11 will bring no relief for folks suffering this issue on AMD and that the root cause is still unknown.
Well that was a quick response. And unfortunate. Can you provide any insight about what’s going on here, with either the Intel or AMD versions? Why might they exhibit the same issue, but have different causes?
Also, for you or whoever else may be looking into this in the future, let me know if there is anything I can provide to assist with identifying or resolving the issue on the AMD side. Unless you think it’s hardware related?
The value 180800 looks suspicious.
The original value comes from the ChromeOS EC and gets read by the application processor (Linux) via a shared memory segment.
It is a single byte which gets transformed by the following formula into the millicelsius value you see in sysfs:
(x + 200) * 1000 - KELVIN_TO_CELSIUS_OFFSET
KELVIN_TO_CELSIUS_OFFSET is 273150 in the Linux kernel, but here the calculation is done by the ACPI firmware which seems to use 273200.
Then for x = 0xfe we get the observed value 180800.
0xfe in turn is a special value meaning EC_TEMP_SENSOR_ERROR.
So there are two issues:
- The EC fails to read the sensor. (Maybe the EC logs help investigating)
- The ACPI firmware incorrectly reports an error value as a real result value. This should be fixed in the firmware.
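The arithmetic above can be checked with a short Python sketch. The constants are the ones quoted in this post. The function names are invented here, and decode_checked only illustrates what "treat 0xfe as an error" could look like; it is not how the firmware or any driver is actually written.

```python
EC_TEMP_SENSOR_OFFSET = 200      # the "+ 200" from the formula above
EC_TEMP_SENSOR_ERROR = 0xFE      # special error byte named above
FIRMWARE_KELVIN_OFFSET = 273200  # what the ACPI firmware seems to use
KERNEL_KELVIN_OFFSET = 273150    # value quoted for the Linux kernel


def ec_byte_to_millicelsius(x, kelvin_offset=FIRMWARE_KELVIN_OFFSET):
    """Apply (x + 200) * 1000 - offset without any error check."""
    return (x + EC_TEMP_SENSOR_OFFSET) * 1000 - kelvin_offset


def decode_checked(x, kelvin_offset=FIRMWARE_KELVIN_OFFSET):
    """Hypothetical variant: report the error byte as None, not a temperature."""
    if x == EC_TEMP_SENSOR_ERROR:
        return None
    return ec_byte_to_millicelsius(x, kelvin_offset)


print(ec_byte_to_millicelsius(0xFE))                        # 180800
print(ec_byte_to_millicelsius(0xFE, KERNEL_KELVIN_OFFSET))  # 180850
print(decode_checked(0xFE))                                 # None
```

So the bogus 180.8°C is exactly the error byte pushed through the normal conversion with the 273200 offset.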
I see. Thanks for the informative reply. Sounds like the place to start my investigation is the EC then.
To that end:
- How do I troubleshoot the EC failing to read the sensor? Does that depend entirely on what the logs report? Is this failure likely to be hardware or firmware? Is there anything I can do about the failure?
- When you say the ACPI firmware incorrectly handles the error value, is that the fault of the kernel interface or the Framework firmware? What does the process to fix that look like? Is that something that Framework needs to be involved with, or is that just the realm of you and/or the other kernel contributors (or both)?
I also have a few questions of varying relevance to the current issue if you have a moment:
- Where does this information about the temp sensor and related formula come from? Just kernel source? The ACPI spec? Somewhere else? I’d like to look more into this and know for the future how to trace such a problem. Would I simply need to be familiar with the software to have known 180800 was the error value? EC_TEMP_SENSOR_ERROR seems to be defined by the Framework/ChromeOS EC though, so at least I know where to find that one.
- You reference a kernel formula and offset value, but say that here the calculation is done by the firmware (presumably exposing the result to the kernel under an identifier for pre-calculated values?). Why are there two places where ACPITZ values are interpreted? Why choose to implement one over the other, particularly as it pertains to Framework?
- Why does the kernel KELVIN_TO_CELSIUS_OFFSET differ from the one used by the Framework ACPI firmware?
Additionally (intentionally or otherwise) I now have an answer to my previous question about why the ACPI temp sensors all report values in 1 degree increments that always end in .8. I suppose only having single degree precision is the consequence of using a single byte, but this leads me to more questions…
- Why report values with a decimal component if a single degree is the most precise the exposed reading will be?
- Why .8? Is it just a quirk of the necessary offset?
I realize this is a lot of questions, so I understand if you can’t answer all of them. Thanks for your time regardless.
Here are a few recordings of the EC log from ectool console (Dustin’s fork) while bad values were being reported:
console log 1
[618483.934400 SB-SMI: Mailbox transfer timeout]
[618483.935600 SB-RMI Error: 4]
[618486.962800 SB-SMI: Mailbox transfer timeout]
[618486.964200 SB-RMI Error: 4]
[618493.918500 Battery 65% (Display 65.4 %) / 8h:33 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[618517.618900 HC 0x0115 err 1]
[618541.867300 Battery 65% (Display 65.3 %) / 8h:11 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[618578.763400 Battery 65% (Display 65.2 %) / 8h:22 to empty]
[618583.327400 HC 0x0115 err 1]
[618624.463600 Battery 65% (Display 65.1 %) / 8h:10 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[618648.383300 HC 0x0115 err 1]
[618671.599000 Battery 65% (Display 65.0 %) / 9h:4 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[618706.758400 Battery 65% (Display 64.9 %) / 8h:5 to empty]
[618712.375000 HC 0x0115 err 1]
[618756.670700 Battery 65% (Display 64.8 %) / 8h:20 to empty]
PORT80: F022
PORT80: F90E
[618776.392800 HC 0x0115 err 1]
[618801.555500 Battery 65% (Display 64.7 %) / 7h:8 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[618839.508100 HC 0x0115 err 1]
[618840.722900 Battery 65% (Display 64.6 %) / 7h:5 to empty]
[618859.032700 HC 0x0000]
[618876.061500 Battery 65% (Display 64.5 %) / 8h:35 to empty]
PORT80: F022
PORT80: F028
PORT80: F90E
[618902.470300 HC 0x0115 err 1]
[618903.688900 Battery 64% (Display 64.5 %) / 9h:17 to empty]
[618928.253900 Battery 64% (Display 64.4 %) / 8h:24 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[618966.407700 HC 0x0115 err 1]
[618968.644000 Battery 64% (Display 64.3 %) / 7h:30 to empty]
[619017.131300 Battery 64% (Display 64.2 %) / 8h:0 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[619032.173500 HC 0x0115 err 1]
[619053.699500 Battery 64% (Display 64.1 %) / 8h:27 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[619093.333400 HC 0x0115 err 1]
[619102.869300 Battery 64% (Display 64.0 %) / 8h:41 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[619146.017100 Battery 64% (Display 63.9 %) / 8h:4 to empty]
[619154.451000 HC 0x0115 err 1]
[619197.194500 Battery 64% (Display 63.8 %) / 8h:56 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[619215.619800 HC 0x0115 err 1]
[619230.064600 Battery 64% (Display 63.7 %) / 7h:25 to empty]
[619265.434600 Battery 64% (Display 63.6 %) / 6h:7 to empty]
PORT80: F90D
PORT80: F90E
[619277.571100 HC 0x0115 err 1]
[619307.581200 Battery 64% (Display 63.5 %) / 7h:25 to empty]
PORT80: F022
PORT80: F028
PORT80: F90E
[619338.732200 HC 0x0115 err 1]
[619347.223400 Battery 63% (Display 63.4 %) / 6h:26 to empty]
[619378.331700 Battery 63% (Display 63.3 %) / 7h:29 to empty]
PORT80: F90E
[619400.156800 HC 0x0115 err 1]
[619423.489500 Battery 63% (Display 63.2 %) / 8h:30 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[619461.266600 HC 0x0115 err 1]
[619471.665000 Battery 63% (Display 63.1 %) / 8h:19 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[619515.315900 Battery 63% (Display 63.0 %) / 7h:55 to empty]
[619522.982600 HC 0x0115 err 1]
[619542.409400 Battery 63% (Display 62.9 %) / 6h:27 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[619584.337100 HC 0x0115 err 1]
[619584.812600 Battery 63% (Display 62.8 %) / 7h:46 to empty]
[619632.479500 Battery 63% (Display 62.7 %) / 7h:29 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[619645.840600 HC 0x0115 err 1]
[619668.141500 Battery 63% (Display 62.6 %) / 7h:41 to empty]
PORT80: F90E
[619709.766200 HC 0x0115 err 1]
[619712.762900 Battery 63% (Display 62.5 %) / 8h:10 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[619762.690800 Battery 63% (Display 62.4 %) / 8h:49 to empty]
[619773.193700 HC 0x0115 err 1]
[619794.806800 Battery 62% (Display 62.4 %) / 6h:55 to empty]
[619805.844600 Battery 62% (Display 62.3 %) / 6h:38 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[619834.198400 Battery 62% (Display 62.2 %) / 6h:37 to empty]
[619835.135700 HC 0x0115 err 1]
[619877.352800 Battery 62% (Display 62.1 %) / 7h:8 to empty]
PORT80: F022
PORT80: F022
PORT80: F90E
[619898.486900 HC 0x0115 err 1]
[619924.529400 Battery 62% (Display 62.0 %) / 7h:58 to empty]
PORT80: 3C01
[619943.738900 HC 0x0002]
[619943.741200 HC 0x000b]
console log 2
[619773.193700 HC 0x0115 err 1]
[619794.806800 Battery 62% (Display 62.4 %) / 6h:55 to empty]
[619805.844600 Battery 62% (Display 62.3 %) / 6h:38 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[619834.198400 Battery 62% (Display 62.2 %) / 6h:37 to empty]
[619835.135700 HC 0x0115 err 1]
[619877.352800 Battery 62% (Display 62.1 %) / 7h:8 to empty]
PORT80: F022
PORT80: F022
PORT80: F90E
[619898.486900 HC 0x0115 err 1]
[619924.529400 Battery 62% (Display 62.0 %) / 7h:58 to empty]
PORT80: 3C01
[619943.738900 HC 0x0002]
[619943.741200 HC 0x000b]
PORT80: 3C08
PORT80: F022
PORT80: F90E
PORT80: F90E
[619961.314300 HC 0x0115 err 1]
[619962.649300 Battery 62% (Display 61.9 %) / 6h:52 to empty]
[619971.750100 HC 0x0002]
[619971.754500 HC 0x000b]
[619973.328000 HC 0x0002]
[619973.332100 HC 0x000b]
[619995.016400 Battery 62% (Display 61.8 %) / 7h:38 to empty]
[620012.622300 HC 0x0002]
[620012.626600 HC 0x000b]
PORT80: F022
PORT80: F90E
PORT80: F90E
[620017.051400 HC 0x0002]
[620017.055400 HC 0x000b]
[620017.888000 HC 0x0002]
[620017.893400 HC 0x000b]
[620023.313200 HC 0x0115 err 1]
[620034.908500 Battery 62% (Display 61.7 %) / 6h:37 to empty]
[620072.329200 Battery 62% (Display 61.6 %) / 6h:20 to empty]
PORT80: F022
PORT80: F022
PORT80: F90E
[620086.851400 HC 0x0115 err 1]
[620116.233200 Battery 62% (Display 61.5 %) / 7h:16 to empty]
PORT80: F022
PORT80: F022
PORT80: F90E
[620144.082300 Battery 62% (Display 61.4 %) / 6h:10 to empty]
[620151.641100 HC 0x0115 err 1]
[620187.217200 Battery 62% (Display 61.3 %) / 6h:48 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[620205.767900 Battery 61% (Display 61.3 %) / 6h:39 to empty]
[620215.115400 HC 0x0115 err 1]
[620227.097300 Battery 61% (Display 61.2 %) / 7h:6 to empty]
PORT80: F90D
[620269.744100 Battery 61% (Display 61.1 %) / 7h:26 to empty]
[620277.285900 HC 0x0115 err 1]
[620300.102100 Battery 61% (Display 61.0 %) / 6h:35 to empty]
PORT80: 3C01
PORT80: F022
PORT80: F90E
PORT80: F90E
[620329.834700 HC 0x0002]
[620329.837100 HC 0x000b]
[620339.108500 HC 0x0115 err 1]
[620341.247100 Battery 61% (Display 60.9 %) / 7h:1 to empty]
[620376.872500 Battery 61% (Display 60.8 %) / 6h:0 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[620400.985500 HC 0x0115 err 1]
[620417.280400 Battery 61% (Display 60.7 %) / 6h:51 to empty]
[620448.373200 Battery 61% (Display 60.6 %) / 6h:40 to empty]
PORT80: F90E
PORT80: F022
PORT80: F90E
[620464.019000 HC 0x0115 err 1]
[620479.803200 HC Suppressed: 0x97=342 0x98=144 0x113=0 0x103=0 0x115=58 0x2b=0 0x67=0 0x121=0]
[620487.262000 Battery 61% (Display 60.5 %) / 6h:18 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[620527.828000 HC 0x0115 err 1]
[620532.176500 Battery 61% (Display 60.4 %) / 6h:50 to empty]
[620559.548300 Battery 61% (Display 60.3 %) / 6h:31 to empty]
[620570.541900 HC 0x0002]
[620570.546000 HC 0x000b]
PORT80: F022
PORT80: F90E
[620588.836000 HC 0x0002]
[620588.841500 HC 0x000b]
[620589.891300 HC 0x0115 err 1]
[620590.052000 HC 0x0002]
[620590.056400 HC 0x000b]
[620600.414800 Battery 61% (Display 60.2 %) / 7h:7 to empty]
[620611.202000 Battery 60% (Display 60.2 %) / 6h:49 to empty]
[620638.591300 Battery 60% (Display 60.1 %) / 5h:50 to empty]
PORT80: F022
PORT80: F90D
PORT80: F90E
[620653.640200 HC 0x0115 err 1]
[620672.382000 HC 0x0002]
[620672.386200 HC 0x000b]
[620673.677000 Battery 60% (Display 60.0 %) / 5h:38 to empty]
[620702.777200 Battery 60% (Display 59.9 %) / 6h:23 to empty]
PORT80: F90D
PORT80: F90E
PORT80: F90E
[620715.183000 HC 0x0115 err 1]
[620738.148500 Battery 60% (Display 59.8 %) / 6h:11 to empty]
PORT80: F022
PORT80: F022
PORT80: F90E
[620776.578400 HC 0x0115 err 1]
[620780.829100 Battery 60% (Display 59.7 %) / 7h:5 to empty]
[620822.206000 Battery 60% (Display 59.6 %) / 6h:45 to empty]
PORT80: F022
PORT80: F90E
PORT80: F90E
[620840.335300 HC 0x0115 err 1]
[620853.339900 Battery 60% (Display 59.5 %) / 6h:24 to empty]
[620892.261400 Battery 60% (Display 59.4 %) / 6h:38 to empty]
PORT80: F90D
PORT80: F022
PORT80: F90E
[620906.297100 HC 0x0115 err 1]
[620924.292100 HC 0x0002]
[620924.296300 HC 0x000b]
Running the command several times shows a repeating sequence of these lines appended to the log:
[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]
Occasionally interspersed with some other lines like these:
PORT80: F022
PORT80: F90D
PORT80: F90E
[620653.640200 HC 0x0115 err 1]
What do these logs mean? I’ll take a look through the docs (or the source) to do a little discovery myself when I have some time in the next few days, but I’d like some input from someone more knowledgeable.
P.S. Can we please get .txt as an authorized extension? Working with large pasted blocks of text is cumbersome and frustrating.
Most likely by looking at the logs and EC source code.
I have no idea about the details and solution.
It’s purely an ACPI firmware issue. The fix needs to come from the ACPI supplier, going through Framework.
The kernel can’t do anything about it.
This information comes from the EC API headers.
I recognized this value because I wrote the hwmon driver for the CrOS EC, which is completely unrelated to this specific issue, though.
The ACPI firmware exposes these readings under standard ACPI interfaces so it works everywhere.
But for example it misses the labels for the sensors, which my driver also exposes.
And there is data for which no standard ACPI interfaces may exist, so a dedicated driver makes sense.
Probably because the exact value really does not matter, and maybe some interface somewhere in the chain only supports one decimal digit.
In the kernel driver there is a preexisting constant and conversion function which is used for many drivers.
I expect the same to be true for the conversion in ACPI.
It could be rounded but it doesn’t really matter.
Yes, it’s an artifact of the kelvin offset constant.
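To see the artifact concretely, here is a small Python sketch that uses the formula and the two offsets quoted earlier in the thread (273200 for the firmware, 273150 for the kernel). It collects the sub-degree remainder for every byte value that maps to a non-negative temperature; the helper name is made up.

```python
def millicelsius(x, kelvin_offset):
    """(x + 200) * 1000 - offset, as quoted earlier in the thread."""
    return (x + 200) * 1000 - kelvin_offset


# x = 74 is the first byte value that lands at or above 0 °C with either offset.
firmware_fractions = {millicelsius(x, 273200) % 1000 for x in range(74, 256)}
kernel_fractions = {millicelsius(x, 273150) % 1000 for x in range(74, 256)}

print(firmware_fractions)  # {800}
print(kernel_fractions)    # {850}
```

Because the byte only moves in whole kelvins, the fractional part is fixed by the offset alone: always .8 with 273200, and it would be .85 with the kernel’s 273150.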
I hope to have answered all of them in a useful way.
If you have more questions or ideas let me know.
“Port 80” is a (emulated?) debug IO port.
See io - What does the 0x80 port address connect to? - Stack Overflow
Otherwise I don’t really know and also would need to look at the EC source.
I hope to have answered all of them in a useful way.
If you have more questions or ideas let me know.
Yes, that was informative and helpful. Thanks.
Most likely by looking at the logs and EC source code.
I’ll start digging through the source for the EC and ectool when I have time then.
The fix needs to come from the ACPI supplier, going through Framework.
Sounds like Framework would have to chase the fix for this then. I guess I’ll create a support ticket or something when I have more info.
I wrote the hwmon driver for the CrOS EC
Nice. I was not aware of this driver; it looks useful. Unfortunately it seems my kernel was not shipped with it though, so I guess I’ll be compiling it soon.
The ACPI firmware exposes these readings under standard ACPI interfaces so it works everywhere.
But for example it misses the labels for the sensors, which my driver also exposes.
And there is data for which no standard ACPI interfaces may exist, so a dedicated driver makes sense.
Ah, so there are two interfaces.
I found some interesting information in the message thread about your v2 patches while looking into the driver you wrote. You’d have already read it, but I’ll reproduce it here for completeness’ sake.
Stephen Horvath:
Oh I see, I haven’t played around with the temp sensors until now, but I
can confirm the last temp sensor (cpu@4c / temp4) will randomly (every
~2-15 seconds) return EC_TEMP_SENSOR_ERROR (0xfe).
Unplugging the charger doesn’t seem to have any impact for me.
The related ACPI sensor also says 180.8°C.
I’ll probably create an issue or something shortly.
- [v2,1/2] hwmon: add ChromeOS EC driver - Patchwork
(corroborated by Guenter Roeck in the following message as well)
This matches my experience, so I guess this is a known issue. Checking with ectool temps all while the problem occurs reports Sensor 3 error, so at least the EC handles the error even if ACPI doesn’t.
Would you happen to know what the status of his plan to create an issue is? It sounds to me like he’s referring to reporting this to Framework, so perhaps they’re already aware. If so, following or joining whatever existing effort there may be to track down the issue sounds beneficial.
It will only be part of v6.11, so no kernel shipped with it yet.
Backporting it will be a bit annoying because it also requires new utility functions and the MFD bits.
Both the ACPI firmware and the Linux driver read the data from the EC through the same interface. Same for ectool.
(I’m decently sure)
Good find, I forgot about that one.
No idea what became of the plan.
It will only be part of v6.11, so no kernel shipped with it yet.
Ah, of course. I probably don’t plan to compile and run an rc kernel, so I guess I’ll get it at release.
Same for ectool.
(I’m decently sure)
Hm. I think I don’t understand how the ACPI firmware is related to the EC. I imagined two interfaces for the hardware sensors: an EC interface and an ACPI interface, and thought the former correctly detected the error where the latter did not because ectool (EC interface) reported the error where acpi (ACPI interface) did not.
It seems like you are saying that the EC is the root source of the measurement that then exposes the temp sensors (including the error value) to the ACPI firmware that incorrectly reports the error value as a valid temp measurement.
Something like this:
hardware probe ─> EC ─> ACPI firmware
| └─> interface (read by Linux kernel ACPI driver)
└─> interface (read by ectool and the ec_* drivers)
What is the correct relationship?
No idea what became of the plan.
Do you know if he has an account here that I could message him to ask? If not, I suppose I’ll email and ask? Unless it’s better if you do it.
Exactly.
ACPI is only a bytecode definition that can be used to map standard datastructures and interfaces to the concrete hardware implementations on the platform.
As the sensor is hooked up to the EC, the ACPI functions also read the EC memory map.
Sounds good. You could even respond to the original mail.
Exactly.
Great.
Sounds good.
Will do.
Looking into the EC log has not been helpful. I wasn’t sure what to make of the 4 digit groupings of the port 80 / POST codes, but no matter how I parse them, most are not on the list shared by NRP, which may well be out of date and/or not apply to the AMD boards even if it was current. One post suggested that the codes may be LSB, but that didn’t seem to help here. It’s entirely possible I’m not reading them correctly, but I don’t know what else to look for.
POST codes aside, none of the other lines were particularly meaningful to me either.
[xxxxxx.xxxxxx SB-SMI: Mailbox transfer timeout]
[xxxxxx.xxxxxx SB-RMI Error: 4]
Something AMD platform related (kernel driver docs for the interface); doesn’t look relevant.
[xxxxxx.xxxxxx HC 0x0115 err 1]
A “host command” related to reading and deleting PD logs.
[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]
Produced on each run of ectool console to fetch the EC version and read protocol info.
Additionally, reading the logs when there was no error, waiting a few seconds for a sensor read to fail, and rerunning the command showed that there were no particular log entries generated by the failed read, as demonstrated by the log ending in
[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]
[xxxxxx.xxxxxx HC 0x0002]
[xxxxxx.xxxxxx HC 0x000b]
If the POST codes are written to the log out of chronological order with other entries (due to polling or something), then it’s possible my “window test” wouldn’t have captured the log, but it happens so often that it should have shown up earlier in the logs.
Perhaps the EC only emits log entries for repeated failed reads after a particular interval to prevent flooding the buffer, but I haven’t caught anything that stood out to me.
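In case it helps anyone sift through similar logs, here is a rough Python sketch of the filtering described above. It tags each line with one of the categories already identified in this post and returns whatever is left over. The labels and regular expressions are guesses based only on the log excerpts pasted in this thread, not on the EC source.

```python
import re
from collections import Counter

# (label, pattern) pairs for the line types already accounted for above.
KNOWN_LINES = [
    ("ectool console overhead", re.compile(r"HC 0x000[2b]\]")),
    ("PD log host command", re.compile(r"HC 0x0115 err 1\]")),
    ("AMD SB-SMI/SB-RMI", re.compile(r"SB-(SMI|RMI)")),
    ("battery status", re.compile(r"Battery \d+%")),
    ("port 80 code", re.compile(r"^PORT80: ")),
]


def classify(line):
    """Return the label of the first matching known line type, else None."""
    for label, pattern in KNOWN_LINES:
        if pattern.search(line):
            return label
    return None


def summarize(log_text):
    """Count the known line types and collect the lines nothing explains."""
    counts = Counter()
    unexplained = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        label = classify(line)
        if label is None:
            unexplained.append(line)
        else:
            counts[label] += 1
    return counts, unexplained
```

Running summarize over a saved console dump leaves only the lines that none of the explanations above cover, which is where a log entry for a failed sensor read would have to show up if the EC emitted one.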
Unless someone else has another idea, I think all that’s left for me to do is to see if Stephen Horvath has anything important to add and then contact Framework support.
Hi Guest68, I just got your email.
Sorry, I never got a chance to create an issue. I got distracted with trying to learn how the EC works and maybe fix it myself, and then forgot about it altogether.
I did however start writing a draft in my notes app if it helps:
Hi, the sensor cpu@4c seems to intermittently return EC_TEMP_SENSOR_ERROR (0xfe). The same sensor using ACPI also returns 180.8°C while this is happening, which seems very wrong.
It seems to occur more frequently while unplugged from the charger, but it can occur while charging.
I’m not too experienced with the CrOS EC codebase, but can’t find anything obvious in the code that would cause it. So I’m hoping it’s not an interference/hardware issue, but I wouldn’t be surprised if it is.
This also seems to occur for @t-8ch while discussing a Linux hwmon driver.
Command Outputs:
sensors acpitz-acpi-0
acpitz-acpi-0
Adapter: ACPI interface
local_f75303@4d: +32.8°C
cpu_f75303@4d: +32.8°C
ddr_f75303@4d: +30.8°C
cpu@4c: +180.8°C
ectool temps all
--sensor name -------- temperature -------- fan speed --
local_f75303@4d 307 K (= 34 C) 0%
cpu_f75303@4d 306 K (= 33 C) 0%
ddr_f75303@4d 304 K (= 31 C) -1%
Sensor 3 error
A Python script using my CrOS_EC_Python.
Temp Sensor 0: 308K (35°C)
Temp Sensor 1: 308K (35°C)
Temp Sensor 2: 306K (33°C)
Temp Sensor 3: Error 0xfe
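For catching the failure from a script, a minimal Python sketch that picks the failed sensors out of text shaped like the ectool temps all output quoted above. The line format is assumed from that single sample only, so other ectool builds may print something different; the function name is invented.

```python
import re

# "<name> <kelvin> K (= <celsius> C) ..." as in the sample output above.
READING = re.compile(r"^(\S+)\s+(\d+)\s*K\s*\(=\s*(-?\d+)\s*C\)")
# "Sensor <index> error" as in the sample output above.
ERROR = re.compile(r"^Sensor (\d+) error")


def parse_ectool_temps(output):
    """Return ({sensor name: kelvin}, [indices of sensors reported as errors])."""
    readings = {}
    failed = []
    for raw in output.splitlines():
        line = raw.strip()
        reading = READING.match(line)
        if reading:
            readings[reading.group(1)] = int(reading.group(2))
            continue
        error = ERROR.match(line)
        if error:
            failed.append(int(error.group(1)))
    return readings, failed
```

Feeding it the sample above would give three readings and [3] as the failed list, which lines up with the 180.8°C value on the ACPI side.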