[RESPONDED] Higher idle power consumption after resume from s2idle

Hi All!

I am using an R5 7640U Framework 13 on Fedora 40 with KDE. I noticed this behavior while checking what effect the upgrade from Fedora 39 to 40 had on idle power.

Setup to observe the idle power:
Three USB-C cards slotted Left Upper, Left Lower, Right Upper. One USB-A card slotted Right Lower.
Brightness 0%, Bluetooth off, WiFi on, camera and microphone physical switches off, nothing open other than a kitty session running powerstat 3 6000 -d 1

After a fresh reboot, I am seeing 3.3 W.
After a suspend, whether triggered by closing the lid, choosing “Sleep” from the application launcher, or running python amd_s2idle.py, the idle power is around 4 W.

Running python amd_s2idle.py didn’t flag anything that I haven’t already fixed, and the laptop reaches the hardware sleep state consistently.

My kernel parameters look like GRUB_CMDLINE_LINUX="rhgb quiet rtc_cmos.use_acpi_alarm=1 iomem=relaxed amd_iommu=off amdgpu.abmlevel=1"

I was able to reproduce this issue consistently. Is anyone else seeing the same thing? Or can anyone point me in the right direction to troubleshoot?
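
For anyone who wants to repeat the before/after measurement without powerstat, here is a minimal sketch that averages the battery draw reported by the kernel's power_supply sysfs class. The battery directory name (BAT1) is an assumption; check /sys/class/power_supply for the real one. Some batteries expose power_now directly, others only current_now and voltage_now, so the sketch handles both.

```python
from pathlib import Path
import time


def read_power_watts(supply_dir):
    """Return the instantaneous battery draw in watts from a power_supply sysfs directory."""
    d = Path(supply_dir)
    power_now = d / "power_now"
    if power_now.exists():
        # power_now is reported in microwatts
        return int(power_now.read_text()) / 1e6
    # Fall back to current (microamps) times voltage (microvolts)
    microamps = int((d / "current_now").read_text())
    microvolts = int((d / "voltage_now").read_text())
    return abs(microamps) * microvolts / 1e12


def average_power(supply_dir, samples=60, interval=1.0):
    """Average several readings taken `interval` seconds apart."""
    readings = []
    for _ in range(samples):
        readings.append(read_power_watts(supply_dir))
        time.sleep(interval)
    return sum(readings) / len(readings)
```

Calling average_power("/sys/class/power_supply/BAT1") once after a fresh boot and once after resume, on battery and with the same idle setup, should give two numbers that can be compared directly.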

Hi @zzach ,

Welcome to the community. I see that you’ve upgraded from Fedora 39 to Fedora 40 KDE. Can you check with a clean Fedora 40 live image to see if you get the same s2idle results as on your installed system?

Suggest comparing /sys/kernel/debug/amd_pmf/current_power_limits before and after suspend to see if identical.

If they’re identical try changing to power saver and back to balanced and see if that helps it.

If neither of those are fruitful I think manually comparing powertop top consumers is the next step.
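
A rough sketch of how the current_power_limits comparison could be automated: it parses the key:value pairs from the file contents and reports only the keys whose values differ. The function names are made up for illustration; the input format is assumed to be the single line of name:value pairs that the debugfs file prints.

```python
import re


def parse_power_limits(text):
    """Turn 'spl:15000 fppt:30000 ...' into {'spl': 15000, 'fppt': 30000, ...}."""
    # Keys may contain brackets (stt[APU]) and a space may follow the colon.
    return {k: int(v) for k, v in re.findall(r"([\w\[\]]+):\s*(\d+)", text)}


def diff_power_limits(before, after):
    """Return {key: (before_value, after_value)} for every key that changed."""
    b, a = parse_power_limits(before), parse_power_limits(after)
    return {
        k: (b.get(k), a.get(k))
        for k in sorted(b.keys() | a.keys())
        if b.get(k) != a.get(k)
    }
```

Save the output of cat /sys/kernel/debug/amd_pmf/current_power_limits (as root) before and after suspend and pass both strings to diff_power_limits; an empty dict means the limits are identical.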

before suspend
powersave
spl:15000 fppt:30000 sppt:15000 sppt_apu_only:54000 stt_min:15000 stt[APU]:0 stt[HS2]: 0

balanced
spl:28000 fppt:35000 sppt:33000 sppt_apu_only:54000 stt_min:28000 stt[APU]:0 stt[HS2]: 0

after suspend

powersave
spl:15000 fppt:15000 sppt:15000 sppt_apu_only:54000 stt_min:15000 stt[APU]:0 stt[HS2]: 0
(fppt reads 15000 when I run cat immediately after switching profile, and goes back to 30000 when I cat again)

balanced
spl:28000 fppt:35000 sppt:33000 sppt_apu_only:54000 stt_min:28000 stt[APU]:0 stt[HS2]: 0

I actually checked the powertop top consumers while troubleshooting a few days earlier and didn’t notice any meaningful differences.

I was able to reproduce it running the Fedora 40 KDE spin live from USB. Before suspend I am seeing 4.6 W; after a suspend triggered by running amd_s2idle.py, I am getting 5.2 W. The discrepancy matches my current installation (the base value is higher because of the USB-A card, I presume). Should I try GNOME on this?

What state were you in before suspend, and what state after, in the above?

And if you think it’s caused by the USB-A card, can you run your experiment with no card plugged in to confirm it?

It was in powersave if that’s what you are asking.
It gave the same values before and after suspend, which is

spl:15000 fppt:30000 sppt:15000 sppt_apu_only:54000 stt_min:15000 stt[APU]:0 stt[HS2]: 0

I also tried putting my laptop in balanced before suspending: cat the value, suspend, then cat again. The values are the same as well, which is

spl:28000 fppt:35000 sppt:33000 sppt_apu_only:54000 stt_min:28000 stt[APU]:0 stt[HS2]: 0

I didn’t think it was caused by the USB-A card; I was just saying that using the USB-A card to run Fedora 40 live gives a higher baseline idle power to subtract. The difference between before and after suspend is the same. I redid the experiment without the USB-A card anyway, and the results are consistent.

Another question would be about

If they’re identical try changing to power saver and back to balanced and see if that helps it.

Should this be identical before and after suspend? If not, why is that?

OK, it’s good that they were identical. That means it’s not a problem with the EC handling of the coefficients for cTDP that is causing this.

The reason I was asking you to switch modes back and forth was to also rule out a problem with EPP not getting programmed correctly.

It sounds like tracking down the root cause of this one is going to be a doozy!

I went back to powertop and measured it again.
This is before suspend

This is after suspend

It looks like the display backlight is the main difference. Can you check that the brightness value from sysfs is identical in both cases?

Checking /sys/class/backlight/amdgpu_bl1/brightness gave me 1, and /sys/class/backlight/amdgpu_bl1/actual_brightness gave me 0, both before and after suspend. I am guessing powertop couldn’t really figure out where the extra power consumption comes from, so it attributes it to the backlight? I don’t think the power estimates for the display backlight are accurate in either case though.
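
The brightness check above can be scripted so the before and after snapshots are easy to compare. This is only a sketch: read_backlight is a made-up helper name, and it simply reads whichever of the standard backlight attributes are present in the given directory.

```python
from pathlib import Path


def read_backlight(backlight_dir):
    """Read the integer backlight attributes that exist in a sysfs backlight directory."""
    d = Path(backlight_dir)
    names = ("brightness", "actual_brightness", "max_brightness")
    return {n: int((d / n).read_text()) for n in names if (d / n).exists()}
```

Running read_backlight("/sys/class/backlight/amdgpu_bl1") before suspend and again after resume, then comparing the two dicts with ==, shows whether the kernel-side brightness state changed.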

Yeah if the brightness really is identical between the two cases the display power consumption should be the same.

I do actually have another theory. Can you compare lspci -vv output before and after suspend? Does L1SS change for any device? If so, it points at a kernel driver or firmware bug for that device.
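
Since a raw diff of lspci -vv is noisy, a small filter that looks only at the L1 substate control line can help. The sketch below assumes the usual lspci -vv layout: each device starts with an unindented header line whose first token is the address, and the L1 PM Substates capability prints an indented L1SubCtl1: line. The function names are illustrative only.

```python
def l1ss_by_device(lspci_text):
    """Map each device address to the list of flags on its L1SubCtl1 line."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device block
            device = line.split()[0]
        elif device and line.strip().startswith("L1SubCtl1:"):
            result[device] = line.split(":", 1)[1].split()
    return result


def l1ss_changes(before_text, after_text):
    """Return {device: (flags_before, flags_after)} for devices whose L1SS flags differ."""
    before = l1ss_by_device(before_text)
    after = l1ss_by_device(after_text)
    return {
        dev: (before.get(dev), after.get(dev))
        for dev in sorted(before.keys() | after.keys())
        if before.get(dev) != after.get(dev)
    }
```

Capture the two inputs with sudo lspci -vv > before.txt and sudo lspci -vv > after.txt (root is needed for the capability details), then pass the file contents to l1ss_changes.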

diff between before suspend and after suspend

35c35
< 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
---
> 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
53c53
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
59c59
< 			Changed: MRL- PresDet- LinkState+
---
> 			Changed: MRL- PresDet- LinkState-
113c113
< 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
---
> 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
131c131
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
377c377
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
429c429
< 		Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
---
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
442c442
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
507c507
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
629c629
< 		Address: 00000000fee09000  Data: 0022
---
> 		Address: 00000000fee0b000  Data: 0023
641,643c641,643
< 		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
< 			   T_CommonMode=0us LTR1.2_Threshold=166912ns
< 		L1SubCtl2: T_PwrOn=150us
---
> 		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> 			   T_CommonMode=0us LTR1.2_Threshold=0ns
> 		L1SubCtl2: T_PwrOn=10us
648c648
< 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
---
> 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
675c675
< 		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
---
> 		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
705c705
< 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
---
> 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
727,729c727,729
< 		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
< 			   T_CommonMode=0us LTR1.2_Threshold=166912ns
< 		L1SubCtl2: T_PwrOn=150us
---
> 		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> 			   T_CommonMode=0us LTR1.2_Threshold=0ns
> 		L1SubCtl2: T_PwrOn=10us
810c810
< 	Interrupt: pin B routed to IRQ 113
---
> 	Interrupt: pin B routed to IRQ 114
842c842
< 		Address: 00000000fee02000  Data: 0023
---
> 		Address: 00000000fee08000  Data: 0022
1044c1044
< 	Interrupt: pin C routed to IRQ 114
---
> 	Interrupt: pin C routed to IRQ 59
1076c1076
< 		Address: 00000000fee03000  Data: 0023
---
> 		Address: 00000000fee00000  Data: 0020
1086c1086
< 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
---
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
1088c1088
< 	Latency: 0, Cache Line Size: 64 bytes
---
> 	Latency: 0, Cache Line Size: 1020 bytes
1092c1092
< 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
---
> 		Status: D0 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
1096c1096
< 		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
---
> 		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
1098c1098
< 			MaxPayload 128 bytes, MaxReadReq 512 bytes
---
> 			MaxPayload 16384 bytes, MaxReadReq 16384 bytes
1102,1103c1102,1103
< 		LnkCtl:	ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
< 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
---
> 		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
> 			ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
1110,1113c1110,1113
< 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
< 			 AtomicOpsCtl: ReqEn-
< 			 IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
< 			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
---
> 		DevCtl2: Completion Timeout: Unknown, TimeoutDis+
> 			 AtomicOpsCtl: ReqEn+
> 			 IDOReq+ IDOCompl+ LTR- EmergencyPowerReductionReq-
> 			 10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
1115,1117c1115,1117
< 		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
< 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
< 			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
---
> 		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance+ SpeedDis+
> 			 Transmit Margin: Unknown, EnterModifiedCompliance+ ComplianceSOS+
> 			 Compliance Preset/De-emphasis: Unknown
1138c1138
< 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
---
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
1140c1140
< 	Latency: 0, Cache Line Size: 64 bytes
---
> 	Latency: 0, Cache Line Size: 1020 bytes
1154,1156c1154,1156
< 			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
< 			MaxPayload 128 bytes, MaxReadReq 512 bytes
< 		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
---
> 			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> 			MaxPayload 16384 bytes, MaxReadReq 512 bytes
> 		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
1183,1184c1183,1184
< 		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
< 		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
---
> 		UEMsk:	DLP+ SDES- TLP+ FCP- CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
> 		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
1186c1186
< 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
---
> 		CEMsk:	RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+

I don’t really know how to interpret this diff though. Also, this time when I triggered suspend using amd_s2idle.py, the laptop stayed suspended past the end of the suspend cycle and I had to wake it manually. Then I triggered suspend again, and it was back to normal.

Can you post both files to a GitHub gist? There are definitely changes there that I don’t expect, and I suspect they are the root cause, but I need to better understand what they are.

Here they are. github gists

I did another pass of your instructions, and the files generated in that pass have _2 as a suffix.

From what you’ve shared it looks like the following has changed:

  • WiFi is not in L1.1 or L1.2 anymore
  • NVMe is not in L1.2 anymore
  • L1.2 thresholds changed
  • The root port at 08.2 isn’t in D3 after resume.

If I were to guess without looking at the code, I think the change in L1.2 thresholds is what keeps WiFi and NVMe from going into L1.2 anymore, and is the source of those problems.
I think it’s actually the same issue being discussed here: Re: [PATCH v5 4/4] PCI/ASPM: Fix L1.2 parameters when enable link state - David E. Box

That root port not being in D3 after resume is surprising; it’s supposed to be put there by this quirk: linux/arch/x86/pci/fixup.c at master · torvalds/linux · GitHub
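
To spot D-state changes like the root port one without reading the whole diff, the Power Management status line can be pulled out per device. This is a sketch that assumes the lspci -vv layout seen in the diff above, where the PM capability prints an indented line such as "Status: D3 NoSoftRst- ..."; the function name and the example address in the usage note are assumptions, not taken from the gists.

```python
import re

# Matches the indented PM status line, e.g. "\t\tStatus: D3 NoSoftRst- PME-Enable+ ..."
# The device-level "Status: Cap+ ..." line does not match because it lacks a D-state.
PM_STATUS_RE = re.compile(r"^\s+Status: (D[0-3])\b")


def d_states(lspci_text):
    """Map each device address to the D-state reported in its PM capability."""
    states = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            device = line.split()[0]
            continue
        m = PM_STATUS_RE.match(line)
        if m and device:
            states[device] = m.group(1)
    return states
```

Comparing d_states(before_text) with d_states(after_text) and printing the addresses whose value differs would show, for example, a bridge at an address like 00:08.2 going from D3 to D0.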

I think it might be because the NPU is now in the wrong state from that quirk. When the XDNA driver is loaded it should be fixed.

This is only compile tested, but see if it helps.

If it doesn’t help, can you please compare lspci again and also share a kernel log from after you’ve suspended and resumed?

Tested, and I don’t think it worked. Idle power is now 4.1 W before suspend and 4.7 W after. Also, this was my first time building a kernel with a patch; I think I did it right but am not 100% sure, so please let me know if I did anything incorrectly. (I downloaded your patch, pointed to it in the patch section of kernel.spec, and then followed the Fedora guidelines on how to build and install.)
Here are all the logs and diffs.