[RESPONDED] Higher idle power consumption after resume from s2idle

Checking /sys/class/backlight/amdgpu_bl1/brightness gave me 1 and /sys/class/backlight/amdgpu_bl1/actual_brightness gave me 0, both before and after suspend. I am guessing powertop couldn’t figure out where the extra power consumption comes from, so it attributes it to the backlight? I don’t think its power estimates for the display backlight are accurate in either case, though.

Yeah if the brightness really is identical between the two cases the display power consumption should be the same.

I do actually have another theory. Can you compare lspci -vv output before and after suspend? Does L1SS change for any device? If so, it points to a kernel driver or firmware bug for that device.
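If reading a full diff by eye is too noisy, the L1SS comparison can be narrowed down with a small script. This is only a rough sketch: it assumes the two `lspci -vv` dumps were saved as text, and the function names `l1ss_by_device` and `l1ss_changes` are made up for illustration. It only looks at the `L1SubCtl1` line of each device.

```python
import re

# An unindented line in `lspci -vv` output starts a new device, for example
# "01:00.0 Network controller: ..." (optionally with a "0000:" domain prefix).
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
L1SS_RE = re.compile(r"^\s+L1SubCtl1:\s+(.*)$")


def l1ss_by_device(lspci_text):
    """Map each PCI address to its L1SubCtl1 enable flags (name -> bool)."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = match.group(1)
            continue
        match = L1SS_RE.match(line)
        if match and current is not None:
            flags = {}
            for token in match.group(1).split():
                # Tokens look like "ASPM_L1.2+" (enabled) or "ASPM_L1.2-" (disabled).
                if token[-1] in "+-":
                    flags[token[:-1]] = token.endswith("+")
            result[current] = flags
    return result


def l1ss_changes(before_text, after_text):
    """Return {address: {flag: (before, after)}} for every flag that differs."""
    before = l1ss_by_device(before_text)
    after = l1ss_by_device(after_text)
    changes = {}
    for dev in sorted(before.keys() & after.keys()):
        diff = {
            name: (state, after[dev].get(name))
            for name, state in before[dev].items()
            if state != after[dev].get(name)
        }
        if diff:
            changes[dev] = diff
    return changes
```

Calling `l1ss_changes(before_text, after_text)` with the contents of the two saved dumps returns only the devices whose L1 substate enables flipped, which is the part that matters for this theory.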

diff between before suspend and after suspend

35c35
< 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
---
> 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
53c53
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
59c59
< 			Changed: MRL- PresDet- LinkState+
---
> 			Changed: MRL- PresDet- LinkState-
113c113
< 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
---
> 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
131c131
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
377c377
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
429c429
< 		Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
---
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
442c442
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
507c507
< 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
---
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
629c629
< 		Address: 00000000fee09000  Data: 0022
---
> 		Address: 00000000fee0b000  Data: 0023
641,643c641,643
< 		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
< 			   T_CommonMode=0us LTR1.2_Threshold=166912ns
< 		L1SubCtl2: T_PwrOn=150us
---
> 		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> 			   T_CommonMode=0us LTR1.2_Threshold=0ns
> 		L1SubCtl2: T_PwrOn=10us
648c648
< 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
---
> 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
675c675
< 		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
---
> 		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
705c705
< 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
---
> 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
727,729c727,729
< 		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
< 			   T_CommonMode=0us LTR1.2_Threshold=166912ns
< 		L1SubCtl2: T_PwrOn=150us
---
> 		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> 			   T_CommonMode=0us LTR1.2_Threshold=0ns
> 		L1SubCtl2: T_PwrOn=10us
810c810
< 	Interrupt: pin B routed to IRQ 113
---
> 	Interrupt: pin B routed to IRQ 114
842c842
< 		Address: 00000000fee02000  Data: 0023
---
> 		Address: 00000000fee08000  Data: 0022
1044c1044
< 	Interrupt: pin C routed to IRQ 114
---
> 	Interrupt: pin C routed to IRQ 59
1076c1076
< 		Address: 00000000fee03000  Data: 0023
---
> 		Address: 00000000fee00000  Data: 0020
1086c1086
< 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
---
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
1088c1088
< 	Latency: 0, Cache Line Size: 64 bytes
---
> 	Latency: 0, Cache Line Size: 1020 bytes
1092c1092
< 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
---
> 		Status: D0 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
1096c1096
< 		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
---
> 		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
1098c1098
< 			MaxPayload 128 bytes, MaxReadReq 512 bytes
---
> 			MaxPayload 16384 bytes, MaxReadReq 16384 bytes
1102,1103c1102,1103
< 		LnkCtl:	ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
< 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
---
> 		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
> 			ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
1110,1113c1110,1113
< 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
< 			 AtomicOpsCtl: ReqEn-
< 			 IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
< 			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
---
> 		DevCtl2: Completion Timeout: Unknown, TimeoutDis+
> 			 AtomicOpsCtl: ReqEn+
> 			 IDOReq+ IDOCompl+ LTR- EmergencyPowerReductionReq-
> 			 10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
1115,1117c1115,1117
< 		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
< 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
< 			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
---
> 		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance+ SpeedDis+
> 			 Transmit Margin: Unknown, EnterModifiedCompliance+ ComplianceSOS+
> 			 Compliance Preset/De-emphasis: Unknown
1138c1138
< 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
---
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
1140c1140
< 	Latency: 0, Cache Line Size: 64 bytes
---
> 	Latency: 0, Cache Line Size: 1020 bytes
1154,1156c1154,1156
< 			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
< 			MaxPayload 128 bytes, MaxReadReq 512 bytes
< 		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
---
> 			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> 			MaxPayload 16384 bytes, MaxReadReq 512 bytes
> 		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
1183,1184c1183,1184
< 		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
< 		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
---
> 		UEMsk:	DLP+ SDES- TLP+ FCP- CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
> 		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
1186c1186
< 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
---
> 		CEMsk:	RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+

I don’t really know how to interpret this diff though. Also this time when I triggered suspend using amd_s2idle.py, the suspend went over the suspend cycle and I had to wake the laptop manually. Then I triggered suspend again, and it was back to normal.

Can you post both files to a GitHub gist? There are definitely changes there that I don’t expect, and I suspect they are the root cause, but I need to better understand what they are.

Here they are. github gists

I did another pass of your instructions; the files generated in that pass have _2 as a suffix.

From what you’ve shared it looks like the following has changed:

  • Wifi is not in L1.1 or L1.2 anymore
  • NVMe is not in L1.1 or L1.2 anymore
  • L1.2 thresholds changed
  • The root port at 08.2 isn’t in D3 after resume.

If I were to guess without looking at the code, I think the L1.2 thresholds changing leads to Wi-Fi and NVMe not going into L1.2 anymore, and that is the source of those problems.
I think it’s actually the same issue being discussed here: Re: [PATCH v5 4/4] PCI/ASPM: Fix L1.2 parameters when enable link state - David E. Box

That root port not being in D3 after resume is surprising; it’s supposed to be in D3 because of this quirk: linux/arch/x86/pci/fixup.c at master · torvalds/linux · GitHub

I think it might be because the NPU is now in the wrong state from that quirk. When the XDNA driver is loaded it should be fixed.

This is only compile tested, but see if it helps.

If it doesn’t help, can you please compare lspci again and also share a kernel log from after you’ve suspended and resumed?

Tested, and I don’t think it worked. Idle power is now 4.1 W before suspend and 4.7 W after. It was also my first time building a kernel with a patch; I think I did it right but I’m not 100% sure, so please let me know if I did anything incorrectly. (I downloaded your patch, pointed to it in the patch section of kernel.spec, and then followed the Fedora guidelines on how to build and install.)
Here are all the logs and diffs.

Yeah definitely didn’t work. I think we should wait out the solution to the L1SS problem first. This is more likely to be the root cause.

Cool. Thanks a lot!

Just interested in the topic: I’ve recently had the feeling that after some/many hours of usage my fw13’s idle power consumption might be higher, so the title of this topic felt like a possible explanation. So I repeated the same test.

I’m running Ubuntu 24.04 with the latest 6.8 kernel from the repos (31-generic); the desktop is GNOME with many extensions, and I ran the same command as the OP in Alacritty. I get 3.4 W before suspend and around 3.8 W after suspend.

My kernel cmdline: root=****** ro splash quiet vt.handoff=1

I get the same first 3 changes: both the MediaTek Wi-Fi and the NVMe SSD (a different model, from Sabrent/Phison) are affected in the same way.

But in my case 08.2 is never in D3, not even after boot and before any suspend. I get this both before and after suspend:

00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14eb (prog-if 00 [Normal decode])
	Subsystem: Device 0006:f111
	...
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	...

Thanks, this reaffirms it’s an issue with L1SS.

I’ve been tracking similar behaviour through the last few kernels: the variance in battery draw between a fresh boot and a couple of days without a reboot and with multiple sleep cycles. This thread has been super helpful in confirming I’m not insane.

So something you can experiment with doing to confirm it’s L1SS is to use setpci to manually change fields that have changed and then check if that helps the power consumption.
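To know what value to write, it helps to translate between the fields lspci prints and the raw 32-bit L1SubCtl1 register. Here is a rough sketch of that translation, based on my understanding of the L1 PM Substates Control 1 layout (enable bits in bits 0-3, T_CommonMode in bits 8-15, LTR1.2_Threshold value in bits 16-25 and scale in bits 29-31). The helper names `decode_ctl1` and `encode_ctl1` are made up for illustration, and the script does not touch any hardware. The capability offset differs per device, so find it in that device's lspci output before writing anything with setpci.

```python
# Enable bits of the L1 PM Substates Control 1 register, named as lspci prints them.
ENABLE_BITS = {
    "PCI-PM_L1.2": 1 << 0,
    "PCI-PM_L1.1": 1 << 1,
    "ASPM_L1.2": 1 << 2,
    "ASPM_L1.1": 1 << 3,
}

# LTR1.2_Threshold scale field -> multiplier in nanoseconds.
SCALE_NS = [1, 32, 1024, 32768, 1048576, 33554432]


def decode_ctl1(value):
    """Split a raw register value into (enable flags, LTR1.2 threshold in ns)."""
    flags = {name: bool(value & bit) for name, bit in ENABLE_BITS.items()}
    threshold_value = (value >> 16) & 0x3FF
    scale = (value >> 29) & 0x7
    if scale >= len(SCALE_NS):
        raise ValueError("reserved threshold scale: %d" % scale)
    return flags, threshold_value * SCALE_NS[scale]


def encode_ctl1(flags, threshold_ns, common_mode_us=0):
    """Build a raw register value from lspci-style fields."""
    if threshold_ns < 0:
        raise ValueError("threshold must not be negative")
    value = 0
    for name, bit in ENABLE_BITS.items():
        if flags.get(name):
            value |= bit
    value |= (common_mode_us & 0xFF) << 8
    # Use the smallest scale that represents the threshold exactly in 10 bits.
    for scale, multiplier in enumerate(SCALE_NS):
        if threshold_ns % multiplier == 0 and threshold_ns // multiplier <= 0x3FF:
            return value | ((threshold_ns // multiplier) << 16) | (scale << 29)
    raise ValueError("threshold cannot be encoded exactly: %d ns" % threshold_ns)


all_on = {name: True for name in ENABLE_BITS}
print(hex(encode_ctl1(all_on, 166912)))  # 0x40a3000f
```

The printed value is one encoding that lspci would show as all four substates enabled with LTR1.2_Threshold=166912ns, matching the "before suspend" lines in the diff above. The firmware or kernel may have picked a different value/scale pair for the same threshold, so compare against the raw register you read back rather than trusting this number blindly.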

Can you guys try 6.9? There are two interesting commits that might help the L1SS issue.

Done. It seems like the problem is solved!? :partying_face:.

I had several suspend cycles today and the idle power consumption stays at about 3.3 to 3.5 W, which until now was only the case before the first suspend of the system. Hopefully the problem is finally gone…

That’s great news!

Same thing observed here. I just tried 6.9.0-363.vanilla.fc40.x86_64 and no longer observe the idle power consumption difference before and after suspend (triggered by amd_s2idle.py). Using lspci, I am not observing the L1SubCtl1 discrepancy either. :confetti_ball:

This is the commit that should be the reason it’s fixed.
