[TRACKING] PPD changes have no effect after suspend on AMD until reboot

Trying a few games in Linux, I noticed that I’d get really bad performance occasionally. It turned out to be because the GPU would be starved for power. I think there might be a problem with /sys/firmware/acpi/platform_profile after a suspend / resume. Since this is what power-profiles-daemon writes to on this processor, it means that PPD changes seemingly have no effect after that.

You can see this most easily running ryzenadj -i in a watch, but it’s also obvious with glxgears and mangohud (and easier to set up on a livecd):

  • Run MANGOHUD_CONFIG=gpu_core_clock,gpu_power,cpu_mhz,cpu_power vblank_mode=0 mangohud glxgears
  • On the balanced profile, you should see the GPU and CPU power sum to around 25-35W
  • Switch PPD to power save, watch the sum go down to 6-15W
  • While PPD is in power save, suspend the machine, wait a few seconds, resume
  • Wait a few seconds for it to finish boosting, and the sum to settle back down to 6-15W
  • Change PPD back to balanced or performance
  • Notice that the GPU / CPU power stays extremely low

I’m not sure this happens 100% of the time, but it’s been pretty consistent for me. I tried to rule out anything I did by trying a Fedora 39 livecd, and it reproduced there too.

With ryzenadj, you can see that the TDP limits normally change when PPD changes the profile, but after a suspend they get stuck at whatever they were when the machine was suspended – PPD changes won’t adjust them anymore.

This isn’t a PPD problem directly; you can reproduce the same thing by writing directly to /sys/firmware/acpi/platform_profile. It clears itself up after a reboot, but I haven’t found a way to fix it without rebooting.
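The direct sysfs write mentioned above can be sketched as a pair of shell helpers. This is only a minimal sketch: the function names `get_profile` and `set_profile` and the `PROFILE_FILE` / `CHOICES_FILE` override variables are made up for illustration, while the two sysfs paths are the kernel's standard ACPI platform-profile files. Writing the profile needs root.

```shell
#!/bin/sh
# Paths can be overridden (hypothetical variables) to point at test files.
PROFILE_FILE="${PROFILE_FILE:-/sys/firmware/acpi/platform_profile}"
CHOICES_FILE="${CHOICES_FILE:-/sys/firmware/acpi/platform_profile_choices}"

# Print the currently active platform profile.
get_profile() {
    cat "$PROFILE_FILE"
}

# Write a new platform profile, but only if the firmware advertises it.
set_profile() {
    want="$1"
    for choice in $(cat "$CHOICES_FILE"); do
        if [ "$choice" = "$want" ]; then
            printf '%s\n' "$want" > "$PROFILE_FILE"
            return 0
        fi
    done
    echo "unsupported profile: $want" >&2
    return 1
}
```

For example, from a root shell with the helpers sourced: `set_profile low-power && get_profile`. The profile file updates either way; the bug in this thread is that the TDP limits stop following it after a suspend.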

Has anyone else seen this? Something I’m missing?

I wonder if this is related to a bug in amdgpu which locked memory clocks to their lowest settings. I hit this in September on my 7900 XTX, and there was a patch issued for the kernel I use on that system.

Are you using PPD with support for EPP states, or the mainline one in F39?

I’m using a patched rawhide kernel now on my fw13 and haven’t noticed this.

Hm, it’s not always stuck low, though – if I suspended the machine while it was on the performance profile, the TDP limits would be stuck at whatever they were on performance.

This was with the mainline PPD. I also tried the EPP / multiple driver patch for PPD, and it still seemed to happen, though with less of an impact. It seems specific to the ACPI platform profile; EPP changes still seem to have some effect after suspend.

If this issue is reproducible it seems it may be an EC bug with state machine handling or APU ready. This is my most likely guess:

Perhaps that test in the function isn’t returning the right state after suspend and so the limits update doesn’t run.

@Kieran_Levin ^

Similar thing happening on Windows as well from my experience. Need the Framework team to address.

I observed the same problem on Manjaro with kernel 6.7-rc2 and TLP.
After a fresh boot, everything works as expected. On AC, with platform_profile set to performance (by TLP), I get a TDP of up to 35W (measured with s-tui). When I unplug AC, TLP sets the platform_profile to powersave and the maximum TDP is around 15W. When I plug AC back in, the platform_profile switches to performance and the TDP boosts up to 35W again. After a suspend, the TDP always stays at 15W whether or not AC is connected, even though TLP sets the correct platform_profile. Only a reboot resolves this behavior.

Hrm - this sounds familiar, and it’s possible I’m not noticing it because I reboot when I notice the EC is confused for other reasons (namely battery reporting: plugging and unplugging doesn’t fire the power event triggers, and things report that AC is still present when it’s not).

Wasn’t expecting a link to code! That’s awesome.

For now, I’ve been working around it by suspending for the first time when plugged in at the performance state to get it stuck on the highest cap. Using EPP states, it seems to handle power well enough even with a high TDP limit. I haven’t noticed it affect the battery life negatively, at least. I’m seeing 9-10 hours at low usage without any tuning besides the PPD patch, which is better than I expected.

Are you running with the cros_ec_lpc patch? I am wondering if there is something there with ectool that can poke whatever state isn’t clearing.

Nope, just a vanilla 6.6.2 Arch kernel (it does also reproduce on a Fedora livecd). You can also force the TDP limits to change with ryzenadj, but then they’re still stuck at whatever you force them to.

ectool console shows the messages logged from update_os_power_slider when you change the state before a suspend, and no messages logged when you change the state after a suspend, so it does seem likely that either that function or a parent is bailing out early.

Let’s get this into a ticket so we can escalate it to the appropriate team. Can you open a ticket and link to this thread, please? Once it’s there, we can grab your ectool console output.

Done, thanks!

Hi @Justin_Weiss I am just trying to reproduce this watching with amdgpu_top;

I tried the following test sequences:

Boot with power connected (profile set with PPD: Balance_Performance EPP hint; powersave governor).
Ran CIV6 GPU Bench - GFX_SCLK boosts and draw of up to 45W observed.

Exit Game ;

Unplug - confirm PPD has switched to Powersave with the Power EPP hint
Rerun CIV6 Bench - Note boost states but lower draw / clocks from GFX_SCLK (35W) - Degraded bench results

Exit Game;
Replug power - confirm shift to Balanced, and balanced_performance epp hint;
Rerun CIV6 GPU Bench - Observe higher boost states/Power Draw again.

Repeat Unplug - Suspend. Resume and re-run test plugged/unplugged.

Are you setting the global platform profile to ‘Performance’ rather than Powersave and Balance_Performance? That seems to be the delta. If so, I’ll rerun with the Performance platform profile set (which is not recommended with amd-pstate) and see if I can trigger it.

The epp hints seem to work fine, and the governor was always on powersave as recommended. The only part that doesn’t seem to work is the platform profile (/sys/firmware/acpi/platform_profile), which PPD writes to by default on this machine.

As far as I can tell, when you change the platform profile, the EC changes the TDP limits. You can see this by watching the output of ryzenadj -i (specifically PPT LIMIT FAST / PPT LIMIT SLOW), or looking for a message like [851320.018200 DC BATTERY SAVER] in ectool console (which is logged by a function in the file linked earlier in the thread).
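To watch just those two rows rather than the whole dump, a small filter like the one below may help. It is a hedged sketch: `ppt_value` is a made-up helper name, and it assumes `ryzenadj -i` prints its usual pipe-delimited table with rows such as `| PPT LIMIT FAST | <watts> | fast-limit |`. If your ryzenadj build formats its output differently, the awk field numbers will need adjusting.

```shell
#!/bin/sh
# Read `ryzenadj -i` output on stdin and print the value column of one row.
# $1 is the exact row name, e.g. "PPT LIMIT FAST" or "PPT LIMIT SLOW".
# Assumption: rows look like "| NAME | VALUE | PARAMETER |".
ppt_value() {
    awk -F'|' -v name="$1" '
        {
            key = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the name
            if (key == name) {
                val = $3
                gsub(/[ \t]/, "", val)         # strip padding around the value
                print val
                exit
            }
        }'
}
```

Typical use would be `sudo ryzenadj -i | ppt_value "PPT LIMIT FAST"`, run once before and once after changing the platform profile.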

The problem is that after a suspend, changing the platform profile no longer changes the TDP limits. The message is no longer logged, and ryzenadj will show the same TDP limits for the profile it was suspended with. You can still change the limits manually using ryzenadj. EPP hints still seem to have an effect on performance and battery life, but only within the global TDP cap, which stays at whatever it was before suspend.
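A rough way to script that check is to read the limit, switch profiles, and read it again. The sketch below is illustrative only: `read_limit`, `profile_takes_effect`, `PROFILE_FILE`, and `SETTLE_SECONDS` are hypothetical names, the two-second wait is a guess, and the awk filter assumes the same `| PPT LIMIT FAST | <watts> | fast-limit |` row layout from `ryzenadj -i`. It needs root.

```shell
#!/bin/sh
PROFILE_FILE="${PROFILE_FILE:-/sys/firmware/acpi/platform_profile}"

# Hypothetical helper: print the current fast PPT limit.
# Assumes ryzenadj prints a pipe-delimited "PPT LIMIT FAST" row.
read_limit() {
    ryzenadj -i | awk -F'|' '$2 ~ /PPT LIMIT FAST/ { gsub(/[ \t]/, "", $3); print $3; exit }'
}

# Switch to profile $1 and report whether the limit moved.
# Pass a profile different from the current one; switching to the
# same profile would report "stuck" even on a healthy machine.
profile_takes_effect() {
    before=$(read_limit)
    printf '%s\n' "$1" > "$PROFILE_FILE"
    sleep "${SETTLE_SECONDS:-2}"   # guessed delay for the EC to apply limits
    after=$(read_limit)
    if [ "$before" = "$after" ]; then
        echo "stuck at $before"
        return 1
    fi
    echo "changed: $before -> $after"
}
```

Running `profile_takes_effect low-power` right after boot and again after a suspend / resume cycle should show the difference described above, if the machine is affected.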

If you’re running the PPD patch that updates the EPP hints as well as the platform profile, it’s probably not noticeable unless you suspended while in battery saver – the TDP limits for perf / balanced are well past the point of diminishing returns :slightly_smiling_face:

Ahh, you’re running the ryzen_smu module (Leonardo Gates / Ryzen SMU · GitLab) for SMU reporting, as well as a patched kernel for the EC interface, I guess from the above comments?

Regarding PPD: indeed, the patched PPD updates the platform_profile between balanced and low-power consistently. I would need to build ryzen-smu to get the PPT limits.

What I get on F39 with a 6.6 kernel:

aenertia@emiemi-3d-ae-net-nz:~$ cat /sys/firmware/acpi/platform_profile
low-power
aenertia@emiemi-3d-ae-net-nz:~$ cat /sys/firmware/acpi/platform_profile
balanced
aenertia@emiemi-3d-ae-net-nz:~/build/framework-ec/build/bds/util$ sudo ./ectool console
Missing Chromium EC memory map.
Cannot find I2C adapter
Unable to establish host communication
Couldn't find EC
aenertia@emiemi-3d-ae-net-nz:~/build/RyzenAdj/build$ sudo ./ryzenadj -i
[sudo] password for aenertia: 
CPU Family: Phoenix
SMU_SERVICE REQ_ID:0x3
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0xe, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU BIOS Interface Version: 14
Version: v0.14.0 
init_table
SMU_SERVICE REQ_ID:0x6
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x4c0008, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x66
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x9e300000, arg1:0xf, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
failed to get /sys/kernel/ryzen_smu_drv/pm_table: No such file or directory
failed to map /dev/mem: Operation not permitted
If you don't want to change your memory access policy, you need a kernel module for this task.
We do support usage of this kernel module https://gitlab.com/leogx9r/ryzen_smu
Unable to get memory access
Unable to init power metric table: -5, this does not affect adjustments because it is only needed for monitoring.


Nope, I couldn’t get the ryzen_smu module to load. From the issues it sounds like the 7000-series APUs aren’t supported yet. ryzenadj does work, at least for the power limits, if you add iomem=relaxed to the kernel parameters, though: Ryzen 7040 Phoenix APU series support · Issue #246 · FlyGoat/RyzenAdj · GitHub
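Before blaming ryzenadj, it can be worth confirming the running kernel was actually booted with that parameter. A minimal sketch follows; `has_iomem_relaxed` is a hypothetical helper name, and the optional file argument exists only so the check can be tried against a saved command line.

```shell
#!/bin/sh
# Succeed if the kernel command line contains iomem=relaxed.
# $1 is optional and defaults to the running kernel's /proc/cmdline.
has_iomem_relaxed() {
    cmdline_file="${1:-/proc/cmdline}"
    for arg in $(cat "$cmdline_file"); do
        if [ "$arg" = "iomem=relaxed" ]; then
            return 0
        fi
    done
    return 1
}
```

For example: `has_iomem_relaxed && echo "iomem=relaxed is set" || echo "not set; ryzenadj may fail to map /dev/mem"`.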

For ectool, I just ran the build from here: _build/src · Artifacts · build linux/x64 (#883) · Jobs · Dustin L. Howett / ectool · GitLab

Worked fine without a patched kernel.

Ahh, interesting wrt the ectool binary you linked. I was just building from the framework-ec GitHub repo, and that definitely needs the kernel patch to work. I guess this one doesn’t rely on the cros_ec_lpc binding to work?

Let me reboot with your iomem=relaxed flag and see what I can find out.

Yup confirmed



@Justin_Weiss I just sent your ticket to our technical escalations team. Thanks for your report on this.
