FTR and FTH and stability

Hi,

FTR (Freeze then Reboot) and FTH (Freeze then Halt) are terms that have been used to describe hard-to-reproduce stability problems on FW13 and FW16 AMD mainboards.

For the FW16, we have discovered that power draw can peak at 450 W for very short periods. Obviously, when this happens, the 180 W PSU cannot supply that, so the FW16 dips into the battery for the extra power needed.

Many other mainboards use capacitors on the mainboard to supply those peaks, instead of calling on the battery.

This makes me ask a question that I don’t know the answer to.
What is the reaction time of a battery compared to a capacitor when faced with instantaneous peak power draws?

My thought is that if a battery is slower, could this result in mini brownouts and thus maybe a source of instability?

Note: the above is speculation, hoping for someone to point me to some factual evidence either way.

It should be noted that FTH and FTR have been seen on fairly idle systems, say, just watching a YouTube video, so power spikes seem an unlikely cause.

How often is this happening to you? I haven’t had any freezes on my 16 in a long time that I can recall.

I get them about once a month. Last FTR was yesterday, with no PSU connected, on battery power only.

Come to think of it, I do get these freezes on occasion. I think I had one last when I disconnected the PSU to walk into another room.

I really hope hardware design flaws in the mainboard are not the reason Framework hasn’t given us a stable BIOS on this machine yet. That would mean a stable BIOS is impossible. But good luck getting Framework to admit that without a class action.


I haven’t experienced this. I assume based on the wattage this may be due to the dedicated gpu?

Capacitors, especially tiny ones that are going to be on laptop motherboards, aren’t going to support a 450 watt load for anything longer than the tiniest fraction of a second; if you were watching it on a power meter you’d literally miss it if you blinked.
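To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 450 W peak and 180 W PSU figures are from this thread; the spike durations, the 20 V rail and the 2 V allowed droop are assumptions picked purely for illustration, not measurements of the FW16.

```python
def deficit_energy_joules(peak_w: float, psu_w: float, duration_s: float) -> float:
    """Energy the PSU cannot deliver during one spike (E = P * t)."""
    return max(peak_w - psu_w, 0.0) * duration_s


def capacitance_needed_farads(energy_j: float, v_start: float, v_min: float) -> float:
    """Capacitance that releases energy_j while drooping from v_start to v_min.

    Derived from E = 0.5 * C * (v_start**2 - v_min**2).
    """
    return 2.0 * energy_j / (v_start ** 2 - v_min ** 2)


if __name__ == "__main__":
    # 450 W and 180 W come from the thread; durations and voltages are assumed.
    for duration_s in (10e-6, 100e-6, 1e-3):
        energy = deficit_energy_joules(450.0, 180.0, duration_s)
        cap = capacitance_needed_farads(energy, v_start=20.0, v_min=18.0)
        print(f"{duration_s * 1e6:7.0f} us spike: "
              f"{energy * 1e3:8.3f} mJ, about {cap * 1e6:8.0f} uF")
```

With those assumed numbers, a 1 ms spike leaves a deficit of about 0.27 J, which works out to roughly 7 mF of capacitance, while a 10 µs spike needs only about 71 µF. How long the real spikes last is the unknown that decides which case applies.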

Lots of other manufacturers also use the battery as a crutch when the power supply isn’t able to supply enough power. Dell does it, HP does it, Lenovo does it, GPD does it… If you think about it, it makes a ton of sense especially when those spikes aren’t expected to last long. The only other options are either, 1. have the machine hard power off due to excess power usage, or 2. lower the performance such that there’s not a chance of it even spiking over the current the power supply is able to provide.

Neither of those options are ideal, though I’m guessing you’ll see #2 if running the board standalone without a battery connected.


Ok, saying all of this off the top of my head, seeing it’s @James3 who has done in-depth analysis on this issue in the past… thought I’d say this now although I wanted to wait a bit longer ( early January )

So I’d been experiencing this FTR/FTH issue from BIOS 3.04 through 4.01; since 4.02 I’ve not had a single FTR/FTH

I’ve done all my usual aggressiveness to try to trigger this issue

main one for me is to have the laptop in AC or DC mode with an external panel plugged in, with youtube, system stat monitors and so on showing on the secondary screen, while the framework panel at 165Hz with VRR enabled is running a game in fullscreen ( Vulkan rendering )… yeah no FTR/FTH.

I’ve somewhat avoided saying anything because I didn’t want to get my hopes up but yeah… I can usually trigger this issue within 45 minutes… but it’s been 33 days ( up to yesterday ) and still no FTR/FTH

however… I have noticed two things

the average TDP when fully loaded is now 42W, not the usual 54 ( I think the BIOS change logs said something about this ) regardless of whether I’m in power-save, balanced_power, balanced_performance or “performance”, performance profiles ( not often I use power-save now unless I’m on DC ). I’m assuming the PTM7958 from Framework is good

the other thing I’ve noticed ( for months including previous BIOS versions but never said anything because I thought it was related to the 500MHz issue and I’m still trying to work out if it’s an amd_pstate or BIOS issue ) is that sometimes the processor ( Zen 4 ) just refuses to go above 3.8GHz and it doesn’t matter which performance profile I use. Rebooting is the only fix for this ( this is why I kept banging on about PROCHOT flags in the past as I thought it might have something to do with that ). Someone in the 4.02 BIOS thread also mentioned they’ve experienced this ( first I heard of someone else saying this )

Honestly… was fucking shocked to see it didn’t crash ( FTR/FTH ) after 2 hours of gaming on framework screen while youtube and numerous other things going off on my other screen

I vented at Framework support and haven’t responded to them since… but since BIOS 4.02 … may say something again but I want to make sure this isn’t some delusion I’m going through… but I’ve tried so damn hard to trigger this shit FTR/FTH thing… and nothing… 33 days and counting…

I’m now reconsidering Framework again ( I said in previous thread this is my last Framework product ) and that’s also because of ROCm 7.10 support on Strix Point

oh and in the previous post ( 4.02 ) I mentioned something about keyboard firmware not updating… yeah… it was due to lack of this in udev ( I’ll just show you the whole line but it’s the uaccess part )

  • Framework Laptop 16 - Keyboard

ATTR{idVendor}=="32ac", ATTR{idProduct}=="0018", ATTR{power/autosuspend}="-1", ATTR{power/control}="on", TAG+="uaccess"

  • Framework Laptop 16 - Numpad

ATTR{idVendor}=="32ac", ATTR{idProduct}=="0014", ATTR{power/autosuspend}="-1", ATTR{power/control}="on", TAG+="uaccess"

there’s another thread somewhere ( I think it’s pinned, don’t know where it is ) that also shows this uaccess thing for other Framework devices ( not because of firmware updates but something else )

It’s good to say something on this forum that doesn’t make me feel like a bloody parrot around here

My current system configuration is

Framework 16 AMD Ryzen 7 7840HS using Radeon 780M ( no dGPU )
RAM/Memory: 128GB ( 2x64GB ) - Crucial 128GB Kit (64GBx2) DDR5-5600
NVME 2280: Western Digital SN850X 2TB - Firmware 620361WD
NVME 2230: Western Digital SN770M 2TB - Firmware 731120WD
BIOS: 4.02
Gentoo Linux 2.18 ( Linux 6.19-rc1 mainline PREEMPT_RT, compiled by clang 21.1.8 march and mtune set to znver4 )
KDE Plasma 6.4.5 Wayland

==== offtopic ====

now on to step two ( not a Framework related issue ): why my bloody NCM865 Wi-Fi module fails to scan SSIDs ( currently using Intel AX210 for now )

Thank you for your detailed report.
I am still on 3.05, and need to wait for the updated EC source code before going further.
Comparing hardware.
I have the same no dGPU, sn850x, 7840HS, the rest is different.
As a side note, recently I have had similar FTR problems with an AMD server, and simply swapping out the server CPU fixed it. I think that will be the next thing to try on the laptop.

by the way, I’m not saying the issue is gone just because I don’t have this FTR/FTH issue anymore ( looks like it so far for me but still being cautious ), it’s just that I’ve been quite vocal here and in github issues about this issue, parrot style… jezuz

This might be a little inconvenient, but have you tried totally avoiding suspend / resume and just doing full power offs and power ons every time?

My FW16 has been completely stable since I started doing that, and it definitely tends to crash every 1-3 days if I am doing suspend and resume things. If it’s not too inconvenient for you to try it out, it could (1) potentially give some information about what is actually causing the issue (2) potentially give you a fully stable system, albeit with a little annoyance about how you’re able to use it.

I hardly ever use suspend.
But that being said, does anyone have a tool that scans the logs and captures reboot and suspend counts, with timestamps of when they occurred?
We might get a pattern.

When I was experiencing the FTR/FTH issue, there was a higher chance of it happening during suspend/resume and switching between AC and DC but most of the FTR/FTH issues I was getting was just on AC with suspend/resume, like @James3 I too barely suspend

I found and updated a tool. It can be cloned from GitHub:
jcdutton/lastwake.py (Wake/Suspend Time SystemD Journal Analyzer [current boot])

I updated it to recognise s2idle sleep.

to use it:
sudo ./lastwake.py

It needs sudo access to read the system logs.

So now, when I see a crash, I can go back and see whether it suspended at all during that boot cycle.

Using that tool, I have found that the last FTR I had was on a clean cold boot, with NO suspend cycles.
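For anyone who wants to cross-check a crash against suspend activity without a separate tool, here is a minimal Python sketch. It is not how lastwake.py works internally; it only assumes the kernel log contains the usual "PM: suspend entry (...)" and "PM: suspend exit" messages, and the function name is made up for this example.

```python
import re
from typing import Dict, Iterable

# Kernel messages written around each suspend cycle, for example
# "PM: suspend entry (s2idle)" and "PM: suspend exit".
ENTRY_RE = re.compile(r"PM: suspend entry \((?P<mode>[^)]+)\)")
EXIT_RE = re.compile(r"PM: suspend exit")


def count_suspend_cycles(lines: Iterable[str]) -> dict:
    """Count suspend entries (per sleep mode) and suspend exits in log lines."""
    entries: Dict[str, int] = {}
    exits = 0
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:
            mode = match.group("mode")
            entries[mode] = entries.get(mode, 0) + 1
        elif EXIT_RE.search(line):
            exits += 1
    return {"entries": entries, "exits": exits}
```

To try it, save the kernel log of the previous boot with `journalctl -k -b -1 > kernel.log` and call `count_suspend_cycles(open("kernel.log"))`. An empty `entries` dict means that boot never suspended, which is the case that matters for the cold-boot FTR described above.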

Hi,

As you might expect, I have had various conversations with FW support about FTR and FTH.
Recently they sent me a replacement FW16 7840HS mainboard to see if it helps.
Some observations I have had so far:

  1. The new mainboard keeps quite strictly to the PMF: SPL setting, set by the EC, i.e. 35W max in power save, 40W max in balanced, 45W in performance.
  2. So much so in fact, that it has not yet gone above Tctl 82 C.
    The old mainboard would happily go over 45W and hit Tctl of 100 C.
  3. The BIOS on both boards is 3.05.
  4. The EC on both boards is my own EC firmware. My EC firmware is just like stock 3.05, except it has a few extra commands to view the status of the EC better.
  5. The same SSD is used for both, so all the OS settings are identical across both mainboards.

Summary:
I think my old FW16 mainboard was kind of ignoring the PMF settings, and only taking the Tctl 100 limit into account. The old mainboard was slower in balanced vs performance so maybe it was taking PMF into account, but it would definitely hit Tctl 100 more often than the new mainboard.

So, my question to others who are seeing FTR or FTH: do you also see Tctl 100?

I normally see the FTR, FTH about once a month, so will have to wait to see if that has changed.
I don’t think FTR,FTH is related to performance, because, for me the FTR,FTH happens when the laptop is not doing much. Just playing a video normally when it happens. Never doing enough performance wise to get the fans spinning at the time of the FTR.

Test method:
sudo ectool console
and look for lines like this:
[91.780000 PMF: SPL 40000mW, sPPT 48000mW, fPPT 58000mW, p3T 170000mW, ao_sppt 0mW]

The SPL is the limit set by the EC. It appears that ryzenadj cannot change SPL, only the EC can do it. Each time you switch between Performance, Balanced or Power Save, the EC will output another line for PMF. It also changes on unplug/plug of the PSU.
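If you want to log these limits over time rather than eyeball them, the PMF lines can be pulled apart with a few lines of Python. This is only a sketch based on the single example line above; the function name is made up, and other EC firmware versions may format the line differently.

```python
import re
from typing import Dict, Optional

# One "<name> <number>mW" pair, e.g. "SPL 40000mW" or "ao_sppt 0mW".
FIELD_RE = re.compile(r"(?P<name>\w+)\s+(?P<mw>\d+)mW")


def parse_pmf_line(line: str) -> Optional[Dict[str, float]]:
    """Return {limit name: watts} for an EC console PMF line, else None."""
    if "PMF:" not in line:
        return None
    fields = line.split("PMF:", 1)[1]
    return {m.group("name"): int(m.group("mw")) / 1000.0
            for m in FIELD_RE.finditer(fields)}


if __name__ == "__main__":
    example = ("[91.780000 PMF: SPL 40000mW, sPPT 48000mW, fPPT 58000mW, "
               "p3T 170000mW, ao_sppt 0mW]")
    print(parse_pmf_line(example))
```

For the example line this gives SPL 40 W, sPPT 48 W, fPPT 58 W and p3T 170 W. Comparing the SPL value against the power reading in amdgpu_top under load is one way to see whether a board is respecting the limit.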

To load the CPU, I use:
stress-ng --cpu 0 --timeout 600s

and to see the status, I use “amdgpu_top”.

It would help if you posted your method of testing and some examples. This will help others duplicate the same tests you are performing.

I, for example, am seeing a 35w CPU cap in performance power profile. Doesn’t make any sense.

I added test instructions above.
One thought I had about FTR: it seems to happen more when playing videos.
The most exercised component for video playback is of course RAM, because it is an APU and main RAM is used and changed for each video frame.
So, maybe FTR is caused by a bitflip on the RAM chip, or on the path between RAM and CPU, that goes undetected.

As the FTR is quite rare, and if the problem is RAM or path-to-RAM related, one would probably need to run “memtest86” for 30+ days and wait for it to fail. I don’t have a spare FW16 to test that on in order to prove my theory above.
But thinking about it, I might be able to craft a Linux program that tests RAM in the background, looking for bit flips.
For many things I do, I don’t need all the RAM, so I could leave it running in the background some of the time.
I might start with some cache‑eviction + access loops, related to rowhammer methods, to try to reproduce it.
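As a starting point, a plain fill-and-verify loop could look like the Python sketch below. It is only an illustration of the idea, with made-up function names: it runs in user space, has no control over which physical pages it gets, does no cache eviction, and is not a rowhammer test, so small buffers will mostly be read back from cache rather than from DRAM.

```python
import time
from typing import List, Sequence, Tuple


def fill_pattern(buf: bytearray, pattern: int) -> None:
    """Overwrite every byte of buf with the given 8-bit pattern."""
    buf[:] = bytes([pattern]) * len(buf)


def find_mismatches(buf: bytearray, pattern: int,
                    limit: int = 16) -> List[Tuple[int, int]]:
    """Return up to `limit` (offset, value) pairs that differ from pattern."""
    if buf == bytes([pattern]) * len(buf):  # fast path: nothing flipped
        return []
    bad = []
    for offset, value in enumerate(buf):
        if value != pattern:
            bad.append((offset, value))
            if len(bad) >= limit:
                break
    return bad


def run_pass(size_bytes: int,
             patterns: Sequence[int] = (0x00, 0xFF, 0xAA, 0x55),
             hold_s: float = 0.0) -> List[Tuple[int, int, int]]:
    """Fill, wait hold_s seconds, verify; return (pattern, offset, value)."""
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        fill_pattern(buf, pattern)
        time.sleep(hold_s)  # give a flip time to happen before reading back
        errors.extend((pattern, off, val)
                      for off, val in find_mismatches(buf, pattern))
    return errors


if __name__ == "__main__":
    print(run_pass(64 * 1024 * 1024, hold_s=1.0) or "no mismatches")
```

A much larger buffer and a long hold time make it more likely that the data really sits in DRAM between the write and the read. A dedicated tool written in C with explicit cache flushes would be a better fit for the rowhammer-style loops mentioned above.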

I have played around with various video playback. It turns out that playing 4K video, e.g. 3840 x 2160, does actually use more GPU and CPU according to “amdgpu_top”. This raises the RAM temps to 72 C and reaches the SPL 45W limit of performance mode.
So, there are times when playing a video file does actually use a lot of CPU and GPU resources.
I will take note of the resolution of the video file being played, next time I get a FTR.

Logs from the previous 7840HS mainboard. As you can see, there is an FTR roughly every month or sooner:

zgrep "an uncorrected error caused a data fabric sync flood event"  syslog*
syslog.1:2026-01-08T19:26:09.704083+00:00 name kernel: x86/amd: Previous system reset reason S5_RESET_STATUS [0x08000800]: an uncorrected error caused a data fabric sync flood event
syslog.2.gz:2025-12-31T14:02:31.389587+00:00 name kernel: x86/amd: Previous system reset reason S5_RESET_STATUS [0x08000800]: an uncorrected error caused a data fabric sync flood event
syslog.4.gz:2025-12-15T13:10:53.979246+00:00 name kernel: x86/amd: Previous system reset reason S5_RESET_STATUS [0x08000800]: an uncorrected error caused a data fabric sync flood event
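To turn those hits into intervals, something like the Python sketch below could be used. It assumes each log line starts with an ISO 8601 timestamp, as in the syslog files above (without the "syslog.N:" prefix that zgrep adds), and the function names are made up for this example.

```python
import glob
import gzip
from datetime import datetime
from typing import Iterable, List

MARKER = "an uncorrected error caused a data fabric sync flood event"


def read_lines(path: str) -> List[str]:
    """Read a plain or gzip-compressed log file into a list of lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return handle.readlines()


def sync_flood_times(lines: Iterable[str]) -> List[datetime]:
    """Return sorted timestamps of lines reporting a sync flood reset."""
    times = []
    for line in lines:
        if MARKER in line:
            stamp = line.split(" ", 1)[0]  # leading ISO 8601 timestamp
            times.append(datetime.fromisoformat(stamp))
    return sorted(times)


def gaps_in_days(times: List[datetime]) -> List[float]:
    """Days between consecutive events."""
    return [(later - earlier).total_seconds() / 86400
            for earlier, later in zip(times, times[1:])]


def scan_files(pattern: str) -> List[datetime]:
    """Collect sync flood timestamps from every file matching pattern."""
    lines: List[str] = []
    for path in glob.glob(pattern):
        lines.extend(read_lines(path))
    return sync_flood_times(lines)
```

For example, `gaps_in_days(scan_files("/var/log/syslog*"))` lists the days between resets. For the three entries above the gaps come out at about 16 and 8 days.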

I have had the new 7840HS mainboard in for a day or two now, no FTR on it so far.